Re: [tesseract-ocr] How to generate training images with noise

2023-10-13 Thread Shree Devi Kumar
what I understand, > tesseract requires on the order of 10K images and box files to train on. > However, unless I am missing something, what I read at > https://github.com/tesseract-ocr/tesstrain assumes the ground truth > (images + box files) already exist. > > On Fri, Oct 13, 2023 at

Re: [tesseract-ocr] How to generate training images with noise

2023-10-13 Thread Shree Devi Kumar
com/tesseract-ocr/tesstrain assumes the ground truth > (images + box files) already exist. > > On Fri, Oct 13, 2023 at 1:00 AM Shree Devi Kumar > wrote: > >> Have you looked at >> >> https://github.com/tesseract-ocr/tesstrain >> >> >> >> On

Re: [tesseract-ocr] How to generate training images with noise

2023-10-12 Thread Shree Devi Kumar
Have you looked at https://github.com/tesseract-ocr/tesstrain On Thu, Oct 12, 2023, 11:45 PM Keith Smith wrote: > Hello, > > I am trying to use tesseract to OCR the MICR line of checks (i.e. the > micr-e13b font). The training data that I found at >

Re: [tesseract-ocr] Does the checkpoint_name contain the number of iterations

2023-09-20 Thread Shree Devi Kumar
See https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#iterations-and-checkpoints On Wed, Sep 20, 2023, 2:53 AM Des Bw wrote: > I couldn't understand what the numbers on the checkpoint_names are. > I looked at this one: but clear to me. > >

Re: [tesseract-ocr] How to get the net_spec

2023-09-16 Thread Shree Devi Kumar
ectory of components from the .traineddata file. *-l* *.traineddata* *FILE*...: List the network information. On Sat, Sep 16, 2023, 2:11 PM Shree Devi Kumar wrote: > The language name headings seem to be missing from the tessdoc page for > tessdata_fast > > Please revert to an older ve

Re: [tesseract-ocr] How to get the net_spec

2023-09-16 Thread Shree Devi Kumar
The language name headings seem to be missing from the tessdoc page for tessdata_fast Please revert to an older version of page from history On Sat, Sep 16, 2023, 2:08 PM Shree Devi Kumar wrote: > > https://github.com/tesseract-ocr/tessdoc/blob/main/Data-Files-in-tessdata_best.md >

Re: [tesseract-ocr] How to get the net_spec

2023-09-16 Thread Shree Devi Kumar
, iteration=6112200, sample_iteration=6112270, null_char=284, learning_rate=0.001, momentum=0.5, adam_beta=0.999 On Fri, Sep 15, 2023, 9:50 PM Des Bw wrote: > For the last couple of days, I have been trying to train the amh data to > include some missing characters. > > I have see

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-08-09 Thread Shree Devi Kumar
Include the default fonts also in your fine-tuning list of fonts and see if that helps. On Wed, Aug 9, 2023, 2:27 PM Ali hussain wrote: > I have trained some new fonts by fine-tune methods for the Bengali > language in Tesseract 5 and I have used all official trained_text and > tessdata_best

Re: [tesseract-ocr] Tesseract training for New font/language

2023-04-01 Thread Shree Devi Kumar
Aurebesh seems to be different symbols mapped to the English alphabet rather than a new font for English, hence training would need to be for a new language rather than just fine-tuning. On Sat, Apr 1, 2023, 10:47 Ali Abedian wrote: > Hello, > > Thank you for providing the references, but I'm

Re: [tesseract-ocr] Re: Kurdish traineddata

2022-10-17 Thread Shree Devi Kumar
ilable on > > https://github.com/KurdishBLARK/KurdishOCR > > On Sun, Oct 16, 2022 at 20:59 Shree Devi Kumar > wrote: > >> Thank you for sharing information regarding successful training of >> Kurdish traineddata for Tesseract. >> >> Please also let us know

[tesseract-ocr] Re: Kurdish traineddata

2022-10-16 Thread Shree Devi Kumar
Thank you for sharing information regarding successful training of Kurdish traineddata for Tesseract. Please also let us know whether the traineddata is available for others to use. You may want to contribute to the tess_contrib repo. Let us know whether the recognition covers 0-9 digits in

Re: [tesseract-ocr] Running Tesseract 5 on Linux

2022-04-03 Thread Shree Devi Kumar
Have you tried instructions on https://tesseract-ocr.github.io/tessdoc/Installation.html On Sun, Apr 3, 2022, 22:08 'Peter Kronenberg' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > Has anyone had any luck installing Tesseract 5 on Linux? It doesn’t seem > to be available in any

Re: [tesseract-ocr] Ubuntu : Unable to locate package libleptonica-dev

2022-04-01 Thread Shree Devi Kumar
ionic main" and paste it as shown below on the next line. If you are using a different release of ubuntu, then replace bionic with the respective release name. deb http://archive.ubuntu.com/ubuntu bionic universe On Fri, Apr 1, 2022, 11:49 Shree Devi Kumar wrote: > https://packages.ubu

Re: [tesseract-ocr] Ubuntu : Unable to locate package libleptonica-dev

2022-04-01 Thread Shree Devi Kumar
https://packages.ubuntu.com/focal/libleptonica-dev On Fri, Apr 1, 2022, 11:07 polki paul wrote: > Hello, > > how to install libleptonica-dev on Ubuntu 20.04? > > > *sudo apt-get update* > *sudo apt-get install libleptonica-dev* > > > > > *Reading package lists... Done* > *Building dependency tree* > *Reading

Re: [tesseract-ocr] Tesseract 4.0 - Multiline text

2022-03-23 Thread Shree Devi Kumar
Use the hocr option. On Thu, Mar 24, 2022, 10:52 Muraliraj DK wrote: > I am not sure if you have looked at the image. What i meant on Multi line > text is when the sentence is wrapped to next line i would like to extract > as single sentence instead of 2 lines (paragraph). > > Single line is -

Re: [tesseract-ocr] compile tessract 5.0 in win10

2022-01-01 Thread Shree Devi Kumar
I have also posted in vcpkg repo for them to update the official package to 5.0.0. https://github.com/microsoft/vcpkg/issues/16019 On Sat, Jan 1, 2022, 17:20 Shree Devi Kumar wrote: > You can download windows binaries from > https://github.com/UB-Mannheim/tesseract/wiki > > >

Re: [tesseract-ocr] compile tessract 5.0 in win10

2022-01-01 Thread Shree Devi Kumar
You can download windows binaries from https://github.com/UB-Mannheim/tesseract/wiki On Sat, Jan 1, 2022, 16:54 杜德銘 wrote: > the original : > > vcpkg install tesseract:x64-windows for 64-bit. Use –head for the master > branch. > > is not 5.0, is 4.1. > > can update this command? > > reply by

Re: [tesseract-ocr] training scripts in 5.0.0

2021-12-04 Thread Shree Devi Kumar
Please see the tesstrain repo. Python version of tesstrain.sh etc have been moved there. On Sat, Dec 4, 2021, 22:37 Marco Atzeri wrote: > Hi, > > I am updating the cygwin package from 4.1.1 to 5.0.0 > and I noticed that 3 scripts > >language-specific.sh >tesstrain.sh >

Re: [tesseract-ocr] Re: Using Tesseract for Handwriting..

2021-11-21 Thread Shree Devi Kumar
Also see the Technical Information section in https://tesseract-ocr.github.io/tessdoc/ On Mon, Nov 22, 2021, 01:36 Peter Geraghty wrote: > Thank you!!! will do! > > On Sunday, November 21, 2021 at 12:51:51 AM UTC-6 shree wrote: > >> Please see https://github.com/tesseract-

Re: [tesseract-ocr] Re: Using Tesseract for Handwriting..

2021-11-20 Thread Shree Devi Kumar
Please see https://github.com/tesseract-ocr/tesstrain/wiki for detailed examples of tesseract training for handwritten texts. On Sat, Nov 20, 2021 at 11:53 AM Peter Geraghty wrote: > sorry, by word recognition, I meant word and character localization. > > On Friday, November 19, 2021 at

Re: [tesseract-ocr] Doubt about using 5.0.0-beta-20210916 before release version is available

2021-10-19 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/issues/3331#issuecomment-946532564 On Tue, Oct 19, 2021, 16:26 juan carlos hernández < juan.carlos.h.c.valen...@gmail.com> wrote: > Hello > I'm working in a project that needs OCR and we have choosed to use > Tesseract. We would like to use v5.0.0,

Re: [tesseract-ocr] What are Langdata repository given for retraining Tesseract

2021-04-15 Thread Shree Devi Kumar
Use langdata_lstm repo for LSTM training. That has larger training text. On Thu, Apr 15, 2021, 00:52 Venkatapathy S wrote: > Hi, > I want to retrain Tesseract from the scratch for a particular language(I > have read as many resources as possible, including warnings, from the > Tutorial

Re: [tesseract-ocr] What do iteration numbers mean in the train logging?

2021-04-14 Thread Shree Devi Kumar
https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#iterations-and-checkpoints Epoch size depends on your training data. If you have 1000 lines of training data, then 1 epoch is 1000 iterations. If you have 5 lines of training text, 1 epoch is 5 iterations. On Wed,
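The epoch arithmetic in the reply above can be sketched in a few lines of Python. This is only an illustration: the helper name `iterations_for_epochs` is made up here and is not part of tesseract or tesstrain, and it assumes one iteration consumes one line of training data, as the reply describes.

```python
def iterations_for_epochs(num_training_lines: int, epochs: int = 1) -> int:
    """Return the iteration count equal to `epochs` full passes over the data.

    Hypothetical helper, not a tesseract API. Assumes one iteration
    processes exactly one training line.
    """
    if num_training_lines <= 0 or epochs <= 0:
        raise ValueError("both arguments must be positive")
    return num_training_lines * epochs


print(iterations_for_epochs(1000))  # 1000 lines: 1 epoch is 1000 iterations
print(iterations_for_epochs(5))     # 5 lines: 1 epoch is 5 iterations
```

Under the same assumption, ten epochs over 1000 lines would come to 10000 iterations.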

Re: [tesseract-ocr] What is Max Iterations & Epochs in tesstrain Makefile

2021-04-14 Thread Shree Devi Kumar
See https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#lstmtraining-command-line Epoch has been recently added to the tesstrain makefile and converts to number of iterations based on amount of training data. On Wed, Apr 14, 2021, 01:36 GCP COGNEXT wrote: > Hi All, > > I

Re: [tesseract-ocr] Unable to understand Iterations?

2021-04-14 Thread Shree Devi Kumar
It has seen only 600 lines of data of which only 300 have been used for learning. Iterations are different from an epoch which is going through all training data. On Wed, Apr 14, 2021, 01:36 GCP COGNEXT wrote: > What does *At Iteration 300/600/600.* > > Let's assume I have 10k data and I
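To make the three numbers easier to inspect, a log fragment such as "At Iteration 300/600/600" could be split with a small Python sketch. The field names `learning`, `training` and `sample` are labels chosen here to follow the reply (300 lines used for learning out of 600 seen); treating the second and third numbers as separate training and sample counters is an assumption, and the function is not part of any tesseract tool.

```python
import re
from typing import NamedTuple


class IterationCounts(NamedTuple):
    learning: int  # lines actually used for learning (first number)
    training: int  # lines seen so far (second number, assumed meaning)
    sample: int    # samples visited so far (third number, assumed meaning)


_PATTERN = re.compile(r"At iteration (\d+)/(\d+)/(\d+)", re.IGNORECASE)


def parse_iteration_counts(log_line: str) -> IterationCounts:
    """Extract the three slash-separated counters from a training log line."""
    match = _PATTERN.search(log_line)
    if match is None:
        raise ValueError("no 'At iteration a/b/c' fragment found")
    return IterationCounts(*(int(group) for group in match.groups()))


counts = parse_iteration_counts("At Iteration 300/600/600.")
print(counts.learning, counts.training)
```

With the line from the question, `counts.learning` is 300 and `counts.training` is 600, which matches the reply: 600 lines seen, 300 of them used for learning.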

Re: [tesseract-ocr] Lstm training parameters

2021-04-04 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files-in-tessdata_fast.md https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files-in-tessdata_best.md On Mon, Apr 5, 2021, 00:52 Adriana Camilleri wrote: > By any chance, is there any information out there about the training >

Re: [tesseract-ocr] tesseract WIndows 10 Newbie here

2021-04-02 Thread Shree Devi Kumar
Check that the tesseract directory is added to Path so that it can be found. On Fri, Apr 2, 2021, 11:03 Gianfranco Dy wrote: > One of the requirements of alimranahmed/LaraOCR is that the Tesseract > command is accessible > > [image: Capture.PNG] > after installing

Re: [tesseract-ocr] tesseract failing on extremely simple example

2021-03-30 Thread Shree Devi Kumar
so much! > > What hyperparameters did you use for training? number of pages? epochs? > > Which model did you start with? your file seems smaller than other > eng.traineddata files. > > Thanks, > ~Marvin > > On Sun, Mar 28, 2021 at 10:16 AM Shree Devi Kumar > wrote: >

Re: [tesseract-ocr] tesseract failing on extremely simple example

2021-03-27 Thread Shree Devi Kumar
Do you have the font used in the sample? Do you only need to recognise numbers in it? On Sat, Mar 27, 2021, 16:10 Marvin Thielk wrote: > I've tried a variety of pre-processing attempts and different configs, but > this feels like it should be an easy detection task. > > I've tried with several

Re: [tesseract-ocr] Detecting language automatically

2021-03-25 Thread shree
See https://github.com/tesseract-ocr/tessdoc/blob/master/examples/OSD_example.cc //Get OSD - new code int orient_deg; float orient_conf; const char* script_name; float script_conf; api->DetectOrientationScript(&orient_deg, &orient_conf, &script_name, &script_conf); printf("\n Orientation

Re: [tesseract-ocr] pytesseract having high accuracy but performing very very slow

2021-03-25 Thread Shree Devi Kumar
Try with newer version of tesseract. On Thu, Mar 25, 2021, 13:19 Vidya Chitragar < vidya.chitra...@lucidatechnologies.com> wrote: > Hi Every one. > I am using pytesseract with tesseract-ocr version 3.05.02 for conversion > of scanned pdf document of 1000k pages to searchable pdf document but my

Re: [tesseract-ocr] Installing tesseract 5 via vcpkg

2021-03-24 Thread Shree Devi Kumar
Yes, -head doesn't work with vcpkg. You can install the dependencies via vcpkg and then build tesseract. See https://github.com/tesseract-ocr/tesseract/actions/runs/681261367/workflow for the steps On Wed, Mar 24, 2021, 19:44 Fábio Ramos wrote: > Hello, I've tried using vcpkg to install

Re: [tesseract-ocr] downgrade to last tessract alpha version tesseract 5.0.0-alpha-20201231-246-gfe61

2021-03-23 Thread Shree Devi Kumar
@AlexanderP/tesseract-debian Is there a way to use older ppa versions? On Tue, Mar 23, 2021, 13:46 Kumar Rajwani wrote: > The latest push is working fine but when image is blury or have some noise > it can't able to pass the image. it shows Detected 12 diacritics . The > previous version was

Re: [tesseract-ocr] downgrade to last tessract alpha version tesseract 5.0.0-alpha-20201231-246-gfe61

2021-03-23 Thread Shree Devi Kumar
Please report as issue in tesseract repo. On Tue, Mar 23, 2021, 13:46 Kumar Rajwani wrote: > The latest push is working fine but when image is blury or have some noise > it can't able to pass the image. it shows Detected 12 diacritics . The > previous version was working fine with my images. >

Re: [tesseract-ocr] Properly Insert OCR Into Separate Columns

2021-03-22 Thread Shree Devi Kumar
Please see the newly added table detector to the master branch https://github.com/tesseract-ocr/tesseract/pull/3330 On Mon, Mar 22, 2021, 10:53 Daniel Lu wrote: > Hi, > > I am trying to read hundreds of pages of information like the picture > below into a CSV file. For us humans, it is very

Re: [tesseract-ocr] Training Tessearct for custom data --Urgent Help Required

2021-03-20 Thread Shree Devi Kumar
avinash singh wrote: > Hello Shree, > > Thank you for your reply, > > We have used tesseract 4.0 alpha > > The Training Data is used from the below > > https://github.com/tesseract-ocr/tessdata_best > > https://tesseract-ocr.github.io/tessdoc/Data-Files.html >

Re: [tesseract-ocr] Training Tessearct for custom data --Urgent Help Required

2021-03-15 Thread shree
See attached image from a screenshot of Malayalam wiki and the OCRed text using traineddata from tessdata_best, tessdata_fast and tessdata To me it seems like recognition is 90+% correct. On Sunday, March 14, 2021 at 6:09:17 AM UTC+5:30 shree wrote: > You have not stated the vers

Re: [tesseract-ocr] Training Tessearct for custom data --Urgent Help Required

2021-03-13 Thread Shree Devi Kumar
You have not stated the version of tesseract that you are using. >We downloaded some online training data available for the language Malayalam You have not mentioned from where you got it. Are these the official traineddata files? >we found that few special characters in the language are not

Re: [tesseract-ocr] Re: To make traineddata file non-traineable

2021-02-24 Thread Shree Devi Kumar
Yes. Usage for compacting LSTM component to int: combine_tessdata -c traineddata_file On Wed, Feb 24, 2021 at 10:56 PM Jennil Thiyam wrote: > HI shree, so by running this command, the model will be in its > integer/fast version? > > On Wed, Feb 24, 2021 at 10:27 AM shree wro

[tesseract-ocr] Re: New release for tessdata_{fast,best}?

2021-02-23 Thread shree
>There is now a 4.1.0 release available for tessdata_fast, tessdata and tessdata_best. See https://github.com/tesseract-ocr/tessdata_fast/issues/26#issuecomment-780127901 @Merlijn Wajer archive.org has many books which use English with diacritics for Sanskrit (IAST). You could try the models

Re: [tesseract-ocr] Re: How to use the "latin sanskrit" language?

2021-02-23 Thread shree
> > Please try the models from https://github.com/Shreeshrii/tesstrain-Sanskrit-IAST >>> >>> -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to

[tesseract-ocr] Re: To make traineddata file non-traineable

2021-02-23 Thread shree
You can create an integer/fast version of traineddata which cannot be used as START_MODEL for further training. `combine_tessdata -c myfile.traineddata` On Monday, February 22, 2021 at 3:58:19 PM UTC+5:30 thiyam...@gmail.com wrote: > Does anyone have any idea about making the traineddata file

Re: [tesseract-ocr] Tesseract output text and symbol

2021-02-18 Thread Shree Devi Kumar
See https://tesseract-ocr.github.io/tessdoc/FAQ.html#what-page-separators-are-used-in-txt-output-by-tesseract-400 On Thu, Feb 18, 2021, 12:15 J Cassar wrote: > Good Day, > > I've used tesseract on a number of jpeg images ( see input image attached) > and it works fine as it outputs the text.

Re: [tesseract-ocr] not training on image after loading data

2021-02-05 Thread Shree Devi Kumar
haring >> this is my notebook you can see complete process in finetune 2 section. >> >> >> On Friday, February 5, 2021 at 4:55:43 PM UTC+5:30 shree wrote: >> >>> On Fri, Feb 5, 2021 at 4:44 PM Kumar Rajwani >>> wrote: >>> >>>> hi,

Re: [tesseract-ocr] not training on image after loading data

2021-02-05 Thread Shree Devi Kumar
On Fri, Feb 5, 2021 at 4:44 PM Kumar Rajwani wrote: > hi, > i have tried minus 1 and got following result > Iteration 0: GROUND TRUTH : ) @® > Iteration 0: BEST OCR TEXT : Yo > File eng.arial.exp0.lstmf line 0 : > > What's your version of tesseract? What o/s? > Without your files, it's

Re: [tesseract-ocr] not training on image after loading data

2021-02-05 Thread Shree Devi Kumar
for server... > Error: Unable to access jarfile ./ScrollView.jar > sh: 1: kill: No such process > On Friday, February 5, 2021 at 4:28:14 PM UTC+5:30 shree wrote: > >> Add the following to your lstmtraining command and see. >> --debug_interval -1 >> >> >> >

Re: [tesseract-ocr] not training on image after loading data

2021-02-05 Thread Shree Devi Kumar
Add the following to your lstmtraining command and see. --debug_interval -1 On Fri, Feb 5, 2021 at 4:05 PM Kumar Rajwani wrote: > HI, > i am trying to finetune eng.traindata as per my images i have tried to > train but all time i am stuck somewhere can you tell me how can i procced > further.

Re: [tesseract-ocr] Training tesseract, APPLY_BOXES: ... FAILURE! Couldn't find a matching blob for BENGALI language.

2021-01-28 Thread Shree Devi Kumar
For Bengali, you need to train the LSTM model. Legacy model training won't work. On Thu, Jan 28, 2021, 22:32 Boring Guy69 wrote: > > Hello i am new to tesseract. i am working on bengali language [kalpurush > font]. > I got lots of error when i make TR files. if i describe my work flow > At

Re: {EXTERNAL}[tesseract-ocr] Installing tessdata

2021-01-27 Thread Shree Devi Kumar
Please see https://tesseract-ocr.github.io/tessdoc/Data-Files.html Also the readme files in the three repos https://github.com/tesseract-ocr/tessdata_fast On Thu, Jan 28, 2021, 03:20 Peter Kronenberg wrote: > Hi, can someone help with these questions? Just trying to understand > better how

Re: [tesseract-ocr] New release for tessdata_{fast,best}?

2021-01-27 Thread Shree Devi Kumar
>The Internet Archive has switched to using Tesseract for all our OCR, I am so happy to hear this. It will be great to have the Indic languages that were marked as non-ocrable so far be converted to text correctly on Internet Archive. Is there any page with instructions to do this? Can a

Re: [tesseract-ocr] Beginner question : could not initialize tesseract, missing eng.traineddata file in tessdata

2021-01-19 Thread Shree Devi Kumar
>*wget https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata * That is not correct. You need to get the `raw` file. https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata *wget
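The difference between the two URL forms can be shown with a short Python sketch that rewrites a GitHub `blob` page URL into the corresponding `raw` file URL. The function name `github_blob_to_raw` is hypothetical, and the code assumes the path layout `/<owner>/<repo>/blob/<branch>/<file>` seen in the message above; it only swaps `blob` for `raw` and does not download anything.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(url: str) -> str:
    """Rewrite a github.com 'blob' page URL into its 'raw' file URL.

    Hypothetical helper. Expects /<owner>/<repo>/blob/<branch>/<file...>.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # segments[0] is the empty string before the leading slash.
    if parts.netloc != "github.com" or len(segments) < 6 or segments[3] != "blob":
        raise ValueError("not a github.com blob URL")
    segments[3] = "raw"
    return urlunsplit(parts._replace(path="/".join(segments)))


print(github_blob_to_raw(
    "https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata"
))
```

The `blob` URL returns an HTML page, which is why a file fetched from it cannot be loaded as traineddata; the `raw` form returns the file itself.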

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-12 Thread Shree Devi Kumar
oding string problem. I wonder if it's a problem with the unicharset > extractor? > On Monday, January 11, 2021 at 11:30:39 AM UTC-6 shree wrote: > >> Please see https://github.com/tesseract-ocr/tesseract/issues/3001 for >> updates >> >> On Saturday, January 9, 20

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-11 Thread shree
ext but I want to pass in my > own unicharset file. > On Friday, January 8, 2021 at 12:58:27 AM UTC-6 shree wrote: > >> Are any of these vertical fonts? >> >> Encoding errors could be if the characters in training text are not in >> the unicharset. >

Re: [tesseract-ocr] make training does nothing when run

2021-01-08 Thread Shree Devi Kumar
for the `tessdoc` repo. On Fri, Jan 8, 2021 at 9:05 PM Keith wrote: > Shree, > > Thank you for your reply. I should have gone to bed (it was like 2 AM my > time on a work night) instead of continuing to bang my head. > > When I saw your message this morning, I was thinking, "

Re: [tesseract-ocr] Japanese - Problems with vertical words

2021-01-08 Thread shree
ed jpn_vert >> >> https://github.com/zodiac3539/jpn_vert >> >> >> On Mon, Jun 3, 2019 at 11:31 AM Shree Devi Kumar >> wrote: >> >>> tesseract 4 has been trained on line images and hence gives better >>> results for lines, as far a

Re: [tesseract-ocr] make training does nothing when run

2021-01-08 Thread Shree Devi Kumar
>After placing the groundtruth files in a folder called *data/foo-ground-truth* inside the main *tesseract *repo folder, data/foo-ground-truth needs to be under the tesstrain folder not tesseract folder. You can use ground-truth in a different location, in that case you have to refer to it

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Shree Devi Kumar
ther 2 errors are occurring? > On Thursday, January 7, 2021 at 11:28:12 AM UTC-6 shree wrote: > >> Your training text file is only 175 lines, so the rendered image fits in >> 4 pages. You need to use a larger text if you want more pages. >> >> Also check that your fon

Re: [tesseract-ocr] Easily readable Russian not recognized in language app screenshot

2021-01-07 Thread Shree Devi Kumar
, no-noise PNGs, and what could be done about it. >> >> On Thursday, October 8, 2020 at 7:08:28 AM UTC+2 shree wrote: >> >>> Give each region of interest separately.

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Shree Devi Kumar
> On Thursday, January 7, 2021 at 11:01:55 AM UTC-6 shree wrote: > >> Old versions of tesstrain.sh used to limit training to 3 pages. Looks >> like you may have an old version in the path somewhere. >> >> On Thu, Jan 7, 2021 at 10:17 PM Kamui 7 wrote: >> >>

Re: [tesseract-ocr] Tesseract Performance

2021-01-07 Thread Shree Devi Kumar
Or you may have an old version of data/ben/checkpoints/ben_checkpoint

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Shree Devi Kumar
Old versions of tesstrain.sh used to limit training to 3 pages. Looks like you may have an old version in the path somewhere. On Thu, Jan 7, 2021 at 10:17 PM Kamui 7 wrote: > I have a script to train tesseract and I ran it on Arch Linux, Debian, and > even a docker container and they all

Re: [tesseract-ocr] Tesseract Performance

2021-01-07 Thread Shree Devi Kumar
> ranjansou...@gmail.com> wrote: > >> Hi Shree, >> >> I installed the bidi module. The error went away, but the training does >> not happen again. Please find the log and training script attached. >> FYI I am using the makefile from the master branch. Do I n

Re: [tesseract-ocr] Tesseract Performance

2021-01-07 Thread Shree Devi Kumar
hm > ModuleNotFoundError: No module named 'bidi' > Makefile:207: recipe for target 'data/ben-ground-truth/24-022.box' failed > make: *** [data/ben-ground-truth/24-022.box] Error 1 > > I should mention I double checked the 24-022.gt.txt and 24-022.tif files > and both of them are va

[tesseract-ocr] Re: Tesseract v 5.0 on Linux

2021-01-02 Thread shree
In case you can install debian packages - see https://notesalexp.org/tesseract-ocr/ On Friday, January 1, 2021 at 12:59:46 AM UTC+5:30 peter.kr...@torch.ai wrote: > > Is there a way to get Tesseract 5.0 on Linux without building it myself? > I'm on Alpine Linux. apk add only gets me 4.0 > >

Re: [tesseract-ocr] Tesseract Performance

2021-01-01 Thread Shree Devi Kumar
data (not seen by lstmtraining either for training or eval, shows an improvement over both ben and script/Bengali. To improve results further, check groundtruth transcription for any missing words, normalize the text and try with some more training data. On Fri, Jan 1, 2021 at 6:41 PM Shree Devi

[tesseract-ocr] Re: advice for OCR'ing 9-pin dot matrix BASIC code

2021-01-01 Thread shree
Please see old thread at https://groups.google.com/g/tesseract-ocr/c/ApM_TqwV7aE/m/z5jZV0I0AgAJ for link to a completed project for dot matrix On Monday, December 14, 2020 at 12:11:00 PM UTC+5:30 Keith M wrote: > Hi there, > > I've been circling a problem with OCR'ing 90-pages of 30 year old

Re: [tesseract-ocr] Tesseract Performance

2021-01-01 Thread Shree Devi Kumar
Shreeshrii, > > Can you please tell me the training command used? Also, how can I create > the graphs and these other documents? > > On Sat, 26 Dec 2020, 18:37 Shree Devi Kumar, wrote: > >> Soumik, >> >> I used your groundtruth and trained using ben as the ST

Re: [tesseract-ocr] Tesseract Performance

2020-12-24 Thread Shree Devi Kumar
>testing an unseen image, the performance was exactly the same. Can you share the image (preferably a page) and expected result? On Thu, Dec 24, 2020 at 8:36 PM Soumik Ranjan Dasgupta < ranjansou...@gmail.com> wrote: > Hi everyone, > I wanted to do fine-tune the ben.traineddata model by using

Re: [tesseract-ocr] Diacriticals Training

2020-12-14 Thread Shree Devi Kumar
Appreciate your offer to help and provide feedback as well as training data. Let me try to answer your queries: 1. > I have been using san. But was unaware that you can also use Devanagari. What is the difference? san has been trained for Sanskrit. But it is missing certain Devanagari

Re: [tesseract-ocr] tesseract-ocr for train persian language

2020-12-11 Thread Shree Devi Kumar
I don't think jTessBoxEditor supports RTL languages like Persian. You can try using tesstrain.sh On Fri, Dec 11, 2020 at 8:57 PM alireza m wrote: > hi i want to train persian language by b nazanin font but jTessBoxEditor > doesn't have b nazanin font how can i add new font??? > > -- > You

Re: [tesseract-ocr] Diacriticals Training

2020-12-11 Thread shree
nks in advance to all those training Tesseract along these lines. > > Greg > > On Thursday, December 3, 2020 at 3:16:19 AM UTC-10 shree wrote: > >> 1. git clone https://github.com/Shreeshrii/tesstrain-sanPlusMinus >> 2. cd tesstrain-sanPlusMinus >> 3. nohup make train

Re: [tesseract-ocr] Diacriticals Training

2020-12-03 Thread shree
he last command which starts training, change the TESSDATA directory to point to wherever you have the tessdata_best/san.traineddata model. On Monday, November 30, 2020 at 8:55:54 PM UTC+5:30 advoca...@gmail.com wrote: > Shree I have gone through it, but I might need proper workflow to

Re: [tesseract-ocr] Diacriticals Training

2020-11-12 Thread Shree Devi Kumar
On Thu, Nov 12, 2020 at 4:08 PM Shree Devi Kumar wrote: > Please see tesseract-ocr/tesstrain repo > > You need line images and their groundtruth text and the makefile will make >

Re: [tesseract-ocr] Diacriticals Training

2020-11-12 Thread Shree Devi Kumar
for 4.0. you can try plusminus or replace top layer type of training. For good results you need a lot of training data, eg. 5 text lines. On Thu, Nov 12, 2020, 12:21 shreyansh dwivedi wrote: > Hello shree, > Than, what is the way to train the sanskrit along with roman diacr

Re: [tesseract-ocr] Low tesseract accuracy

2020-11-11 Thread Shree Devi Kumar
Suggest you pre-process images instead of training. See https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html On Tue, Nov 10, 2020 at 12:14 PM Dinesh Yakkanti wrote: > Hello Everyone, >I am trying to build custom tesseract-ocr model. I am getting > high error rate even if i

Re: [tesseract-ocr] Diacriticals Training

2020-11-05 Thread Shree Devi Kumar
. ,. shapetable, tr etc are all files for legacy engine, 3.0x and before. It is supported in tesseract4 with --oem 0 On Thu, Nov 5, 2020, 17:14 Shree Devi Kumar wrote: > Are you trying to train for the legacy tesseract engine? > > On Thu, Nov 5, 2020, 16:46 shreyansh dwivedi > wrote

Re: [tesseract-ocr] URGENT DEADLINE: NEED HELP WITH NEW LANGUAGE, PLEASE RESPOND

2020-11-01 Thread Shree Devi Kumar
ple image. I believe we are using the legacy > engine. Does this help? > > On Saturday, October 31, 2020 at 11:15:46 PM UTC-4 shree wrote: > >> >When we use tesseract on the images without the trained language we >> receive outputs that are accurate about 50% of the time. >

Re: [tesseract-ocr] URGENT DEADLINE: NEED HELP WITH NEW LANGUAGE, PLEASE RESPOND

2020-10-31 Thread Shree Devi Kumar
>When we use tesseract on the images without the trained language we receive outputs that are accurate about 50% of the time. You haven't shared a sample image. Sometimes preprocessing the images, using a whitelist in case of limited character set can be the solution rather than training. On

Re: [tesseract-ocr] URGENT DEADLINE: NEED HELP WITH NEW LANGUAGE, PLEASE RESPOND

2020-10-31 Thread Shree Devi Kumar
Are you trying to train for the legacy tesseract engine? On Sun, Nov 1, 2020, 03:29 Cailey McVay wrote: > Hello! > I am working on a project that is trying to read borehole video depths. We > trained a new language to read these numbers called NTS. When we use > tesseract on the images without

Re: [tesseract-ocr] Fwd: Training tesseract OCR

2020-10-31 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesstrain On Sat, Oct 31, 2020 at 9:54 AM bosh sherikar wrote: > Please Reply back > > -- Forwarded message - > From: bosh sherikar > Date: Tue, Oct 13, 2020 at 10:42 PM > Subject: Training tesseract OCR > To: > > > Dear community, > > I

Re: [tesseract-ocr] add new characters

2020-10-27 Thread shree
gt;> Found AVX >> Found SSE >> Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 >> liblz4/1.9.2 libzstd/1.4.4 >> >> Many thanks again for your fast help >> >> On Saturday, October 24, 2020 at 3:12:15 PM UTC+2 shree wrote: >> >>

Re: [tesseract-ocr] add new characters

2020-10-24 Thread Shree Devi Kumar
Ray has suggested using plus-minus type of training for adding a couple of characters to the traineddata. Did you try that? Please share the training data you used (box/tiff pairs or lstmf files). I have done replace a layer training for Sanskrit. It adds the two characters you want (in addition

Re: [tesseract-ocr] training doubts

2020-10-20 Thread Shree Devi Kumar
For English, most of the time, preprocessing your images and using the official traineddata will give better results than trying to do training. For fine-tuning (https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html#fine-tuning-for-impact), what is recommended is using the existing

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-10-11 Thread Shree Devi Kumar
Tesseract will make a checkpoint, if needed, every 100 iterations, so I suggest a minimum 50-100 line images to test finetuning. Also, one of your image samples has a lot of noise on the right side. Crop all extra parts. Also for `ben` you should choose the Indic language option in tesstrain. On

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-10-10 Thread Shree Devi Kumar
have a substantial > amount of images and then process and produce the line image and ground > truth from it- will that help me in improving the detection? > > On Sunday, September 27, 2020 at 9:21:17 PM UTC+6 Grad wrote: > >> @shree thank you for the advice, it was helpful.

Re: [tesseract-ocr] Diacriticals Training

2020-10-08 Thread Shree Devi Kumar
On Mon, Sep 28, 2020 at 12:19 PM Shree Devi Kumar wrote: > I am currently running a training run based on synthetic training data for > Sanskrit to support both Devanagari script with vedic accents as well as > iAST (Roman with diacri

Re: [tesseract-ocr] PyTesseract not recognizing decimal points

2020-10-06 Thread Shree Devi Kumar
Have you tried cropping the image to remove the arrowhead to see if that improves the result? On Tue, Oct 6, 2020 at 9:42 AM Andrew wrote: > As per my question on StackOverflow: PyTesseract not recognizing decimals >

Re: [tesseract-ocr] Tesseract failing for very clear image

2020-10-05 Thread Shree Devi Kumar
Try to add a little bit of white border to image and see. Try --psm 6 On Mon, Oct 5, 2020, 11:00 Guillaume Bersac wrote: > Hello, > > ### Environment > **Tesseract Version**: > tesseract 4.0.0 > leptonica-1.76.0 > libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : > libtiff
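The two suggestions above (pad the image with white, then run with --psm 6) can be sketched as follows. This is a minimal standard-library illustration, not the list's recommended tooling: the helper names `add_white_border` and `tesseract_command` are hypothetical, the image is assumed to be an 8-bit grayscale grid held as a list of rows, and the border width of 10 pixels is an arbitrary choice. In practice an imaging library would do the padding on the real file.

```python
WHITE = 255  # 8-bit grayscale white (assumed pixel format)


def add_white_border(pixels, border=10):
    """Return a copy of a grayscale pixel grid padded with white on all sides."""
    width = len(pixels[0]) if pixels else 0
    blank_row = [WHITE] * (width + 2 * border)
    # White rows above the image.
    padded = [list(blank_row) for _ in range(border)]
    # Original rows with white columns on the left and right.
    for row in pixels:
        padded.append([WHITE] * border + list(row) + [WHITE] * border)
    # White rows below the image.
    padded.extend(list(blank_row) for _ in range(border))
    return padded


def tesseract_command(image_path, output_base="out", psm=6):
    """Build the argument list for a tesseract run with the given page segmentation mode."""
    return ["tesseract", str(image_path), output_base, "--psm", str(psm)]
```

The command list can be passed to `subprocess.run` once the padded image has been written to disk.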

Re: [tesseract-ocr] OMP_THREAD_LIMIT=1 gives improvement in 4.1 version

2020-10-01 Thread shree
Related discussion at https://github.com/tesseract-ocr/tesseract/issues/3109

Re: [tesseract-ocr] Diacriticals Training

2020-10-01 Thread Shree Devi Kumar
Please read tesseract documentation regarding lstm training by replacing a layer. On Thu, Oct 1, 2020, 11:29 shreyansh dwivedi wrote: > Hello Shree, > Firstly, thank you for looking into it. Secondly, I would be grateful if > you share the piece of code with the explanation part of how

Re: [tesseract-ocr] unable to install tesseract-ocr in RHEL

2020-09-29 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki#rhelcentosscientific-linux-fedora-opensuse-packages On Tue, Sep 29, 2020, 12:53 Yeshwant Kumar wrote: > Hi folks, > > We are having a problem in building docker image. > > > > When base image of docker is *ubuntu:focal* we are able to install

Re: [tesseract-ocr] Diacriticals Training

2020-09-28 Thread Shree Devi Kumar
I am currently running a training run based on synthetic training data for Sanskrit to support both Devanagari script with Vedic accents as well as IAST (Roman with diacritics support). I will share the traineddata with you and others who are interested to test how well it works with real life

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-09-27 Thread Shree Devi Kumar
Thank you for sharing the results of your trial with fine-tuning and getting better results with the official traineddata after pre-processing the images. Hope your notes will help other users with similar questions. On Sun, Sep 27, 2020, 20:51 Grad wrote: > @shree thank you for the adv

Re: [tesseract-ocr] Making a serachable PDF.

2020-09-25 Thread Shree Devi Kumar
Try to use a gui frontend such as gimagereader or ocrmypdf. Tesseract does not take pdf as input. On Fri, Sep 25, 2020, 12:58 Arvind Mahesh wrote: > > Complete programming noob, so please pardon my ignorance. > I really want to convert this PDF into a searchable pdf but I barely > understand

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-09-20 Thread Shree Devi Kumar
> tessedit_char_whitelist=',0123456789' > 638,997.png out > Failed to load any lstm-specific dictionaries for lang swtor!! > Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica > Warning: Invalid resolution 0 dpi. Using 70 instead. > > cat .\out.txt > 3,9,997

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-09-19 Thread Shree Devi Kumar
On Sat, Sep 19, 2020 at 10:15 PM Shree Devi Kumar wrote: > > Each of my PNG files have file names that indicate ground truth, and I > have a little script that generates ground-truth TXT files from the PNG > file names.

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-09-19 Thread Shree Devi Kumar
> Each of my PNG files have file names that indicate ground truth, and I have a little script that generates ground-truth TXT files from the PNG file names. Please review your script. I notice a number of file names ending with -2. The gt.txt files for the same also contain -2 while the image
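A script of the kind described above might look like the sketch below. It assumes that a trailing "-2" (or any "-N") in a file name only distinguishes duplicate images and is not printed in the image itself; the truncated message does not confirm this, so treat it as a guess. The function names are hypothetical, and the `.gt.txt` naming follows the tesstrain convention of one ground-truth file next to each line image. A ground truth that legitimately ends in "-N" would be mangled by this rule.

```python
import re
from pathlib import Path

# Assumption: a trailing "-<digits>" in the file stem only marks a duplicate image.
_DUPLICATE_SUFFIX = re.compile(r"-\d+$")


def ground_truth_from_name(image_path):
    """Return the text encoded in the file name, without a trailing "-N" marker."""
    stem = Path(image_path).stem
    return _DUPLICATE_SUFFIX.sub("", stem)


def write_gt_files(image_dir):
    """Write <name>.gt.txt next to every PNG in image_dir and return the new paths."""
    written = []
    for image in sorted(Path(image_dir).glob("*.png")):
        gt_path = image.with_suffix(".gt.txt")
        gt_path.write_text(ground_truth_from_name(image) + "\n", encoding="utf-8")
        written.append(gt_path)
    return written
```

Spot-checking a few generated `.gt.txt` files against their images before training is still worthwhile.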

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-09-19 Thread Shree Devi Kumar
Please share your training data so that we can test. Thanks.

Re: [tesseract-ocr] building tesseract for online hosting

2020-09-09 Thread Shree Devi Kumar
Thanks, Alex. I suggest that you also add this to tesseract documentation, tessdoc repo. On Wed, Sep 9, 2020, 23:30 Александр Поздняков wrote: > Hi. > Alternatively, use AppImage (Ubuntu >= 16.04) > 1. Download > >> wget >>

Re: [tesseract-ocr] bash: training/lstmtraining: No such file or directory during tesstutorial

2020-08-26 Thread Shree Devi Kumar
Did you install the tesseract training tools? Try the following commands: `lstmtraining --version`, `which lstmtraining`, `text2image --version`, `lstmeval --version`. On Tue, Aug 25, 2020 at 1:15 PM Theo M-Z wrote: > I followed the tesstutorial, creating base traineddata, but at this point, > the log
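The same check can be scripted. The sketch below uses only Python's standard library to look up the tools named in the message on PATH; the helper name `find_training_tools` is hypothetical, and a missing entry only suggests (it does not prove) that the training tools were not installed.

```python
import shutil

# Tool names taken from the message above.
TRAINING_TOOLS = ("lstmtraining", "text2image", "lstmeval")


def find_training_tools(tools=TRAINING_TOOLS):
    """Map each tool name to its location on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in tools}


if __name__ == "__main__":
    for name, path in find_training_tools().items():
        print(f"{name}: {path or 'NOT FOUND (training tools may not be installed)'}")
```

If a tool is reported as missing, the tutorial's `training/lstmtraining` path will not exist either unless tesseract was built from source with the training targets.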
