I followed the steps for fine-tuning Tesseract for handwriting recognition.
I have the character images and the corresponding box files. Then I
generated the .lstmf files, followed by the lstm_train.txt and
lstm_test.txt files.
However, when I launch the training with these list files, it doesn't
work. But when I put only a single path in the train and test text files,
it works perfectly: the training starts correctly.
Also, the .lstmf files themselves seem to be generated properly: I wrote a
script that trains on each file one by one, continuing from the last
checkpoint each time, and this worked for every .lstmf file.
I'm not sure if the issue is with the generation of the lstm_train.txt, or
if lstmtraining only accepts a single .lstmf file as input?
Here is the code for generating the lstm_train.txt and lstm_test.txt files:
import os
import random

input_dir = "test"
train_file = "lstm_train.txt"
test_file = "lstm_test.txt"

# List all .lstmf files
all_files = [f for f in os.listdir(input_dir) if f.endswith(".lstmf")]
random.shuffle(all_files)  # Random shuffle

# Proportion used for training (80%)
train_split = 0.8
train_count = int(len(all_files) * train_split)
train_files = all_files[:train_count]
test_files = all_files[train_count:]

# Write the train and test list files with relative paths
with open(train_file, "w", encoding="utf-8") as f_train, \
        open(test_file, "w", encoding="utf-8") as f_test:
    for f in train_files:
        relative_path = os.path.join(input_dir, f)
        f_train.write(relative_path + "\n")
    for f in test_files:
        relative_path = os.path.join(input_dir, f)
        f_test.write(relative_path + "\n")

print(f"[OK] Files '{train_file}' and '{test_file}' created with relative paths.")
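As a sanity check on the list files themselves, a minimal sketch like the following could rule out two things I have not verified. The first is Windows-style CRLF line endings, since Python's text mode on Windows writes "\n" as "\r\n". The second is listed paths that do not resolve from the directory where the training is launched. The helper name check_list_file is hypothetical, and whether either point is the actual cause is only an assumption.

```python
import os


def check_list_file(list_path):
    """Return (has_carriage_returns, missing_paths) for a training list file.

    Hypothetical helper: it only inspects the list file, it does not call
    any Tesseract tool.
    """
    # Read raw bytes so that line endings are seen exactly as stored on disk.
    with open(list_path, "rb") as fh:
        raw = fh.read()

    has_cr = b"\r" in raw

    # One path per line; blank lines are ignored.
    entries = [line.strip() for line in raw.decode("utf-8").splitlines()
               if line.strip()]

    # Relative paths are resolved against the current working directory,
    # so run this from the directory where the training is started.
    missing = [p for p in entries if not os.path.isfile(p)]
    return has_cr, missing


if __name__ == "__main__":
    for name in ("lstm_train.txt", "lstm_test.txt"):
        if os.path.isfile(name):
            has_cr, missing = check_list_file(name)
            print(f"{name}: carriage returns={has_cr}, "
                  f"missing files={len(missing)}")
```

If carriage returns do show up, opening the list files with open(train_file, "w", encoding="utf-8", newline="\n") would keep plain "\n" line endings on every platform.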
Here is an excerpt from the lstm_train.txt file:
[image: Capture d'écran 2025-06-11 095440.png]