Sorry about that, looks like it wasn't so clear at all. Anyway, I ended up
completing a script, so here it is:
```bash
lang="mt"
#pp replace original file
if [ -z "$lang" ]; then
read -r -p "Choose lang (gr/la/it/en): " lang &&
[[ $lang == gr || $lang == la || $lang == it || $lang == en ]] || exit 1
fi #x cp } deb
# b as well as presets, allow direct input > this will make case
# statements less verbose / eliminate them entirely
#check if $lang has a value first before asking <- will need to accept cmd
# parameters if it turns into a shell script
#pp add option to choose output (pdf/txt etc.)
root="$HOME/Downloads/" #root="/storage/emulated/0/Download/tmp/"
#pp cross-platform <- use $HOME env var
mkdir -p "$root"; cd "$root" || exit 1
find "$root" -name "*.png" -type f -delete
# match only names ending in .pdf (not .pdf.txt)
find "$root" -type f -name "*.pdf" > pdfs.txt
while IFS= read -r pdf; do
echo "pdf to convert: $pdf"
name="$(basename "$pdf")" # file name without the directory part
echo "converting $name to png..."
pdftoppm -png "$pdf" "$name" #-f 12 -l 17 #i# ^ cpu cores
find "$root" -type f -name "*.png" | sort > pngs.txt
case $lang in
gr)
tesseract pngs.txt "$name" -l lat+grc+ita pdf
;;
la)
tesseract pngs.txt "$name" -l lat+ita pdf
;;
it)
tesseract pngs.txt "$name" -l ita pdf
;;
en)
tesseract pngs.txt "$name" -l eng+ita pdf
;;
dz-la)
tesseract pngs.txt "$name" -l lat+eng pdf
;;
dz-gr)
tesseract pngs.txt "$name" -l grc+eng pdf
;;
mt)
tesseract pngs.txt "$name" -l ita+equ \
--tessdata-dir "$HOME/Downloads/tessdata" pdf
;;
esac
#find $root -name "*.png" -type f -delete
done <pdfs.txt
```
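The notes in the script about allowing direct input and accepting command-line parameters could be sketched roughly like this. It is only a sketch: `resolve_langs` is a hypothetical helper name, the preset table just mirrors the `case` branches above, and anything that is not a known preset is assumed to already be a valid tesseract `-l` string. It does not handle the extra `--tessdata-dir` that the `mt` preset needs.

```bash
# Sketch: map a preset name to a tesseract -l string.
# resolve_langs is a hypothetical helper, not part of tesseract.
resolve_langs() {
  case "$1" in
    gr)    echo "lat+grc+ita" ;;
    la)    echo "lat+ita" ;;
    it)    echo "ita" ;;
    en)    echo "eng+ita" ;;
    dz-la) echo "lat+eng" ;;
    dz-gr) echo "grc+eng" ;;
    mt)    echo "ita+equ" ;;
    "")    return 1 ;;      # nothing given: let the caller decide what to do
    *)     echo "$1" ;;     # direct input, e.g. "lat+eng", passed through as-is
  esac
}

# Possible use inside a script, taking the preset from the first argument:
#   langs="$(resolve_langs "${1:-}")" || { echo "usage: $0 LANG" >&2; exit 1; }
#   tesseract pngs.txt "$name" -l "$langs" pdf
```

With that, the whole `case` block would shrink to a single `tesseract` call, at the cost of not validating whatever is typed in directly.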
It's not much, but it's good enough for me to paste into a terminal (I have
also tested it on Termux).
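In the errors quoted below, the loop also tried to convert `.pdf.txt` and `.pdf.pdf` files, which look like earlier tesseract output rather than source documents. A stricter listing might look like the sketch below; `list_pdfs` is a hypothetical helper name, and skipping `*.pdf.pdf` assumes the output keeps being named after the full input file name.

```bash
# Sketch: list only files whose name ends in .pdf, skipping what looks like
# earlier OCR output (foo.pdf.pdf). list_pdfs is a hypothetical helper.
list_pdfs() {
  find "$1" -type f -name "*.pdf" ! -name "*.pdf.pdf" | sort
}

# Possible use: list_pdfs "$root" > pdfs.txt
```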
On Saturday, 14 June 2025 at 20:13:31 UTC+2 zdenop wrote:
> Is this "AI" generated?
>
> Provided errors are not output of "Steps to reproduce"
>
> Zdenko
>
>
> On Sat, 14 Jun 2025 at 15:04, jollysalmon <[email protected]> wrote:
>
>> # Steps to reproduce
>>
>> ```bash
>> tesseract pngs.txt "$name" -l ita pdf
>> ```
>>
>> # Error
>>
>> ```
>> Page 0 :
>> /storage/emulated/0/Download/tmp/CIRCOLARE-SPOSTAMENTO-CLASSI-DAL-30-09-2024-_-1D-AG-EN-_-2D-AG-EN-_-1LA-_-2LA-_-3LA.pdf-1.png
>> pdf to convert:
>> /storage/emulated/0/Download/tmp/CIRCOLARE-SPOSTAMENTO-CLASSI-DAL-30-09-2024-_-1D-AG-EN-_-2D-AG-EN-_-1LA-_-2LA-_-3LA.pdf.txt
>> Syntax Warning: May not be a PDF file (continuing anyway)
>> Syntax Error: Couldn't find trailer dictionary
>> Syntax Error: Couldn't find trailer dictionary
>> Syntax Error: Couldn't read xref table
>> Error in findFileFormatStream: truncated file
>> Error during processing.
>> pdf to convert:
>> /storage/emulated/0/Download/tmp/CIRCOLARE-SPOSTAMENTO-CLASSI-DAL-30-09-2024-_-1D-AG-EN-_-2D-AG-EN-_-1LA-_-2LA-_-3LA.pdf.pdf
>> converting
>> CIRCOLARE-SPOSTAMENTO-CLASSI-DAL-30-09-2024-_-1D-AG-EN-_-2D-AG-EN-_-1LA-_-2LA-_-3LA.pdf
>> .pdf to png...
>> Syntax Error: Document stream is empty
>> Error, could not create PDF output file: Operation not permitted
>> ```
>>
>> # Thoughts
>>
>> I'm providing tesseract a list of png files, and this worked while I was
>> outputting text (`tesseract pngs.txt "$name" -l ita txt`), but when I tried
>> doing the same for a pdf it didn't work :/
>>
>> I know there are lots of tools that use tesseract for this, but I prefer
>> doing it with tesseract + combo of other tools if necessary so that I get
>> better/easier control over tesseract itself.
>>
>> Thank you in advance, I'm sure this must be something common but I just
>> can't seem to get it right!
>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion visit
>> https://groups.google.com/d/msgid/tesseract-ocr/1726fc5f-202a-42ca-957f-4040f1fafcban%40googlegroups.com
>>
>> <https://groups.google.com/d/msgid/tesseract-ocr/1726fc5f-202a-42ca-957f-4040f1fafcban%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>