Re: [tesseract-ocr] OCR multiple pngs into one PDF

Alessandro Griseta Sun, 06 Jul 2025 07:17:57 -0700

Sorry about that, looks like it wasn't so clear at all. Anyway, I ended up 
completing a script, so here it is:


```bash
lang="mt"
#pp replace original file
if [ -z "$lang" ]; then
read -p "Choose lang (gr/la/it/en): " lang && [[ $lang == gr || $lang == la || $lang == it || $lang == en ]] || exit 1
fi #x cp } deb
# b as well as presets, allow direct input > this will make case statements less verbose / eliminate them entirely
#check if $lang has a value first before asking <- will need to accept cmd parameters if it turns into a shell script
#pp add option to choose output (pdf/txt etc.)

root="$HOME/Downloads/" #root="/storage/emulated/0/Download/tmp/" #pp cross-platform <- use $HOME env var
mkdir -p "$root"; cd "$root"
find "$root" -name "*.png" -type f -delete

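# list input PDFs only; skip *.pdf.pdf files written by earlier OCR runs
find "$root" -type f -name "*.pdf" ! -name "*.pdf.pdf" > pdfs.txt
while IFS= read -r pdf; do

echo "pdf to convert: $pdf"
name="$(basename "$pdf")"
echo "converting $name to png..."
pdftoppm "$pdf" "$name" -png #-f 12 -l 17 #i# ^ cpu cores
# only the pages of the current PDF (pdftoppm names them "$name-N.png")
find "$root" -type f -name "$name-*.png" | sort > pngs.txt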
case $lang in

  gr)
    tesseract pngs.txt "$name" -l lat+grc+ita pdf
    ;;

  la)
    tesseract pngs.txt "$name" -l lat+ita pdf
    ;;

  it)
    tesseract pngs.txt "$name" -l ita pdf
    ;;
  en)
    tesseract pngs.txt "$name" -l eng+ita pdf
    ;;
  dz-la)
    tesseract pngs.txt "$name" -l lat+eng pdf
    ;;
  dz-gr)
    tesseract pngs.txt "$name" -l grc+eng pdf
    ;;
  mt)
    tesseract pngs.txt "$name" -l ita+equ --tessdata-dir "$HOME/Downloads/tessdata" pdf
    ;;
esac
#find $root -name "*.png" -type f -delete
done <pdfs.txt
```
It's not much, but it's good enough for me to paste into a terminal (I have 
also tested it on Termux).
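
The notes in the script mention allowing direct input next to the presets and accepting command-line parameters. A minimal sketch of that idea follows, untested against a real tesseract run. `resolve_langs` is a hypothetical helper name, the preset strings are copied from the case statement above, and the two positional parameters (language first, output format second) are an assumption about how the script might be called. The `mt` preset would still need its `--tessdata-dir` argument added separately.

```bash
# Hypothetical helper (not part of the script above): print the -l value
# for a known preset, or pass any other value through as a direct -l string.
resolve_langs() {
  case "$1" in
    "")    return 1 ;;              # nothing chosen: let the caller decide
    gr)    echo "lat+grc+ita" ;;
    la)    echo "lat+ita" ;;
    it)    echo "ita" ;;
    en)    echo "eng+ita" ;;
    dz-la) echo "lat+eng" ;;
    dz-gr) echo "grc+eng" ;;
    mt)    echo "ita+equ" ;;
    *)     echo "$1" ;;             # direct input, e.g. "deu+fra"
  esac
}

# Assumed calling convention: first parameter is the preset or language
# string, second is the output format; both fall back to a default.
lang="${1:-mt}"
format="${2:-pdf}"
langs="$(resolve_langs "$lang")" || exit 1

# Print the single command that would replace the per-preset branches.
printf 'tesseract pngs.txt "$name" -l %s %s\n' "$langs" "$format"
```

With this in place the loop body would need only one tesseract line, `tesseract pngs.txt "$name" -l "$langs" "$format"`, instead of one per preset.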

On Saturday, 14 June 2025 at 20:13:31 UTC+2 zdenop wrote:

> Is this "AI" generated?
>
> The errors provided are not the output of "Steps to reproduce".
>
> Zdenko
>
>
> On Sat, 14. 6. 2025 at 15:04, jollysalmon <[email protected]> wrote:
>
>> # Steps to reproduce
>>
>> ```bash
>> tesseract pngs.txt "$name" -l ita pdf
>> ```
>>
>> # Error
>>
>> ```
>> Page 0 : 
>> /storage/emulated/0/Download/tmp/CIRCOLARE-SPOSTAMENTO-CLASSI-DAL-30-09-2024-_-1D-AG-EN-_-2D-AG-EN-_-1LA-_-2LA-_-3LA.pdf-1.png
>> pdf to convert: 
>> /storage/emulated/0/Download/tmp/CIRCOLARE-SPOSTAMENTO-CLASSI-DAL-30-09-2024-_-1D-AG-EN-_-2D-AG-EN-_-1LA-_-2LA-_-3LA.pdf.txt
>> Syntax Warning: May not be a PDF file (continuing anyway)
>> Syntax Error: Couldn't find trailer dictionary
>> Syntax Error: Couldn't find trailer dictionary
>> Syntax Error: Couldn't read xref table
>> Error in findFileFormatStream: truncated file
>> Error during processing.
>> pdf to convert: 
>> /storage/emulated/0/Download/tmp/CIRCOLARE-SPOSTAMENTO-CLASSI-DAL-30-09-2024-_-1D-AG-EN-_-2D-AG-EN-_-1LA-_-2LA-_-3LA.pdf.pdf
>> converting 
>> CIRCOLARE-SPOSTAMENTO-CLASSI-DAL-30-09-2024-_-1D-AG-EN-_-2D-AG-EN-_-1LA-_-2LA-_-3LA.pdf
>> .pdf to png...
>> Syntax Error: Document stream is empty
>> Error, could not create PDF output file: Operation not permitted
>> ```
>>
>> # Thoughts
>>
>> I'm providing tesseract a list of png files, and this worked while I was 
>> outputting text (`tesseract pngs.txt "$name" -l ita txt`), but when I tried 
>> doing the same for a pdf it didn't work :/
>>
>> I know there are lots of tools that use tesseract for this, but I prefer 
>> doing it with tesseract + combo of other tools if necessary so that I get 
>> better/easier control over tesseract itself.
>>
>> Thank you in advance, I'm sure this must be something common but I just 
>> can't seem to get it right!
>>
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/1726fc5f-202a-42ca-957f-4040f1fafcban%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/tesseract-ocr/1726fc5f-202a-42ca-957f-4040f1fafcban%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>

