[Mayan EDMS: 2463] Re: Error with OCR in Spanish - Mayan 2.7

Pablo Castro Fri, 11 May 2018 13:19:38 -0700

UPDATE

I was able to get Spanish OCR working by simply deleting the mayan-edms 
Docker container and running it again. Recreating the container 
successfully installed tesseract-ocr-spa.deb
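
In case it helps anyone debugging the same thing: one way to confirm that the 
language data is really present after recreating the container is to look for 
the .traineddata files in the directory named in the Tesseract error quoted 
below. This is only a minimal sketch. It assumes the tessdata path from that 
log is the one Tesseract actually uses, and that the script is run inside the 
mayan-edms container; missing_languages is a hypothetical helper name, not 
part of Mayan.

```python
import os

# Directory copied from the Tesseract error in the container log.
# Assumption: this is the tessdata directory in use inside the container.
TESSDATA_DIR = "/usr/share/tesseract-ocr/tessdata"


def missing_languages(codes, tessdata_dir=TESSDATA_DIR):
    """Return the language codes that have no <code>.traineddata file."""
    missing = []
    for code in codes:
        path = os.path.join(tessdata_dir, code + ".traineddata")
        if not os.path.isfile(path):
            missing.append(code)
    return missing


# The three codes listed in DOCUMENTS_LANGUAGE_CHOICES.
print(missing_languages(["deu", "eng", "spa"]))
```

An empty list means every configured language has its data file; any code 
that is printed would still trigger the "Failed loading language" error.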




On Friday, 11 May 2018 12:37:39 UTC-5, Pablo Castro wrote:
>
> Hello,
>
> I installed Mayan with the following guide: 
> https://www.mayan-edms.com/post/deploy-mayan-docker-mysql/
>
> This means I have two Docker containers, Mayan EDMS and MySQL, running 
> on an Ubuntu box.
>
> I tried the OCR function but was getting the following error in the OCR 
> errors log:
>
> (1366, "Incorrect string value: '\\xEF\\xAC\\x81\\x0A21...' for column 
> 'content' at row 1")
>
> Tried with a different document and got a similar error:
>
> (1366, "Incorrect string value: '\\xEF\\xAC\\x81eio...' for column 
> 'content' at row 1")
>
> I assumed it was because the documents were being uploaded with "English" 
> as the document language, so I changed the default document language as 
> follows:
>
>
> I modified the local.py file under 
> var/lib/docker/volumes/mayan_data/_data/settings and added the following 
> lines:
>
> DOCUMENTS_LANGUAGE_CHOICES = (
>     ('deu', 'Deutsch'), ('eng', 'English'), ('spa', 'Spanish'),
> )
> DOCUMENTS_LANGUAGE = 'spa'
>
> This worked fine, and now the default language when adding a new document 
> is Spanish and the list contains just Spanish, English and German.
>
> Afterwards, I modified the envfile to install the Spanish Tesseract package:
>
> # MySQL container
> MYSQL_ROOT_PASSWORD=********
> MYSQL_PASSWORD=*********
> MYSQL_DATABASE=mayan_db
> MYSQL_USER=mayan_user
>
> # Mayan container
> MAYAN_DATABASE_DRIVER=django.db.backends.mysql
> MAYAN_DATABASE_NAME=mayan_db
> MAYAN_DATABASE_USER=mayan_user
> MAYAN_DATABASE_PASSWORD=********
> MAYAN_DATABASE_HOST=mayan-mysql
> MAYAN_DATABASE_PORT=3306
> MAYAN_APT_INSTALLS=libsasl2-dev python-dev libldap2-dev libssl-dev tesseract-ocr-spa
> MAYAN_PIP_INSTALLS=python-ldap==2.4.41 django-auth-ldap==1.2.14
>
> I assumed this should be enough for OCR to work in Spanish, so I 
> restarted the Docker container and uploaded a document for OCR.
>
> OCR is still not working, and there's no error log under the OCR errors 
> tool.
>
> I checked the docker logs for the mayan-edms container and found this:
>
> Error opening data file /usr/share/tesseract-ocr/tessdata/spa.traineddata
> Please make sure the TESSDATA_PREFIX environment variable is set to the 
> parent directory of your "tessdata" directory.
> Failed loading language 'spa'
> Tesseract couldn't load any languages!
> [2018-05-11 16:55:37,489: ERROR/MainProcess] Task ocr.tasks.task_do_ocr[fb11d940-faaa-4d51-8eb1-a20227ced574] raised unexpected: WorkerLostError('Worker exited prematurely: signal 11 (SIGSEGV).',)
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/billiard/pool.py", line 1175, in mark_as_worker_lost
>     human_status(exitcode)),
> WorkerLostError: Worker exited prematurely: signal 11 (SIGSEGV).
>
>
> Has anyone experienced something similar? I am still searching for ways to 
> modify the TESSDATA_PREFIX environment variable but my experience with 
> docker is limited.
>
> Any help is appreciated.
>
>
>
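
A note on the 1366 errors quoted above, for anyone who lands here with the 
same message: the leading bytes \xEF\xAC\x81 are the UTF-8 encoding of a 
single character, the "fi" ligature, which OCR output often contains. The 
sketch below (Python 3) decodes them and checks which encodings can hold the 
character. One possible explanation, not confirmed anywhere in this thread, is 
that the 'content' column uses a single-byte charset such as latin1; the 
"latin-1" codec here is only a stand-in for that assumption, and fits_charset 
is a hypothetical helper name.

```python
import unicodedata

# Leading bytes copied from the error message: '\xEF\xAC\x81...'
raw = b"\xef\xac\x81"

# Decode the three bytes as UTF-8 and identify the resulting character.
ch = raw.decode("utf-8")
print(repr(ch), "U+%04X" % ord(ch), unicodedata.name(ch))


def fits_charset(text, python_codec):
    """Return True if text can be encoded with the given Python codec."""
    try:
        text.encode(python_codec)
        return True
    except UnicodeEncodeError:
        return False


# Assumption: latin-1 stands in for a single-byte MySQL column charset.
print("latin-1:", fits_charset(ch, "latin-1"))
print("utf-8:", fits_charset(ch, "utf-8"))
```

If that assumption holds, the error would be about the database character set 
rather than the document language, which would fit with the error looking the 
same for both documents.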
