UPDATE: I was able to get the Spanish OCR working by simply deleting the mayan-edms Docker container and running it again. Recreating the container successfully installed the tesseract-ocr-spa package.

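For anyone hitting the same problem, here is a minimal sketch of a check for whether the language data Tesseract complains about is actually present. It assumes it is run inside the mayan-edms container with Python 3 available, and that the data directory is the one named in the error message quoted below. The helper name `missing_languages` is only for illustration and is not part of Mayan.

```python
from pathlib import Path

# Directory taken from the Tesseract error in the container log.
# Assumption: this is where the image keeps its language data.
TESSDATA_DIR = Path("/usr/share/tesseract-ocr/tessdata")


def missing_languages(languages, tessdata_dir=TESSDATA_DIR):
    """Return the language codes that have no <code>.traineddata file."""
    tessdata_dir = Path(tessdata_dir)
    return [
        code
        for code in languages
        if not (tessdata_dir / f"{code}.traineddata").is_file()
    ]


if __name__ == "__main__":
    # Codes from DOCUMENTS_LANGUAGE_CHOICES in local.py.
    missing = missing_languages(["deu", "eng", "spa"])
    if missing:
        print("Missing traineddata for:", ", ".join(missing))
    else:
        print("All language files are present.")
```

If `spa` shows up as missing after changing MAYAN_APT_INSTALLS, the package was probably not installed in the running container, which matches what I saw before recreating it.
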
On Friday, 11 May 2018 12:37:39 UTC-5, Pablo Castro wrote:
>
> Hello,
>
> I installed Mayan with the following guide:
> https://www.mayan-edms.com/post/deploy-mayan-docker-mysql/
>
> This means I have two Docker containers, Mayan-EDMS and MySQL, running on an Ubuntu box.
>
> I tried the OCR function but was getting the following error in the OCR errors log:
>
> (1366, "Incorrect string value: '\\xEF\\xAC\\x81\\x0A21...' for column 'content' at row 1")
>
> I tried a different document and got a similar error:
>
> (1366, "Incorrect string value: '\\xEF\\xAC\\x81eio...' for column 'content' at row 1")
>
> I assumed it was because the documents were being uploaded with "English" as the document language, so I changed the default document language as follows.
>
> I modified the local.py file under var/lib/docker/volumes/mayan_data/_data/settings and added the following lines:
>
> DOCUMENTS_LANGUAGE_CHOICES = (('deu', 'Deutsch'), ('eng', 'English'), ('spa', 'Spanish'))
> DOCUMENTS_LANGUAGE = 'spa'
>
> This worked fine: the default language when adding a new document is now Spanish, and the list contains just Spanish, English and German.
>
> Afterwards, I modified the envfile to install the Spanish Tesseract package (tesseract-ocr-spa, added to MAYAN_APT_INSTALLS):
>
> # MySQL container
> MYSQL_ROOT_PASSWORD=********
> MYSQL_PASSWORD=*********
> MYSQL_DATABASE=mayan_db
> MYSQL_USER=mayan_user
>
> # Mayan container
> MAYAN_DATABASE_DRIVER=django.db.backends.mysql
> MAYAN_DATABASE_NAME=mayan_db
> MAYAN_DATABASE_USER=mayan_user
> MAYAN_DATABASE_PASSWORD=********
> MAYAN_DATABASE_HOST=mayan-mysql
> MAYAN_DATABASE_PORT=3306
> MAYAN_APT_INSTALLS=libsasl2-dev python-dev libldap2-dev libssl-dev tesseract-ocr-spa
> MAYAN_PIP_INSTALLS=python-ldap==2.4.41 django-auth-ldap==1.2.14
>
> I assumed this would be enough for OCR to work in Spanish, so I restarted the Docker container and uploaded a document for OCR.
>
> OCR is still not working, and there is no error log under the OCR errors tool.
>
> I checked the docker logs for the mayan-edms container and found this:
>
> Error opening data file /usr/share/tesseract-ocr/tessdata/spa.traineddata
> Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
> Failed loading language 'spa'
> Tesseract couldn't load any languages!
> [2018-05-11 16:55:37,489: ERROR/MainProcess] Task ocr.tasks.task_do_ocr[fb11d940-faaa-4d51-8eb1-a20227ced574] raised unexpected: WorkerLostError('Worker exited prematurely: signal 11 (SIGSEGV).',)
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/billiard/pool.py", line 1175, in mark_as_worker_lost
>     human_status(exitcode)),
> WorkerLostError: Worker exited prematurely: signal 11 (SIGSEGV).
>
> Has anyone experienced something similar? I am still searching for ways to modify the TESSDATA_PREFIX environment variable but my experience with docker is limited.
>
> Any help is appreciated.
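
A side note on the 1366 errors from the original message: the leading bytes in both errors can be decoded to see which character the database refused. The sketch below uses only the Python standard library. The helper `fits_latin1` is hypothetical and only illustrates that the character falls outside a single-byte Latin-1 encoding. It does not tell us which character set the `content` column actually uses, since the schema is not shown here.

```python
import unicodedata

# Leading bytes quoted in both "Incorrect string value" errors.
raw = b"\xef\xac\x81"

# The bytes are valid UTF-8 and decode to a single character.
char = raw.decode("utf-8")
print(hex(ord(char)), unicodedata.name(char))
# prints: 0xfb01 LATIN SMALL LIGATURE FI


def fits_latin1(text):
    """Return True if every character can be encoded as Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


print(fits_latin1(char))  # prints: False
print(fits_latin1("fi"))  # prints: True
```

So the OCR output contains the "fi" ligature that is common in typeset PDFs, and the rejected value is most likely unrelated to the document language setting.
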
