Thanks for the update, Pablo. This will help fix the issue faster.

On Fri, May 11, 2018, 4:18 PM Pablo Castro <[email protected]> wrote:
> UPDATE
>
> I was able to get the Spanish OCR working by simply deleting the
> mayan-edms Docker container and running it again. This successfully
> installed tesseract-ocr-spa.deb.
>
> On Friday, 11 May 2018 12:37:39 UTC-5, Pablo Castro wrote:
>>
>> Hello,
>>
>> I installed Mayan with the following guide:
>> https://www.mayan-edms.com/post/deploy-mayan-docker-mysql/
>>
>> This means I have two Docker containers, Mayan EDMS and MySQL, running
>> on an Ubuntu box.
>>
>> I tried the OCR function but got the following error in the OCR
>> errors log:
>>
>> (1366, "Incorrect string value: '\\xEF\\xAC\\x81\\x0A21...' for column 'content' at row 1")
>>
>> I tried a different document and got a similar error:
>>
>> (1366, "Incorrect string value: '\\xEF\\xAC\\x81eio...' for column 'content' at row 1")
>>
>> I assumed this was because the documents were being uploaded with "English"
>> as the document language, so I changed the default document language as
>> follows.
>>
>> I modified the local.py file under
>> var/lib/docker/volumes/mayan_data/_data/settings and added the following
>> lines:
>>
>> DOCUMENTS_LANGUAGE_CHOICES = (('deu', 'Deutsch'), ('eng', 'English'), ('spa', 'Spanish'))
>> DOCUMENTS_LANGUAGE = 'spa'
>>
>> This worked fine: the default language when adding a new document is now
>> Spanish, and the list contains only Spanish, English and German.
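
A side note on the two 1366 errors quoted above, in case it helps anyone who finds this thread later. Both rejected values start with the same three bytes, \xEF\xAC\x81. The sketch below (Python 3, standard library only) decodes those bytes and checks which codecs can represent the result. The `fits` helper is a made-up name for illustration, and the idea that the 'content' column or the connection uses a single-byte character set such as latin1 is only an assumption; nothing in the thread confirms the actual charset.

```python
import unicodedata

# The three leading bytes copied from the quoted 1366 error messages.
raw = b"\xef\xac\x81"

# They form one valid UTF-8 sequence: a single code point.
text = raw.decode("utf-8")
name = unicodedata.name(text)  # 'LATIN SMALL LIGATURE FI'


def fits(value, codec):
    """Hypothetical helper: True if `value` can be encoded with `codec`."""
    try:
        value.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# A single-byte charset like latin-1 has no slot for this ligature,
# while UTF-8 stores it without trouble.
print(name)
print("latin-1:", fits(text, "latin-1"))
print("utf-8:  ", fits(text, "utf-8"))
```

If that assumption holds, the 1366 error would be about the character set used to store the OCR text rather than about the document language setting, which would also explain why it is a separate problem from the missing spa.traineddata file.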
>>
>> Afterwards, I modified the envfile to install the Spanish tesseract package:
>>
>> # MySQL container
>> MYSQL_ROOT_PASSWORD=********
>> MYSQL_PASSWORD=*********
>> MYSQL_DATABASE=mayan_db
>> MYSQL_USER=mayan_user
>>
>> # Mayan container
>> MAYAN_DATABASE_DRIVER=django.db.backends.mysql
>> MAYAN_DATABASE_NAME=mayan_db
>> MAYAN_DATABASE_USER=mayan_user
>> MAYAN_DATABASE_PASSWORD=********
>> MAYAN_DATABASE_HOST=mayan-mysql
>> MAYAN_DATABASE_PORT=3306
>> MAYAN_APT_INSTALLS=libsasl2-dev python-dev libldap2-dev libssl-dev tesseract-ocr-spa
>> MAYAN_PIP_INSTALLS=python-ldap==2.4.41 django-auth-ldap==1.2.14
>>
>> I assumed this would be enough for OCR to work in Spanish, so I
>> restarted the Docker container and uploaded a document for OCR.
>>
>> OCR is still not working, and there is no entry in the OCR errors
>> tool.
>>
>> I checked the Docker logs for the mayan-edms container and found this:
>>
>> Error opening data file /usr/share/tesseract-ocr/tessdata/spa.traineddata
>> Please make sure the TESSDATA_PREFIX environment variable is set to the
>> parent directory of your "tessdata" directory.
>> Failed loading language 'spa'
>> Tesseract couldn't load any languages!
>> [2018-05-11 16:55:37,489: ERROR/MainProcess] Task ocr.tasks.task_do_ocr[fb11d940-faaa-4d51-8eb1-a20227ced574] raised unexpected: WorkerLostError('Worker exited prematurely: signal 11 (SIGSEGV).',)
>> Traceback (most recent call last):
>>   File "/usr/local/lib/python2.7/dist-packages/billiard/pool.py", line 1175, in mark_as_worker_lost
>>     human_status(exitcode)),
>> WorkerLostError: Worker exited prematurely: signal 11 (SIGSEGV).
>>
>> Has anyone experienced something similar? I am still searching for ways
>> to modify the TESSDATA_PREFIX environment variable, but my experience with
>> Docker is limited.
>>
>> Any help is appreciated.
>>
>> --
>
> ---
> You received this message because you are subscribed to the Google Groups
> "Mayan EDMS" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
