[Mayan EDMS: 607] Re: OCR error on .doc files

Alek Geldenberg Tue, 06 Aug 2013 12:16:57 -0700

After I modified the content of the /etc/magic file as shown in my previous 
posts, Mayan no longer has a problem processing docx files produced by 
Microsoft Word.  It does, however, have a problem processing docx files 
produced by LibreOffice, which is not a big deal, since it can process .odt 
files.
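
For comparison, an OOXML file can also be classified without libmagic by 
looking inside the ZIP container.  The sketch below is only an illustration 
and is not what Mayan or file(1) does.  It assumes that an OOXML package 
contains a member named [Content_Types].xml plus parts under word/, xl/ or 
ppt/; the name guess_ooxml_mime is made up for this example.  Because it looks 
at member names instead of a fixed byte offset, the order in which the 
producing application wrote the ZIP entries does not matter.

```python
import zipfile

# MIME types as they appear in the magic entries quoted in this thread.
OOXML_MIME_BY_PREFIX = {
    "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def guess_ooxml_mime(source):
    """Return an OOXML MIME type for a path or binary file object, or None.

    Hypothetical helper for illustration; not part of Mayan or libmagic.
    """
    if not zipfile.is_zipfile(source):
        return None  # not a ZIP container at all
    with zipfile.ZipFile(source) as archive:
        names = archive.namelist()
    if "[Content_Types].xml" not in names:
        return None  # a plain ZIP or an ODF file such as .odt
    for prefix, mime in OOXML_MIME_BY_PREFIX.items():
        if any(name.startswith(prefix) for name in names):
            return mime
    return None
```

Calling guess_ooxml_mime("testfile.docx") on a Word document should return 
the wordprocessingml MIME type whichever application saved it, provided the 
assumptions above hold.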


On Monday, August 5, 2013 5:49:07 AM UTC-4, Youri Lacan-Bartley wrote:
>
> Hi Alek,
>
> I imagine Roberto meant upgrading libmagic1 proper, not actually modifying 
> the /etc/magic config file which to my understanding is specific to the 
> file(1) command.
> Which version of libmagic1 are you using?
>
> I seem to have MS Office documents being correctly detected with version 
> 5.11-2.
>
> On Sunday, 4 August 2013 16:03:55 UTC+2, Alek Geldenberg wrote:
>>
>> Correction:
>>
>> Changing the /etc/magic file as I described fixes uploading of files made 
>> by MS Office.  However, docx files created by LibreOffice are still 
>> recognized as zip files.  I would love to know how /etc/magic has to be 
>> modified so that docx, xlsx and pptx files created by LibreOffice are also 
>> properly recognized.
>>
>> On Sunday, August 4, 2013 9:58:34 AM UTC-4, Alek Geldenberg wrote:
>>>
>>> Roberto,
>>>
>>> Could you kindly post exactly what you did to "upgrade libmagic1 
>>> file"?  I have read some posts about replacing the content of the 
>>> /etc/magic file with the content of msooxml.  I tried to do that in two 
>>> ways.  This way:
>>>
>>> #   Correct the mimetype with the registered ones:
>>> #     http://technet.microsoft.com/en-us/library/cc179224.aspx
>>> >>>>&26         string          word/           Microsoft Word 2007+
>>> !:mime application/vnd.openxmlformats-officedocument.wordprocessingml.document
>>> >>>>&26         string          ppt/            Microsoft PowerPoint 2007+
>>> !:mime application/vnd.openxmlformats-officedocument.presentationml.presentation
>>> >>>>&26         string          xl/             Microsoft Excel 2007+
>>> !:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
>>> >>>>&26         default         x               Microsoft OOXML
>>> !:strength +10
>>>
>>>
>>> and this way:
>>>
>>> #   Correct the mimetype with the registered ones:
>>> #     http://technet.microsoft.com/en-us/library/cc179224.aspx
>>> >>>>&26         string          word/           Microsoft Word 2007+ !:mime application/vnd.openxmlformats-officedocument.wordprocessingml.document
>>> >>>>&26         string          ppt/            Microsoft PowerPoint 2007+ !:mime application/vnd.openxmlformats-officedocument.presentationml.presentation
>>> >>>>&26         string          xl/             Microsoft Excel 2007+ !:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
>>> >>>>&26         default         x               Microsoft OOXML !:strength +10
>>>
>>>
>>>
>>> Here is the output of the file command:
>>> $ file testfile.docx
>>>
>>> /etc/magic, 31: Warning: description `Microsoft Word 2007+ !:mime application/vnd.openxmlformats-offi' truncated
>>> /etc/magic, 32: Warning: description `Microsoft PowerPoint 2007+ !:mime application/vnd.openxmlformat' truncated
>>> /etc/magic, 33: Warning: description `Microsoft Excel 2007+ !:mime application/vnd.openxmlformats-off' truncated
>>>
>>> However, none of that worked.  The docx files are still uploaded into 
>>> Mayan as zip files.
>>>
>>> I am running Ubuntu 12.04 LTS.
>>>
>>>
>>> I hope you can help me with this issue.
>>>
>>>
>>> On Monday, December 17, 2012 2:09:59 PM UTC-5, Roberto Rosario wrote:
>>>>
>>>> During a recent installation of Mayan, word processing documents (.docx) 
>>>> were being detected as zip/compressed files and OCR was failing on them. 
>>>> .docx files are in fact compressed archives containing several XML files. 
>>>> Upgrading the libmagic1 file allowed the 'file' command to detect the 
>>>> document as a "Microsoft Word 2007+" file and, upon reuploading, Mayan was 
>>>> able to OCR the documents correctly.  This could be one of the causes of 
>>>> the OCR failure being experienced in this thread.  Check whether the 
>>>> 'file' command correctly detects the document type.  
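
The check suggested above can be scripted.  This is a minimal sketch, assuming 
the file(1) binary is on the PATH and a current Python 3; the helper names are 
invented here and are not part of Mayan.  It asks file(1) for the MIME type 
only and flags the case described in this thread, where a .docx comes back as 
a generic ZIP archive.

```python
import subprocess


def build_file_command(path):
    """Build the argv for file(1).

    --brief omits the file name from the output and --mime-type prints
    only the detected MIME type.
    """
    return ["file", "--brief", "--mime-type", path]


def detect_mime(path):
    """Run file(1) on path and return the MIME type it reports."""
    result = subprocess.run(
        build_file_command(path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if file(1) itself fails
    )
    return result.stdout.strip()


def looks_like_plain_zip(mime):
    """True when the type is the generic ZIP type rather than an OOXML one."""
    return mime == "application/zip"
```

If looks_like_plain_zip(detect_mime("testfile.docx")) is true, the magic 
database in use does not recognize the document as an Office file.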
>>>>
>>>> This is the current list of file MIME types Mayan will pass to 
>>>> LibreOffice for conversion to PDF if detected: 
>>>> https://github.com/rosarior/mayan/blob/master/apps/converter/office_converter.py#L17
>>>>
>>>>
>>>> On Wednesday, December 5, 2012 9:07:00 PM UTC-4, Lau Llobet wrote:
>>>>>
>>>>> Hi Charles, Roberto and Steve, 
>>>>>
>>>>> I'm loving this software. I'm actually planning to start a file 
>>>>> digitization business for small businesses, and this is the software 
>>>>> I like most. 
>>>>>
>>>>> I'm having the same problem as you two: a simple error given by the 
>>>>> binaries in the OCR queue. 
>>>>>
>>>>> Following Roberto's advice, I'm stuck at running unpaper on a PDF, with 
>>>>> the same error about "the magic %P". unpaper doesn't handle PDF! So 
>>>>> maybe Roberto can give us another way to check what is going on inside 
>>>>> Mayan so we can simulate it by hand. 
>>>>>
>>>>> As far as I can see, there is no PDF output file from the "document as 
>>>>> an image" in the temporary folder, just an empty file called IHAKtmp, 
>>>>> so I guess the problem is at the first step, which should be the 
>>>>> LibreOffice jpg-to-PDF conversion. That would make sense, since we are 
>>>>> all using the same version of unpaper and tesseract but may not be 
>>>>> using the same LibreOffice. 
>>>>>
>>>>> I'm in a hurry trying to figure out which software is best for my 
>>>>> company, and I would happily make a donation once I have it working 
>>>>> locally. 
>>>>>
>>>>>
>>>>> Also, while trying to solve this issue, I've come to these observations: 
>>>>>
>>>>>
>>>>> 1 
>>>>> -------------------------- 
>>>>> Tesseract has to have its language training files under /usr/local/ 
>>>>> in order to work, 
>>>>>
>>>>> like this: 
>>>>>
>>>>> lau@lau-H61M-D2-B3:/usr/local/share/tessdata$ ls 
>>>>> cat.traineddata   eng.cube.fold  eng.cube.params     eng.tesseract_cube.nn 
>>>>> configs           eng.cube.lm    eng.cube.size       eng.traineddata 
>>>>> eng.cube.bigrams  eng.cube.nn    eng.cube.word-freq  tessconfigs 
>>>>>
>>>>>
>>>>> 2 
>>>>> -------------------------- 
>>>>> Making tesseract work directly with a .jpg from the scan gives EXTREMELY 
>>>>> better results than giving it a ppm "cleaned" by unpaper: in the 
>>>>> first case only 5 words on a page were mistaken, while with a cleaned 
>>>>> ppm tesseract gave only 3 comprehensible words in the whole page. No PDF 
>>>>> (jpg converted via LibreOffice) is accepted by tesseract, which gives: 
>>>>>
>>>>> lau@lau-H61M-D2-B3:/tmp$ tesseract tarja.pdf tessed 
>>>>> Tesseract Open Source OCR Engine v3.02.02 with Leptonica 
>>>>> Error in pixReadStream: Unknown format: no pix returned 
>>>>> Error in pixRead: pix not read 
>>>>> Unsupported image type. 
>>>>>
>>>>> 3 
>>>>> ---------------------------- 
>>>>> Having a metadata tag indicating a language in Mayan and using it to 
>>>>> set the language flag of tesseract can improve results a lot! (50 
>>>>> words per page) If my project ends up using Mayan, I would try to 
>>>>> implement this feature. 
>>>>>
>>>>
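
Regarding observation 3 in the oldest quoted message, the idea of driving 
tesseract's language flag from document metadata could look roughly like 
this.  It is a sketch under assumptions, not Mayan code: the metadata values 
and the LANGUAGE_CODES mapping are hypothetical, and only the "cat" and "eng" 
codes are used because those are the traineddata files shown in the listing 
in that message.  The -l option is tesseract's own language switch.

```python
# Hypothetical mapping from a document's language metadata value to a
# tesseract language code; extend it to match the installed traineddata files.
LANGUAGE_CODES = {
    "catalan": "cat",
    "english": "eng",
}


def build_tesseract_command(image_path, output_base, language=None):
    """Build the tesseract argv, adding -l only for a known language.

    language is the free-text metadata value; unknown or missing values
    fall back to tesseract's default language.
    """
    command = ["tesseract", image_path, output_base]
    code = LANGUAGE_CODES.get((language or "").strip().lower())
    if code:
        command += ["-l", code]
    return command
```

For example, build_tesseract_command("scan.jpg", "tessed", "Catalan") yields 
the arguments for "tesseract scan.jpg tessed -l cat", which can then be 
passed to subprocess.run.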

