[Mayan EDMS: 1914] Re: OCR quality JPG vs. PDF

Roberto Rosario Mon, 24 Jul 2017 23:01:15 -0700

Hello,

I recently published a blog post explaining how the converter 
works: http://www.mayan-edms.org/post/mayan-converter/
In the case of PDF files, the utility pdftoppm is used to convert the pages 
into images. You can use pdftoppm on the PDF files
made by img2pdf to see the actual image Mayan is receiving and spot any 
degradation.
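
A minimal sketch of that check, assuming pdftoppm (from poppler-utils) is installed and on the PATH. The 300 DPI value mirrors the scan resolution mentioned in the question, not necessarily the resolution Mayan itself uses, and the function names and file names are hypothetical:

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, output_prefix, dpi=300,
                           first_page=None, last_page=None):
    """Return the argument list that renders PDF pages to PNG files."""
    # -r sets the rendering resolution in DPI; -png selects PNG output.
    command = ["pdftoppm", "-r", str(dpi), "-png"]
    if first_page is not None:
        command += ["-f", str(first_page)]  # first page to render
    if last_page is not None:
        command += ["-l", str(last_page)]   # last page to render
    command += [str(pdf_path), str(output_prefix)]
    return command


def render_pages(pdf_path, output_prefix, dpi=300):
    """Run pdftoppm and return the image files it produced."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm not found; it ships with poppler-utils")
    subprocess.run(build_pdftoppm_command(pdf_path, output_prefix, dpi),
                   check=True)
    prefix = Path(output_prefix)
    # pdftoppm names its output <prefix>-<page number>.png
    return sorted(prefix.parent.glob(prefix.name + "-*.png"))


if __name__ == "__main__":
    # "combined.pdf" stands in for a PDF produced by img2pdf.
    for image in render_pages("combined.pdf", "page"):
        print(image)
```

Comparing the rendered images with the original JPG scans at different DPI values should show whether the rendering step is where the quality is lost.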

As for your questions:
1) The OCR doesn't preprocess the images before doing the recognition. 
This is something being worked on (there is already a scanline filter to 
reduce pre-OCR images to 2 colors), but it is not available to the user yet. 
When available, it will be possible to apply a stack of transformations to 
the document images before performing the OCR task.
2) Strictly speaking about file types, there is no way to make a multi-page 
JPEG; the format doesn't support it (JPEG 2000 has the JPM and JPX 
extensions, which might work, but I don't know how good Pillow's JPEG 2000 
support is). Another JPEG format which could be used is MJPG, but it is for 
video, and converting the frames to pages would be a hackish attempt. On 
the platform side, you can already group images in Mayan using an Index 
or a SmartLink. All the JPEG uploads need is a unique marker (like a 
metadata value or a filename fragment). This can be accomplished via the UI 
and the API. For example, the index template 
{{ document.label|slice:":4" }} 
will group all documents with the same first 4 characters in the name. 
To use a different part of the filename for the grouping, just change the 
slice argument 
(http://www.diveintopython3.net/native-datatypes.html#slicinglists).
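
To illustrate how the slice argument picks the grouping key, here is a small plain-Python sketch. It only mimics the slicing with Python's own syntax, which the template filter follows; it is not how Mayan builds its indexes, and the function name and file names are made up for the example:

```python
from collections import defaultdict


def group_by_label_slice(labels, start=None, stop=4):
    """Group labels by the fragment label[start:stop].

    The defaults correspond to the slice argument ":4" from the
    index template; start=5, stop=9 would correspond to "5:9".
    """
    key_slice = slice(start, stop)
    groups = defaultdict(list)
    for label in labels:
        groups[label[key_slice]].append(label)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical scans where the first 4 characters identify the document.
    labels = ["0001_page1.jpg", "0001_page2.jpg", "0002_page1.jpg"]
    for key, members in group_by_label_slice(labels).items():
        print(key, members)
```

If the marker sits elsewhere in the filename, for example characters 5 through 8 of scan_0001_a.jpg, the slice argument would be "5:9" instead of ":4".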

On Monday, July 24, 2017 at 1:57:31 PM UTC-4, Florian Beverborg wrote:
>
> Hi all!
>
> I'm currently evaluating Mayan as a replacement for my current DMS. The 
> documents are all in the JPG format, multiple pages of the same document 
> per folder, scanned at 300dpi. So far adding JPGs does not allow me to 
> create multi-page documents. I used img2pdf to generate multi-page PDFs for 
> import into Mayan, which mostly works fine. BUT: The OCR-quality for the 
> same page is worse when using the PDF files.
>
> I've tried multiple ways to generate the combined PDF and I can see some 
> differences but never managed to get the same recognition quality as using 
> the pure JPG. Since img2pdf (to my knowledge) does not touch the actual JPG 
> data and since I'm using PDF page size fit to image size I don't know 
> what's going wrong here. The PDFs look fine in my PDF viewer and are 
> reported to have correct page sizes. Generating the pages with imagemagick 
> does not improve recognition.
>
> This leads me to the conclusion that the PDFs are rendered internally 
> which degrades the quality.
>
> I have two questions:
>
> 1) What can I do to improve PDF recognition quality, either in generating 
> the PDF or in Mayan settings?
> 2) Is there another way to make multi-page documents from JPGs? Maybe 
> using the REST-API?
>
> Using Mayan version 2.6.2
>
> Cheers,
> Flo
>
>
>

