Re: [Podofo-users] Reg: Color Detection of PDF pages.

Leonard Rosenthol Sun, 07 Jun 2015 01:30:23 -0700

Iterate over all resources of each page, find the ones that are Images, and 
then process the data stream. (Obviously, there is a LOT of specific 
complexity here, but that’s the overview.)
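
The resource walk described above can be sketched as follows. This is a minimal illustration in Python, not PoDoFo code: pages and XObjects are modeled as plain dictionaries (an assumption made for the example), and the helper names iter_images and images_by_page are hypothetical. The key names /Resources, /XObject, /Subtype, /Image and /Form are the standard PDF ones. Inline images embedded directly in a content stream are not covered.

```python
def iter_images(resources, seen=None):
    """Yield (name, xobject) for every image reachable from a resources dict."""
    seen = set() if seen is None else seen
    for name, xobj in resources.get("XObject", {}).items():
        if id(xobj) in seen:  # guard against forms that reference themselves
            continue
        seen.add(id(xobj))
        subtype = xobj.get("Subtype")
        if subtype == "Image":
            yield name, xobj
        elif subtype == "Form":
            # Form XObjects carry their own resources and may nest images.
            yield from iter_images(xobj.get("Resources", {}), seen)


def images_by_page(pages):
    """Map 1-based page number -> list of '<name>:<ColorSpace>' strings."""
    result = {}
    for number, page in enumerate(pages, start=1):
        result[number] = [
            f"{name}:{img.get('ColorSpace')}"
            for name, img in iter_images(page.get("Resources", {}))
        ]
    return result


# Hypothetical two-page document: a gray scan on page 1, and an RGB logo
# nested inside a form XObject on page 2.
scan = {"Subtype": "Image", "ColorSpace": "DeviceGray"}
logo = {"Subtype": "Image", "ColorSpace": "DeviceRGB"}
form = {"Subtype": "Form", "Resources": {"XObject": {"Im2": logo}}}
pages = [
    {"Resources": {"XObject": {"Im1": scan}}},
    {"Resources": {"XObject": {"Fm1": form}}},
]
print(images_by_page(pages))
```

Because the walk starts from each page's own resources, every image is reported together with the page that uses it, and no page has to be copied into a separate document.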


You need to parse page content as well as annotation appearances, finding out 
what colors are invoked in what colorspaces.   Again, not trivial but not 
rocket science either.
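
As a rough sketch of what parsing the content for color usage involves, the Python fragment below scans a content stream given as text and records every color-setting operator together with the color space it applies to. It relies on two facts from the PDF specification: operands come before their operator, and the initial color space is DeviceGray. Everything else is an assumption for illustration: the function name color_usage is hypothetical, and the tokenizer ignores string literals, arrays, comments and inline images, which a real parser must handle. The same scan would have to be run on annotation appearance streams and form XObjects.

```python
import re

# Names (/Cs1), numbers, then operators. Deliberately simplistic.
TOKEN = re.compile(r"/[^\s/\[\]()<>]+|[-+]?(?:\d+\.?\d*|\.\d+)|[A-Za-z'\"*]+")

# Operators that set a color directly in a device color space.
DEVICE_OPS = {"g": "DeviceGray", "G": "DeviceGray",
              "rg": "DeviceRGB", "RG": "DeviceRGB",
              "k": "DeviceCMYK", "K": "DeviceCMYK"}


def color_usage(stream):
    """Return (operator, color space, operands) for each color-setting operator."""
    current = {"stroke": "DeviceGray", "fill": "DeviceGray"}
    saved, operands, usage = [], [], []
    for tok in TOKEN.findall(stream):
        if tok.startswith("/"):
            operands.append(tok[1:])
            continue
        if tok[0] in "+-.0123456789":
            operands.append(float(tok))
            continue
        # Anything else is an operator; uppercase color operators are stroking.
        kind = "stroke" if tok.isupper() else "fill"
        if tok == "q":                      # save graphics state
            saved.append(dict(current))
        elif tok == "Q":                    # restore graphics state
            if saved:
                current = saved.pop()
        elif tok in DEVICE_OPS:
            current[kind] = DEVICE_OPS[tok]
            usage.append((tok, current[kind], tuple(operands)))
        elif tok in ("cs", "CS") and operands:
            current[kind] = operands[-1]    # name looked up in /ColorSpace resources
        elif tok in ("sc", "SC", "scn", "SCN"):
            usage.append((tok, current[kind], tuple(operands)))
        operands = []                       # every operator consumes its operands
    return usage


sample = "0.5 g 0 0 100 100 re f q 1 0 0 RG /Cs1 cs 0.2 0.4 scn Q 0 0 0 1 k"
for entry in color_usage(sample):
    print(entry)
```

The output is only raw data: deciding whether a given set of operands counts as "color" still depends on the color space each one belongs to.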

As mentioned, without a good PDF expert, this won’t happen.  And even with one, 
be sure to allocate a few months of development.

NOTE: Other PDF libraries may provide some higher level primitives for some of 
this – but it’s still going to be lots of work on your own regardless.

From: "Sriram Gopalan -ERS, HCL Tech"
Date: Friday, June 5, 2015 at 2:55 PM
To: Leonard Rosenthol, podofo-us...@lists.sf.net
Cc: Ahongsangbam Dorendro, Hariprabhakaran C
Subject: RE: [Podofo-users] Reg: Color Detection of PDF pages.

Hi Leonard,

Thank you for sharing the response.

1) Request to share the idea/logic of how to extract the raster images 
from a PDF page.

2) Currently we handle only CMYK/RGB color spaces by keywords.
Request to share a high level view of how to handle other color spaces.



Your input will help us a lot.

Thank you for your support.



Thanks & Rgds

Sriram K G

From: Leonard Rosenthol [mailto:lrose...@adobe.com]
Sent: 05 June 2015 21:10
To: Sriram Gopalan -ERS, HCL Tech; podofo-us...@lists.sf.net
Cc: Ahongsangbam Dorendro; Hariprabhakaran C
Subject: Re: [Podofo-users] Reg: Color Detection of PDF pages.

You can certainly use PoDoFo to find out what color values of each colorspace 
are used on each page of a PDF. As you have seen, you will also need a deep 
understanding of PDF and color theory to accomplish it – but PoDoFo can 
absolutely provide you with all the data for you to then analyze.

Leonard

From: "Sriram Gopalan -ERS, HCL Tech"
Date: Friday, June 5, 2015 at 10:16 AM
To: Leonard Rosenthol, podofo-us...@lists.sf.net
Cc: Ahongsangbam Dorendro, Hariprabhakaran C
Subject: RE: [Podofo-users] Reg: Color Detection of PDF pages.

Hi Leonard,

Thank you for your reply.

Request to confirm whether we can detect the color of a page using PoDoFo with 
our approaches.
If color detection of raster data is not feasible with PoDoFo, or if this is a 
limitation of PoDoFo, then we need to plan for other libraries.

Thank you for your support.

Thanks & Rgds
Sriram K G

From: Leonard Rosenthol [mailto:lrose...@adobe.com]
Sent: 05 June 2015 18:15
To: Sriram Gopalan -ERS, HCL Tech; podofo-us...@lists.sf.net
Cc: Ahongsangbam Dorendro; Hariprabhakaran C
Subject: Re: [Podofo-users] Reg: Color Detection of PDF pages.

Conversion to RGB will yield both false positives and false negatives.  You 
need to do this in native colorspace.
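
One way to see the problem, as a sketch only: assume "color" means "uses any ink other than black", and assume the textbook device formula for CMYK to RGB (real converters typically use ICC profiles instead). A light tint made of equal parts cyan, magenta and yellow then converts to three equal RGB components, so an RGB test reports gray even though the colored inks are used. In the other direction, a profile-based conversion may turn a pure-black CMYK value into RGB components that are not exactly equal, which an RGB test would flag as color. The function names below are hypothetical.

```python
def naive_cmyk_to_rgb(c, m, y, k):
    """Textbook device conversion, used here only for illustration."""
    return ((1 - c) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k))


def looks_gray_in_rgb(rgb, tol=0.0):
    """The RGB-side test: all three components agree within tol."""
    return max(rgb) - min(rgb) <= tol


def uses_only_black_ink(c, m, y, k):
    """The native-space test: no cyan, magenta or yellow at all."""
    return c == 0 and m == 0 and y == 0


tint = (0.2, 0.2, 0.2, 0.0)  # 20% each of C, M, Y and no black
print(looks_gray_in_rgb(naive_cmyk_to_rgb(*tint)))  # RGB test says gray
print(uses_only_black_ink(*tint))                   # native test says colored inks
```

The two tests disagree on the same value, which is why the decision has to be made on the operands as they appear in the file, per color space.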

Leonard

From: "Sriram Gopalan -ERS, HCL Tech"
Date: Friday, June 5, 2015 at 7:05 AM
To: podofo-us...@lists.sf.net
Cc: Ahongsangbam Dorendro, Hariprabhakaran C
Subject: [Podofo-users] Reg: Color Detection of PDF pages.

Hi all,

We have a requirement to detect the color of a page in a PDF.
We are trying to use PoDoFo for that.

According to our study, we cannot insert/extract pages with a huge size/object 
count.
Request to correct our understanding.
If color detection of raster images is possible, request to share the 
details of the same.

We tried with the following STEPS/approaches and we are facing some issues.
Please find the details below:

Request for your support in the same.

Approach 1

Steps tried:

a) Using PdfTokenizer, iterate through each item in a page.
b) If the token is of type “Keyword”, then we look for the next item, which 
contains the data of the keyword.
For example, if the token is “Stroking Color”, then we look for the next Variant, 
which contains the color value.
c) If the keyword falls in any of the above, we look for the next token (PDF 
Variant), which gives the color space of the object.
d) Based on the color space, we convert the value to RGB and apply the logic of 
finding Color/Grayscale.

Problems:

1) The above approach works for vector objects but not for raster. For 
raster (images), we are not getting the appropriate keywords for the color space.
So we went with Approach #2. Please find the same below.
2) Currently we are referring to the following keywords:
- RGB Stroking / RGB Non Stroking
- CMYK Stroking / CMYK Non Stroking
- GrayScale Stroking / GrayScale Non Stroking
So far we have covered the RGB/CMYK/Grayscale color spaces.
Request to confirm if any other keywords need to be used so that all color 
spaces can be covered.
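
On the question of other keywords: besides the device operators (G/g, RG/rg, K/k), PDF selects any other color space with CS/cs and sets a color in it with SC/SCN/sc/scn. The name given to CS/cs is looked up in the /ColorSpace entry of the current resources. The sketch below shows one possible way to triage the color space families defined by the PDF specification. It is Python over plain lists and dictionaries, not PoDoFo code. The names resolve and classify and the result labels are hypothetical, and treating every Separation or DeviceN colorant other than Black and None as "color" is a policy assumption, not a rule from the specification.

```python
GRAY_ONLY = {"DeviceGray", "CalGray"}
VALUE_DEPENDENT = {"DeviceRGB", "DeviceCMYK", "CalRGB", "Lab"}


def resolve(name, resources):
    """Look a cs/CS operand up in /ColorSpace; device names resolve to themselves."""
    return resources.get("ColorSpace", {}).get(name, name)


def classify(cs):
    """Return 'gray', 'color', 'check-values', 'check-palette' or 'check-pattern'."""
    family = cs if isinstance(cs, str) else cs[0]
    if family in GRAY_ONLY:
        return "gray"
    if family in VALUE_DEPENDENT:
        return "check-values"            # depends on the operands actually used
    if family == "ICCBased":
        # cs[1] stands for the profile stream dictionary; /N is its component count.
        return "gray" if cs[1].get("N") == 1 else "check-values"
    if family == "Indexed":
        # [Indexed, base, hival, lookup]: colors live in the palette.
        return "gray" if classify(cs[1]) == "gray" else "check-palette"
    if family == "Separation":
        # [Separation, colorant name, alternate space, tint transform]
        return "gray" if cs[1] in ("Black", "None") else "color"
    if family == "DeviceN":
        # [DeviceN, [colorant names], alternate space, tint transform]
        return "gray" if set(cs[1]) <= {"Black", "None"} else "color"
    if family == "Pattern":
        return "check-pattern"           # the pattern has its own content or shading
    raise ValueError(f"unknown color space family: {family}")


# Hypothetical /ColorSpace resources for one page.
resources = {"ColorSpace": {
    "Cs1": ["ICCBased", {"N": 3}],
    "Cs2": ["Separation", "Black", "DeviceCMYK", None],
    "Cs3": ["Indexed", "DeviceRGB", 255, b""],
    "Cs4": ["Separation", "MySpot", "DeviceCMYK", None],
}}
for name in ("Cs1", "Cs2", "Cs3", "Cs4", "DeviceGray"):
    print(name, classify(resolve(name, resources)))
```

Image XObjects name their color space the same way through their /ColorSpace entry, so the same triage could be applied to images before looking at any pixel data.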

Approach 2

Steps tried:

a) Using PdfMemDocument, we extracted all the images and scanned each image 
pixel by pixel to identify a color pixel.
- If the image is JPEG, the image is converted to BMP and the RGB values are 
read pixel by pixel.
- If the image is other than JPEG, it is converted to PPM and the RGB values 
are read pixel by pixel.
If the image has a color pixel (R != G != B), then the image is identified as 
color.

Problems:

1) This works fine and we are able to detect the color of each image separately, 
but we cannot map which image belongs to which page, 
as we cannot get the page number/reference from the "PdfImage" object.

So we tried Approach #3.
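
A note on the pixel test in Approach #2: if "R != G != B" is read as a pairwise chain (R differs from G, and G differs from B), it misses pixels such as (10, 10, 200), where two channels happen to match. A neutral pixel is one where all three channels agree, so a pixel is colored as soon as any two differ. The sketch below shows that test in Python. The function names and the tolerance parameter are hypothetical; the tolerance is an optional allowance for small channel differences that lossy compression may introduce.

```python
def is_color_pixel(r, g, b, tolerance=0):
    """True when the three channels do not all agree within tolerance."""
    return max(r, g, b) - min(r, g, b) > tolerance


def image_has_color(pixels, tolerance=0):
    """pixels: iterable of (r, g, b) tuples; stops at the first color pixel."""
    return any(is_color_pixel(r, g, b, tolerance) for r, g, b in pixels)


# (10, 10, 200) is clearly colored, yet the chained comparison rejects it.
r, g, b = 10, 10, 200
print(r != g != b)              # Python reads this as (r != g) and (g != b)
print(is_color_pixel(r, g, b))
```

This test operates on decoded RGB samples, so it inherits the limits of whatever conversion produced them.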

Approach 3

Steps tried:

a) Load each page into a document.
     Each page in the original document is inserted into a new 
PdfMemDocument using the InsertPages() method.
b) Follow Approach #2 to find the color of the images in the page.

At any point in time, the PdfMemDocument has only 1 page.

Problems:

1) This works for small files but fails for big files.
2) Based on our study, the new PdfMemDocument returns an object count.
- For a blank PDF with 1 page, the new PdfMemDocument returns an object count 
of 15.
- For a 100-page file (each page has exactly one image), the object count is 
roughly 415 for one page.
- For huge files, say a 600 MB file with 6200 pages, the object count goes 
up to 2 lakh (200,000) and CPU utilization/memory goes up to 100%, which leads 
to a crash.



Thanks & Rgds
Sriram K G



