Iterate over all the resources of each page, find the ones that are images, and then process each image’s data stream. (Obviously, there is a LOT of specific complexity here, but that’s the overview.)
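Once an image's samples have been decoded, the color test can be done without going through RGB. The following is only a minimal sketch of that check, not PoDoFo code: `DeviceSpace`, `componentsPerPixel` and `hasColorPixel` are hypothetical names, and it assumes 8 bits per component, the default Decode array, and only the three device colorspaces. Treating CMYK as neutral only when C = M = Y = 0 is also an assumption; Indexed, ICCBased, Separation, DeviceN and Lab images each need their own handling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical enum for this sketch: the three device colorspaces only.
enum class DeviceSpace { Gray, RGB, CMYK };

// Number of components per pixel for each device colorspace.
inline std::size_t componentsPerPixel(DeviceSpace cs) {
    switch (cs) {
        case DeviceSpace::Gray: return 1;
        case DeviceSpace::RGB:  return 3;
        case DeviceSpace::CMYK: return 4;
    }
    return 0;
}

// Returns true if any pixel in the decoded 8-bit sample buffer is non-neutral.
// The test is done in the image's own colorspace, with no RGB conversion.
bool hasColorPixel(const std::vector<std::uint8_t>& samples, DeviceSpace cs) {
    const std::size_t n = componentsPerPixel(cs);
    assert(n > 0 && samples.size() % n == 0);

    if (cs == DeviceSpace::Gray) {
        return false;  // a gray image has no chromatic component at all
    }
    for (std::size_t i = 0; i < samples.size(); i += n) {
        if (cs == DeviceSpace::RGB) {
            // Neutral only when R, G and B are all equal.
            if (samples[i] != samples[i + 1] || samples[i + 1] != samples[i + 2]) {
                return true;
            }
        } else {
            // CMYK: assumed neutral only when C, M and Y are all zero (K alone).
            if (samples[i] != 0 || samples[i + 1] != 0 || samples[i + 2] != 0) {
                return true;
            }
        }
    }
    return false;
}
```

For example, `hasColorPixel(samples, DeviceSpace::CMYK)` reports a K-only image as grayscale, which a round trip through RGB will not reliably do.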
You need to parse page content as well as annotation appearances, finding out what colors are invoked in what colorspaces. Again, not trivial but not rocket science either.

As mentioned, without a good PDF expert, this won’t happen. And even with one, be sure to allocate a few months of development.

NOTE: Other PDF libraries may provide some higher-level primitives for some of this, but it’s still going to be lots of work on your own regardless.

From: "Sriram Gopalan -ERS, HCL Tech"
Date: Friday, June 5, 2015 at 2:55 PM
To: Leonard Rosenthol, "podofo-us...@lists.sf.net"
Cc: Ahongsangbam Dorendro, Hariprabhakaran C
Subject: RE: [Podofo-users] Reg: Color Detection of PDF pages.

Hi Leonard,

Thank you for sharing the response.

1) Request to share the idea/logic of how to extract the raster images from a PDF page.
2) Currently we handle only CMYK/RGB color spaces by keywords. Request to share a high-level view of how to handle other color spaces.

Your input will help us a lot.
Thank you for your support.

Thanks & Rgds
Sriram K G

From: Leonard Rosenthol [mailto:lrose...@adobe.com]
Sent: 05 June 2015 21:10
To: Sriram Gopalan -ERS, HCL Tech; podofo-us...@lists.sf.net
Cc: Ahongsangbam Dorendro; Hariprabhakaran C
Subject: Re: [Podofo-users] Reg: Color Detection of PDF pages.

You can certainly use PoDoFo to find out what color values of each colorspace are used on each page of a PDF. As you have seen, you will also need a deep understanding of PDF and color theory to accomplish it, but PoDoFo can absolutely provide you with all the data for you to then analyze.

Leonard

From: "Sriram Gopalan -ERS, HCL Tech"
Date: Friday, June 5, 2015 at 10:16 AM
To: Leonard Rosenthol, "podofo-us...@lists.sf.net"
Cc: Ahongsangbam Dorendro, Hariprabhakaran C
Subject: RE: [Podofo-users] Reg: Color Detection of PDF pages.

Hi Leonard,

Thank you for your reply.
Request to confirm if we can detect the color of a page using Podofo with our approaches. If color detection of raster data is not feasible with Podofo, or if this is a limitation of Podofo, then we need to plan for other libraries.

Thank you for your support.

Thanks & Rgds
Sriram K G

From: Leonard Rosenthol [mailto:lrose...@adobe.com]
Sent: 05 June 2015 18:15
To: Sriram Gopalan -ERS, HCL Tech; podofo-us...@lists.sf.net
Cc: Ahongsangbam Dorendro; Hariprabhakaran C
Subject: Re: [Podofo-users] Reg: Color Detection of PDF pages.

Conversion to RGB will yield both false positives and false negatives. You need to do this in native colorspace.

Leonard

From: "Sriram Gopalan -ERS, HCL Tech"
Date: Friday, June 5, 2015 at 7:05 AM
To: "podofo-us...@lists.sf.net"
Cc: Ahongsangbam Dorendro, Hariprabhakaran C
Subject: [Podofo-users] Reg: Color Detection of PDF pages.

Hi all,

We have a requirement to detect the color of a page from a PDF. We are trying with Podofo for that. According to our study, we cannot insert/extract pages with huge size/object count. Request to correct our understanding. In case color detection of raster images is possible, request to share the details of the same.

We tried the following steps/approaches and we are facing some issues. Please find the details below. Request for your support in the same.

Approach 1, steps tried:
a) Using PdfTokenizer, iterate through each item in a page.
b) If the token is of type “Keyword”, then we look for the next item, which contains the data of the keyword. For example, if the token is “Stroking Color”, then we look for the next Variant, which contains the color value.
c) If the keyword falls in any of the above, we look for the next token (PDF Variant), which gives the color space of the object.
d) Based on the color space, we convert that to RGB and apply the logic of finding Color/Grayscale.
Approach 1, problems:
1) The above approach works for vector objects but not for raster. For raster (images), we are not getting the appropriate keywords for the color space. So we went with approach #2. Please find the same below.
2) Currently we are referring to the following keywords:
   - RGB Stroking / RGB Non Stroking
   - CMYK Stroking / CMYK Non Stroking
   - GrayScale Stroking / GrayScale Non Stroking
   Currently we have covered the RGB/CMYK/Grayscale color spaces. Request to confirm if any other keywords need to be used so that all color spaces can be covered.

Approach 2, steps tried:
a) Using PdfMemDocument, we extracted all the images and scanned each image pixel by pixel to identify a color pixel.
   - If the image is JPEG, it is converted to BMP and the RGB values are read pixel by pixel.
   - If the image is other than JPEG, it is converted to PPM and the RGB values are read pixel by pixel.
   If the image has a color pixel (R != G != B), then the image is identified as color.

Approach 2, problems:
1) This works fine and we are able to detect the color of each image separately, but we cannot map which image belongs to which page, as we cannot get the page number/reference from the "PdfImage" object. So we tried approach #3.

Approach 3, steps tried:
a) Load each page into a document. Each page in the original document is inserted into a new PdfMemDocument using the InsertPages() method.
b) Follow approach #2 for finding the color of the images in the page. At any point of time, the PdfMemDocument will have only 1 page.

Approach 3, problems:
1) This works for small files but fails for big files.
2) Based on our study, the new PdfMemDocument returns an object count.
   - For a blank PDF with 1 page, the new PdfMemDocument returns an object count of 15.
   - For a 100-page file (each page has exactly one image), the object count is roughly 415 for one page.
   - For huge files, say a 600 MB file with 6200 pages, the object count goes up to 2 lakh (200,000) and CPU utilization/memory goes up to 100%, which leads to a crash.
Thanks & Rgds
Sriram K G
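On the keyword-scanning approach discussed in this thread: in a PDF content stream the operands come before the operator (for example `1 0 0 rg`), so the color values have to be collected before the keyword is seen, not read after it. Besides `rg`/`RG`, `k`/`K` and `g`/`G`, a page can select colors through `cs`/`CS` with `sc`/`SC`/`scn`/`SCN`, paint shadings with `sh`, and draw images through `Do` or inline `BI` ... `EI` data. Below is a rough sketch of that bookkeeping, judged in native colorspace. `ScanResult` and `scanColorOperators` are hypothetical names, and the whitespace splitting is a deliberate simplification that ignores strings, arrays, comments and inline image data; a real implementation would take keywords and operands from the library's content tokenizer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type for this sketch.
struct ScanResult {
    bool usesColor = false;      // a non-neutral rg/RG or k/K value was set
    bool needsMoreWork = false;  // saw operators this sketch cannot judge
};

// True if the last three operands (r g b) are all equal.
static bool rgbIsNeutral(const std::vector<double>& ops) {
    assert(ops.size() >= 3);
    const std::size_t n = ops.size();
    return ops[n - 3] == ops[n - 2] && ops[n - 2] == ops[n - 1];
}

// True if the last four operands (c m y k) use only the K channel.
// Assumption: anything with C, M or Y non-zero counts as color.
static bool cmykIsNeutral(const std::vector<double>& ops) {
    assert(ops.size() >= 4);
    const std::size_t n = ops.size();
    return ops[n - 4] == 0.0 && ops[n - 3] == 0.0 && ops[n - 2] == 0.0;
}

ScanResult scanColorOperators(const std::string& content) {
    ScanResult result;
    std::vector<double> operands;  // numeric operands seen since the last operator
    std::istringstream in(content);
    std::string tok;
    while (in >> tok) {
        char* end = nullptr;
        const double value = std::strtod(tok.c_str(), &end);
        if (end != tok.c_str() && *end == '\0') {  // whole token is a number
            operands.push_back(value);
            continue;
        }
        if (tok[0] == '/') continue;  // name operand such as /CS0 or /Im0
        if (tok == "rg" || tok == "RG") {
            if (operands.size() >= 3 && !rgbIsNeutral(operands)) result.usesColor = true;
        } else if (tok == "k" || tok == "K") {
            if (operands.size() >= 4 && !cmykIsNeutral(operands)) result.usesColor = true;
        } else if (tok == "cs" || tok == "CS" || tok == "sc" || tok == "SC" ||
                   tok == "scn" || tok == "SCN" || tok == "sh" || tok == "Do" ||
                   tok == "BI") {
            // Colorspace resources, shadings, XObjects and inline images
            // must be resolved separately before the page can be judged.
            result.needsMoreWork = true;
        }
        // g/G (DeviceGray) and all non-color operators need no action here.
        operands.clear();
    }
    return result;
}
```

A page is only safely grayscale when `usesColor` is false and everything flagged by `needsMoreWork` has been followed up: the colorspace resources, shadings, form XObjects, images and annotation appearance streams.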