[jira] Updated: (PDFBOX-582) Ignoring text over images
[ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maruan Sahyoun updated PDFBOX-582:
----------------------------------

    Attachment: PageDrawer.patch

The patch adds a basic implementation for PDTextState.RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT in order to support applications where text is invisibly included in a PDF as part of an OCR result. A more generic approach needs to be implemented in order to fully support the different text rendering modes.

> Ignoring text over images
> -------------------------
>
>                 Key: PDFBOX-582
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-582
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction, Utilities
>    Affects Versions: 0.8.0-incubator
>            Reporter: Villu Ruusmann
>         Attachments: PageDrawer.patch, pg_0005.pdf, pg_0005.png
>
> Scientific publishers often publish older articles (year 2000 and earlier) in
> scanned form. However, sometimes they seem to have conducted OCR, and added
> the recovered text as an overlay in order to give the end user a "native PDF"
> feeling in a sense that it is possible to copy and paste text.
> PDFBox differs from other PDF viewers (tested with Adobe Acrobat Reader 7.0,
> Foxit Reader 3.1, iText 2.1) so that it tries to render both the image part
> and the textual overlay part, which may produce confusing results.
> Actually, there are two separate cases:
> *) Page rendering (class org.apache.pdfbox.pdfviewer.PageDrawer): Render the
> image part and ignore the text part.
> *) Text extraction (class org.apache.pdfbox.util.PDFTextStripper): Ignore the
> image part and work upon the text part.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PDFBOX-582) Ignoring text over images
[ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852034#action_12852034 ]

Igor Podolskiy commented on PDFBOX-582:
---------------------------------------

Indeed, the issue I was suspecting was fixed by Andreas (at least the code seems changed in the right direction with the IndexColorModels and all).
[jira] Commented: (PDFBOX-582) Ignoring text over images
[ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852028#action_12852028 ]

Ken Weinert commented on PDFBOX-582:
------------------------------------

This last comment fits with my experience. We frequently overlay transparent text on top of an image so that the user can select the text for copy/paste (I believe it's mode 3 text IIRC.) So it makes sense that, if PDFBox doesn't support that mode, the text will be visible.
[jira] Commented: (PDFBOX-582) Ignoring text over images
[ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851934#action_12851934 ]

Maruan Sahyoun commented on PDFBOX-582:
---------------------------------------

Hi - the issue with that document, and with others (e.g. those created by Adobe Acrobat), is that they use the text rendering mode "Neither fill nor stroke text (invisible)". As we currently do not support that mode but fall back to the text rendering mode "Fill text", the text is visible. I'm already working on a patch which implements the missing text rendering mode.
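For readers following the thread, the text rendering modes mentioned above can be sketched roughly as follows. The mode numbers 0 to 7 and their meanings come from the PDF specification (operator Tr); the enum, its constant names and its methods are hypothetical illustrations and are not the PDFBox API or the contents of PageDrawer.patch.

```java
/**
 * Sketch of the eight PDF text rendering modes (operand of the "Tr" operator).
 * Hypothetical type for illustration only; not part of PDFBox.
 */
enum TextRenderingMode {
    FILL(true, false),                  // 0: fill text
    STROKE(false, true),                // 1: stroke text
    FILL_THEN_STROKE(true, true),       // 2: fill, then stroke text
    NEITHER(false, false),              // 3: neither fill nor stroke (invisible)
    FILL_CLIP(true, false),             // 4: fill text and add to clipping path
    STROKE_CLIP(false, true),           // 5: stroke text and add to clipping path
    FILL_THEN_STROKE_CLIP(true, true),  // 6: fill, then stroke, and add to clipping path
    CLIP(false, false);                 // 7: add text to clipping path only

    private final boolean fill;
    private final boolean stroke;

    TextRenderingMode(boolean fill, boolean stroke) {
        this.fill = fill;
        this.stroke = stroke;
    }

    /** Maps the integer operand of Tr to a mode; the ordinal equals the mode number. */
    static TextRenderingMode fromOperand(int tr) {
        TextRenderingMode[] all = values();
        if (tr < 0 || tr >= all.length) {
            throw new IllegalArgumentException("Unknown text rendering mode: " + tr);
        }
        return all[tr];
    }

    /** True if a renderer should put visible glyph shapes on the page. */
    boolean paintsGlyphs() {
        return fill || stroke;
    }

    /** True if the glyph outlines are added to the clipping path (modes 4 to 7). */
    boolean addsToClip() {
        return ordinal() >= 4;
    }

    public static void main(String[] args) {
        for (TextRenderingMode mode : values()) {
            System.out.println(mode.ordinal() + " " + mode + " paintsGlyphs=" + mode.paintsGlyphs());
        }
    }
}
```

Under this sketch, a renderer that treats every mode as FILL would draw OCR overlay text written in mode 3, which matches the symptom described in this issue. A text extractor would ignore paintsGlyphs() altogether, since invisible text is still text.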
[jira] Commented: (PDFBOX-582) Ignoring text over images
[ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851885#action_12851885 ]

Andreas Lehmkühler commented on PDFBOX-582:
-------------------------------------------

I've recently resolved some issues (PDFBOX-574,PDFBOX-584,PDFBOX-665,PDFBOX-672) concerning the rendering of XObjectImage, such as 1-bit TIFFs. If possible try to use the most recent trunk version.
[jira] Commented: (PDFBOX-582) Ignoring text over images
[ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851883#action_12851883 ]

Igor Podolskiy commented on PDFBOX-582:
---------------------------------------

AFAIK there are no special provisions in PDFs and/or readers to handle those scanned documents, and I'm fairly familiar with the PDF format. There's text, and then there's an opaque image over it, that's all. It's image over text, not text over image, so there's nothing to ignore. I occasionally create such PDFs myself, for example with the hocr2pdf tool.

I can remember that I ran into this problem recently (PDFBox displaying both OCR text and images). I didn't have time to debug it to the end, but I think the problem was somehow related to my scanner producing 1-bit TIFFs and PDFBox's PageDrawer not displaying them correctly (what should be white appeared as transparent). The order was all right (image on top of text), but this transparency made it look reversed and confusing.

I'll try to find time today or tomorrow to recollect the stuff and post it here, but I definitely know that 1-bit images were somehow key to this.
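The "white appeared as transparent" symptom can be illustrated with plain java.awt.image, independent of PDFBox. The sketch below builds two 1-bit palettes: one fully opaque, one in which the white entry is marked transparent. The class and method names are hypothetical, and the thread does not state that this is exactly what PageDrawer did; it only shows how such a difference can arise in an IndexColorModel.

```java
import java.awt.image.IndexColorModel;

/** Hypothetical helper contrasting an opaque 1-bit palette with one whose white entry is transparent. */
final class OneBitPalette {
    // Palette index 0 = black, index 1 = white (same levels for red, green and blue).
    private static final byte[] LEVELS = {(byte) 0x00, (byte) 0xFF};

    /** Both palette entries are opaque, so a white pixel hides whatever lies beneath it. */
    static IndexColorModel opaque() {
        return new IndexColorModel(1, 2, LEVELS, LEVELS, LEVELS);
    }

    /** Index 1 (white) is declared the transparent pixel, so content beneath shows through. */
    static IndexColorModel whiteTransparent() {
        return new IndexColorModel(1, 2, LEVELS, LEVELS, LEVELS, 1);
    }

    public static void main(String[] args) {
        System.out.println("opaque palette, alpha of white: " + opaque().getAlpha(1));
        System.out.println("transparent-white palette, alpha of white: " + whiteTransparent().getAlpha(1));
    }
}
```

With the second palette, an OCR text layer drawn underneath a scanned page image would remain visible through every white pixel, which would make a correctly ordered page look as if the text sat on top.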
[jira] Commented: (PDFBOX-582) Ignoring text over images
[ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851877#action_12851877 ]

Michael Howard commented on PDFBOX-582:
---------------------------------------

Daniel is correct in that the vast majority of .pdf documents contain a mixture of images and text. However, I think that the point that Villu is making is that many/most .pdf docs that are generated through OCR scanning are treated differently. In my experience, these docs tend to display only the image. The underlying text is there for text selection and for searching, but the underlying OCR-generated text is not displayed. Note that the OCR error rate is frequently quite high, but since the page image is what you view/read/print, it is generally fine. The high-error OCR is better than nothing.

I have several .pdf docs from different scanner vendors. They function correctly on Acrobat Reader, Mac OS X Preview and Linux/Gnome Evince Viewer ... correctly in that only the image is rendered for display/printing.

PDFBox 1.0.0 displays these documents incorrectly in that the fonts are rendered over the top of the page image. This makes the documents unusable because the rendered font chars overlay the char images. Because of alignment and OCR-error issues these documents become unreadable in PDFBox.

I don't know much about the .pdf format, but I assume that there must be some indicator in the format which says that these font strings are not to be rendered.
Re: PageDrawer renders page twice
Hi,

Subject: PageDrawer renders page twice
Sent: Wed, 31 Mar 2010
From: Maruan Sahyoun

> Hi,
>
> during my debugging of PrintPDF I saw that text is printed twice e.g. all
> strings are printed by writeFont from the top of the page to the end and
> then again. Is that by design or should I start to look into why that is
> happening? An initial debugging showed that the processing already starts
> repeating in PageDrawer.processTextPosition()

AFAIU this is not a bug, it's a feature. The Pageable interface is used to print a PDDocument. The first pass is needed to precalculate some aspects of the document to be printed, such as the number of pages, page size etc., and the second pass is used for the real printing. So IMHO everything is ok.

BR
Andreas Lehmkühler
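The behaviour described above can be simulated without a printer. The java.awt.print documentation notes that the printing system may request the same page index more than once, so page-drawing code has to tolerate repeated calls. The sketch below is an assumption-laden simulation: CountingPrintable and simulatePasses are hypothetical names, the two passes are driven by a plain loop rather than by a real PrinterJob, and the page is drawn into an off-screen image.

```java
import java.awt.Graphics;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.print.PageFormat;
import java.awt.print.Printable;

/** Hypothetical Printable that counts how often its single page is drawn. */
final class CountingPrintable implements Printable {
    private int calls = 0;

    @Override
    public int print(Graphics g, PageFormat format, int pageIndex) {
        if (pageIndex > 0) {
            return NO_SUCH_PAGE;           // this sketch has exactly one page
        }
        calls++;                           // every pass re-runs the full drawing code
        g.drawRect(72, 72, 144, 72);       // stand-in for the real page content
        return PAGE_EXISTS;
    }

    int calls() {
        return calls;
    }

    /** Simulates a print system asking for page 0 several times; returns the observed call count. */
    static int simulatePasses(int passes) {
        CountingPrintable page = new CountingPrintable();
        BufferedImage canvas = new BufferedImage(612, 792, BufferedImage.TYPE_INT_RGB);
        PageFormat format = new PageFormat();
        for (int pass = 0; pass < passes; pass++) {
            Graphics2D g = canvas.createGraphics();
            page.print(g, format, 0);
            g.dispose();
        }
        return page.calls();
    }

    public static void main(String[] args) {
        System.out.println("print() calls for page 0: " + simulatePasses(2));
    }
}
```

Seen this way, a debugger breakpoint inside the drawing path would be hit once per pass, which is consistent with the repetition observed in PageDrawer.processTextPosition().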
PageDrawer renders page twice
Hi, during my debugging of PrintPDF I saw that text is printed twice e.g. all strings are printed by writeFont from the top of the page to the end and then again. Is that by design or should I start to look into why that is happening? An initial debugging showed that the processing already starts repeating in PageDrawer.processTextPosition() Kind regards Maruan Sahyoun
[jira] Commented: (PDFBOX-675) Upgrade .Net build to use IKVM version 0.42 - Opinions wanted
[ https://issues.apache.org/jira/browse/PDFBOX-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851862#action_12851862 ]

Daniel Wilson commented on PDFBOX-675:
--------------------------------------

This in from the main man behind IKVM: "0.42 requires .NET 2.0 SP1"

Nonetheless, .Net 2.0 SP1 or later is sufficiently ubiquitous that I'm comfortable with the requirement.

My testing of the font fix is still in progress ...

> Upgrade .Net build to use IKVM version 0.42 - Opinions wanted
> -------------------------------------------------------------
>
>                 Key: PDFBOX-675
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-675
>             Project: PDFBox
>          Issue Type: Improvement
>            Reporter: Daniel Wilson
>            Assignee: Daniel Wilson
>            Priority: Minor
>
> The current .Net build script (ant build.NET) is for IKVM 0.38, released 15
> months ago.
> Since that time, IKVM has grown to support a larger portion of the Java
> object model. I am currently investigating the possibility of improved font
> support, as our IKVM-compiled version crashes if PDType1CFont.prepareAWTFont
> is called.
> The downside of the upgrade will be loss of support for the .Net 1.1
> Framework. In my opinion, that is not a big deal as very few projects still
> rely on it.
> I welcome opinions before committing any changes.
[ANNOUNCE] Apache PDFBox 1.1.0 released
The Apache PDFBox community is pleased to announce the first release of Apache PDFBox version 1.1.0. The release is available for download at:

    http://pdfbox.apache.org/download.html

See the full release notes below for details about this release.

Release Notes -- Apache PDFBox -- Version 1.1.0

Introduction

PDFBox is an open source Java library for working with PDF documents.

This is an incremental feature release based on the earlier 1.0.0 release. Unlike previous PDFBox releases, this release also contains updated versions of the supporting FontBox and JempBox libraries. The other notable changes in this release include basic support for tagged PDF, various font handling improvements and better handling of CJK character sets.

For more details, please refer to the following issues on the PDFBox issue tracker at https://issues.apache.org/jira/browse/PDFBOX.

New Features

  [PDFBOX-7]   extract information from tagged PDF
  [PDFBOX-48]  Create a tagged PDF
  [PDFBOX-67]  Implement StructTreeRoot/StructTree classes in the PDModel
  [PDFBOX-636] Add decoded stream length to PDStream
  [PDFBOX-640] Add getter/setter for alternate field name (TU) to PDField

Improvements

  [PDFBOX-628] Too many detours in COSDictionary convenience methods
  [PDFBOX-630] Create PDDictionaryWrapper
  [PDFBOX-633] Add indexOfObject and removeObject methods with ...
  [PDFBOX-635] Fallback mechanism for broken CFF fonts
  [PDFBOX-643] Date conversion errors
  [PDFBOX-644] Move FontBox and JempBox under the same trunk with PDFBox
  [PDFBOX-646] Map the form space to user space if the optional form ...
  [PDFBOX-653] Document the missing command line tools
  [PDFBOX-654] Extracting CJK text
  [PDFBOX-655] Default character width should be used if width of a ...
  [PDFBOX-663] Ensuring non-null FontDescriptor for external TrueType fonts

Bug Fixes

  [PDFBOX-55]  Invalid character while extracting text from a chinese pdf
  [PDFBOX-116] PNG image page completely garbled
  [PDFBOX-259] support request chinese-traditional
  [PDFBOX-420] Japanese Characters are garbled.
  [PDFBOX-619] Adobe CFF/Type2 font encoding enhancements
  [PDFBOX-621] XMPSchema.getIntegerProperty does not return existing value
  [PDFBOX-624] Misplaced text
  [PDFBOX-632] Invalid page rendering while printing a PDF with an image ...
  [PDFBOX-634] CFF parsing failure
  [PDFBOX-637] problem with static code in COSInteger/COSNumber
  [PDFBOX-645] PDDocumentOutline should not have getParent()
  [PDFBOX-656] Typo: there is no DecodeParams value. The correct name is ...
  [PDFBOX-658] Fix typo in FontMapping.properties
  [PDFBOX-660] Applying FontMatrix scale factors to PDFont drawing operations
  [PDFBOX-666] Ensure the correct path direction when drawing a rectangle

Release Contents

This release consists of a single source archive packaged as a zip file. The archive can be unpacked with the jar tool from your JDK installation. See the README.txt file for instructions on how to build this release.

The source archive is accompanied by SHA1 and MD5 checksums and a PGP signature that you can use to verify the authenticity of your download. The public key used for the PGP signature can be found at https://svn.apache.org/repos/asf/pdfbox/KEYS.

About Apache PDFBox

Apache PDFBox is an open source Java library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. Apache PDFBox also includes several command line utilities. Apache PDFBox is published under the Apache License, Version 2.0.

For more information, visit http://pdfbox.apache.org/

About The Apache Software Foundation

Established in 1999, The Apache Software Foundation provides organizational, legal, and financial support for more than 100 freely-available, collaboratively-developed Open Source projects. The pragmatic Apache License enables individual and commercial users to easily deploy Apache software; the Foundation's intellectual property framework limits the legal exposure of its 2,500+ contributors.

For more information, visit http://www.apache.org/
[jira] Updated: (PDFBOX-676) Predefined paper sizes in PDPage are slightly off
[ https://issues.apache.org/jira/browse/PDFBOX-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maruan Sahyoun updated PDFBOX-676:
----------------------------------

    Description: PDPage predefines several paper sizes. The A paper sizes are slightly off by 2 to 3 millimeters. The patch fixes that. In addition to that, a new constructor is added that allows a paper size to be specified when creating a new page.
                 (was: PDModel predefines several paper sizes. The A paper sizes are slightly off by 2 to 3 millimeters. The patch fixes that. In addition to that, a new constructor is added that allows a paper size to be specified when creating a new page.)
        Summary: Predefined paper sizes in PDPage are slightly off
                 (was: Predefined paper sizes in PDModel are slightly off)

> Predefined paper sizes in PDPage are slightly off
> -------------------------------------------------
>
>                 Key: PDFBOX-676
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-676
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: PDModel
>            Reporter: Maruan Sahyoun
>            Priority: Minor
>         Attachments: PDPage.patch
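As background to the issue above, the conversion behind such paper-size constants can be sketched as follows. PDF default user space uses units of 1/72 inch and ISO 216 defines A4 as 210 x 297 mm; those two facts are standard. The class and method names are hypothetical, and the sketch does not reproduce the values in PDPage or in PDPage.patch, which are not quoted in this thread.

```java
import java.util.Locale;

/** Hypothetical helper converting between millimeters and PDF user-space points (1/72 inch). */
final class PaperSizes {
    private static final double POINTS_PER_INCH = 72.0;
    private static final double MM_PER_INCH = 25.4;

    static double mmToPoints(double mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    static double pointsToMm(double points) {
        return points / POINTS_PER_INCH * MM_PER_INCH;
    }

    public static void main(String[] args) {
        // ISO 216 A4 is 210 mm x 297 mm.
        System.out.printf(Locale.ROOT, "A4: %.2f x %.2f pt%n", mmToPoints(210), mmToPoints(297));
        // prints "A4: 595.28 x 841.89 pt"
    }
}
```

Because one millimeter is about 2.83 points, a page constant that is off by 2 to 3 mm differs from the exact value by roughly 6 to 9 points, which is easy to check with pointsToMm.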
[jira] Updated: (PDFBOX-676) Predefined paper sizes in PDModel are slightly off
[ https://issues.apache.org/jira/browse/PDFBOX-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maruan Sahyoun updated PDFBOX-676:
----------------------------------

    Attachment: PDPage.patch

Patch to correct paper sizes in PDPage
[jira] Created: (PDFBOX-676) Predefined paper sizes in PDModel are slightly off
Predefined paper sizes in PDModel are slightly off
--------------------------------------------------

                 Key: PDFBOX-676
                 URL: https://issues.apache.org/jira/browse/PDFBOX-676
             Project: PDFBox
          Issue Type: Improvement
          Components: PDModel
            Reporter: Maruan Sahyoun
            Priority: Minor

PDModel predefines several paper sizes. The A paper sizes are slightly off by 2 to 3 millimeters. The patch fixes that. In addition to that, a new constructor is added that allows a paper size to be specified when creating a new page.