[jira] Updated: (PDFBOX-582) Ignoring text over images

2010-03-31 Thread Maruan Sahyoun (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maruan Sahyoun updated PDFBOX-582:
--

Attachment: PageDrawer.patch

The patch adds a basic implementation of 
PDTextState.RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT to support 
applications where text is included invisibly in a PDF as part of an OCR 
result.

A more generic approach still needs to be implemented to fully support the 
different text rendering modes.
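
As a rough illustration of what fuller support would involve: the PDF specification defines eight text rendering modes (operands 0 to 7 of the Tr operator), each a combination of fill, stroke and add-to-clipping-path. The sketch below is not the attached patch and not PDFBox API. The enum and its member names are hypothetical; only the numeric codes and their meanings are taken from the specification.

```java
// Hypothetical sketch: not the PDFBox API and not the attached patch.
// Codes 0-7 are the operands of the PDF "Tr" (text rendering mode) operator.
enum TextRenderingMode {
    FILL(0, true, false, false),
    STROKE(1, false, true, false),
    FILL_THEN_STROKE(2, true, true, false),
    NEITHER_FILL_NOR_STROKE(3, false, false, false), // invisible text
    FILL_AND_CLIP(4, true, false, true),
    STROKE_AND_CLIP(5, false, true, true),
    FILL_THEN_STROKE_AND_CLIP(6, true, true, true),
    CLIP_ONLY(7, false, false, true);

    final int code;
    final boolean fills;
    final boolean strokes;
    final boolean clips;

    TextRenderingMode(int code, boolean fills, boolean strokes, boolean clips) {
        this.code = code;
        this.fills = fills;
        this.strokes = strokes;
        this.clips = clips;
    }

    /** Looks up a mode by its Tr operand; rejects values outside 0-7. */
    static TextRenderingMode fromCode(int code) {
        for (TextRenderingMode mode : values()) {
            if (mode.code == code) {
                return mode;
            }
        }
        throw new IllegalArgumentException("Unknown text rendering mode: " + code);
    }

    /** True if glyphs leave visible marks on the page in this mode. */
    boolean paintsGlyphs() {
        return fills || strokes;
    }
}
```

A renderer built on such a table could skip painting whenever paintsGlyphs() returns false, while text extraction would still see the text.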

> Ignoring text over images
> -
>
> Key: PDFBOX-582
> URL: https://issues.apache.org/jira/browse/PDFBOX-582
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction, Utilities
> Affects Versions: 0.8.0-incubator
> Reporter: Villu Ruusmann
> Attachments: PageDrawer.patch, pg_0005.pdf, pg_0005.png
>
>
> Scientific publishers often publish older articles (year 2000 and earlier) in 
> scanned form. However, sometimes they seem to have run OCR and added 
> the recovered text as an overlay to give the end user a "native PDF" 
> feel, in the sense that it is possible to copy and paste text.
> PDFBox differs from other PDF viewers (tested with Adobe Acrobat Reader 7.0, 
> Foxit Reader 3.1, iText 2.1) in that it tries to render both the image part 
> and the textual overlay part, which may produce confusing results.
> Actually, there are two separate cases:
> *) Page rendering (class org.apache.pdfbox.pdfviewer.PageDrawer): Render the 
> image part and ignore the text part.
> *) Text extraction (class org.apache.pdfbox.util.PDFTextStripper): Ignore the 
> image part and work upon the text part.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PDFBOX-582) Ignoring text over images

2010-03-31 Thread Igor Podolskiy (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852034#action_12852034
 ] 

Igor Podolskiy commented on PDFBOX-582:
---

Indeed, the issue I was suspecting was fixed by Andreas (at least the code 
seems changed in the right direction with the IndexColorModels and all).




[jira] Commented: (PDFBOX-582) Ignoring text over images

2010-03-31 Thread Ken Weinert (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852028#action_12852028
 ] 

Ken Weinert commented on PDFBOX-582:


This last comment fits with my experience. We frequently overlay transparent 
text on top of an image so that the user can select the text for copy/paste (I 
believe it's mode 3 text IIRC.)

So it makes sense that if PDFBox doesn't support that mode that the text will 
be visible.
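
For reference, mode 3 is selected in a page content stream with the "3 Tr" operator inside a text object. A minimal sketch of such a text object follows, assuming a font resource named F1 already exists on the page and the text needs no special font encoding. The class and method names are made up for illustration and do not reflect how any particular OCR tool or PDFBox writes its output.

```java
// Hypothetical helper for illustration; names are not from PDFBox or any OCR tool.
final class InvisibleTextLayer {

    /** Escapes the characters that are special inside a PDF literal string. */
    static String escapePdfString(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c == '(' || c == ')' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Builds a text object whose glyphs are neither filled nor stroked (3 Tr). */
    static String invisibleText(String fontName, int fontSize, int x, int y, String text) {
        return "BT\n"
                + "/" + fontName + " " + fontSize + " Tf\n" // select font and size
                + "3 Tr\n"                                   // rendering mode 3: invisible
                + x + " " + y + " Td\n"                      // move to the text position
                + "(" + escapePdfString(text) + ") Tj\n"     // show the string
                + "ET\n";
    }

    public static void main(String[] args) {
        System.out.print(invisibleText("F1", 12, 72, 720, "scan (p. 5)"));
    }
}
```

A viewer that honours the Tr operand draws nothing for this text object but can still offer the string for selection and search.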





[jira] Commented: (PDFBOX-582) Ignoring text over images

2010-03-31 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851934#action_12851934
 ] 

Maruan Sahyoun commented on PDFBOX-582:
---

Hi - the issue with that document - and with others e.g. created by Adobe 
Acrobat - is that they use text rendering "Neither fill nor stroke text 
(invisible)." As we currently do not support that but fall back to text 
rendering "Fill text." the text is visible. 

I'm already working on a patch which implements the missing text 
rendering mode.




[jira] Commented: (PDFBOX-582) Ignoring text over images

2010-03-31 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851885#action_12851885
 ] 

Andreas Lehmkühler commented on PDFBOX-582:
---

I've recently resolved some issues 
(PDFBOX-574,PDFBOX-584,PDFBOX-665,PDFBOX-672) concerning the rendering of 
XObjectImage, such as 1-bit TIFFs. If possible try to use the most recent trunk 
version.




[jira] Commented: (PDFBOX-582) Ignoring text over images

2010-03-31 Thread Igor Podolskiy (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851883#action_12851883
 ] 

Igor Podolskiy commented on PDFBOX-582:
---

AFAIK there are no special provisions in PDFs and/or readers to handle those 
scanned documents, although I'm fairly familiar with the PDF format. There's 
text, and then there's an opaque image over it, that's all. It's image over 
text, not text over image, so there's nothing to ignore. I occasionally create 
such PDFs myself, for example with the hocr2pdf tool.

I can remember that I ran into this problem recently (PDFBox displaying both 
OCR text and images). I didn't have time to debug it to the end, but I think 
the problem was somehow related to my scanner producing 1-bit TIFFs and PDFBox' 
PageDrawer not displaying them correctly (what should be white appeared as 
transparent). The order was all right (image on top of text), but this 
transparency made it look reversed and confusing.

I'll try to find time today or tomorrow to recollect the stuff and post it 
here, but I definitely know that 1-bit images were somehow key to this.
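
To illustrate how a 1-bit image can end up with see-through white in Java 2D, here is a minimal sketch using java.awt.image.IndexColorModel. It does not reproduce the PDFBox code path and is not a diagnosis of it; the class and method names are hypothetical. It only shows that the same two-entry palette is opaque or partly transparent depending on whether a transparent index is passed to the constructor.

```java
import java.awt.Transparency;
import java.awt.image.BufferedImage;
import java.awt.image.IndexColorModel;

// Hypothetical demo class; it does not mirror PDFBox internals.
final class OneBitImageDemo {
    // Two-entry grey palette: index 0 = black, index 1 = white.
    private static final byte[] LEVELS = {(byte) 0x00, (byte) 0xFF};

    /** Black/white palette in which both entries are fully opaque. */
    static IndexColorModel opaqueBilevel() {
        return new IndexColorModel(1, 2, LEVELS, LEVELS, LEVELS);
    }

    /** Same palette, but index 1 (white) is declared as the transparent pixel. */
    static IndexColorModel whiteAsTransparent() {
        return new IndexColorModel(1, 2, LEVELS, LEVELS, LEVELS, 1);
    }

    /** Alpha (0-255) that a pixel holding palette index 1 gets under the given model. */
    static int alphaOfWhitePixel(IndexColorModel model) {
        BufferedImage image = new BufferedImage(1, 1, BufferedImage.TYPE_BYTE_BINARY, model);
        image.getRaster().setSample(0, 0, 0, 1); // store palette index 1 at (0, 0)
        return image.getRGB(0, 0) >>> 24;        // top byte of ARGB is alpha
    }

    public static void main(String[] args) {
        System.out.println("opaque model alpha:      " + alphaOfWhitePixel(opaqueBilevel()));
        System.out.println("transparent model alpha: " + alphaOfWhitePixel(whiteAsTransparent()));
        System.out.println("opaque model is OPAQUE:  "
                + (opaqueBilevel().getTransparency() == Transparency.OPAQUE));
    }
}
```

With the second model, anything drawn earlier (such as the OCR text layer) stays visible wherever the scan is white, which matches the symptom described above.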




[jira] Commented: (PDFBOX-582) Ignoring text over images

2010-03-31 Thread Michael Howard (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851877#action_12851877
 ] 

Michael Howard commented on PDFBOX-582:
---

Daniel is correct in that the vast majority of .pdf documents contain a mixture 
of images and text. 

However, I think that the point that Villu is making is that many/most .pdf 
docs that are generated through OCR scanning are treated differently. In my 
experience, these docs tend to display only the image. The underlying text is 
there for text selection and for searching, but the underlying OCR-generated 
text is not displayed. 

Note that the OCR error-rate is frequently quite high, but since the page image 
is what you view/read/print then it is generally fine. The high-error OCR is 
better than nothing. 

I have several .pdf docs from different scanner vendors. They function 
correctly on Acrobat Reader, Mac OS X Preview and Linux/Gnome Evince Viewer ... 
correctly in that only the image is rendered for display/printing. 

PDFBox 1.0.0 displays these documents incorrectly in that the fonts are 
rendered over the top of the page image. This makes the documents unusable 
because the rendered font chars overlay the char images. Because of alignment 
and OCR-error issues these documents become unreadable in PDFBox. 

I don't know much about the .pdf format, but I assume that there must be some 
indicator in the format which says that these font strings are not to be 
rendered. 




Re: PageDrawer renders page twice

2010-03-31 Thread Andreas Lehmkühler
Hi,

Subject: PageDrawer renders page twice
Sent: Wed, 31 Mar 2010
From: Maruan Sahyoun

> Hi,
> 
> during my debugging of PrintPDF I saw that text is printed twice e.g. all
> strings are printed by writeFont from the top of the page to the end and
> then again. Is that by design or should I start to look into why that is
> happening? An initial debugging showed that the processing already starts
> repeating in PageDrawer.processTextPosition()
AFAIU this is not a bug, it's a feature. The Pageable interface is used to 
print a PDDocument. The first pass is needed to precalculate some aspects of 
the document to be printed, such as the number of pages, pagesize etc. and the 
second pass is used for the real printing. So IMHO everything is ok.
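
The Java printing API itself allows this: the documentation of java.awt.print.Printable states that the printing system may request the same page index more than once. A small sketch of a Printable that counts its invocations follows. The class is hypothetical and the "passes" are simulated by drawing into an off-screen image rather than by a real print job, so it says nothing about how many passes a given printer driver or PDFBox version actually makes.

```java
import java.awt.Graphics;
import java.awt.image.BufferedImage;
import java.awt.print.PageFormat;
import java.awt.print.Printable;

// Hypothetical demo class, not part of PDFBox.
final class CountingPrintable implements Printable {
    private final int pageCount;
    private int calls;

    CountingPrintable(int pageCount) {
        this.pageCount = pageCount;
    }

    /** Number of times print() has been invoked so far. */
    int calls() {
        return calls;
    }

    @Override
    public int print(Graphics g, PageFormat format, int pageIndex) {
        calls++;
        if (pageIndex < 0 || pageIndex >= pageCount) {
            return NO_SUCH_PAGE;
        }
        // Stand-in for real page content; drawn identically on every call.
        g.fillRect((int) format.getImageableX(), (int) format.getImageableY(), 50, 50);
        return PAGE_EXISTS;
    }

    /** Calls print() repeatedly for one page, as a printing system is allowed to do. */
    static int[] simulatePasses(CountingPrintable printable, int pageIndex, int passes) {
        PageFormat format = new PageFormat();
        int[] results = new int[passes];
        for (int i = 0; i < passes; i++) {
            BufferedImage canvas = new BufferedImage(64, 64, BufferedImage.TYPE_INT_RGB);
            Graphics g = canvas.createGraphics();
            try {
                results[i] = printable.print(g, format, pageIndex);
            } finally {
                g.dispose();
            }
        }
        return results;
    }

    public static void main(String[] args) {
        CountingPrintable page = new CountingPrintable(1);
        simulatePasses(page, 0, 2);
        System.out.println("print() was called " + page.calls() + " times for page 0");
    }
}
```

The practical consequence is that print() should produce the same output for the same page index on every call and should not rely on being invoked exactly once.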


BR
Andreas Lehmkühler


PageDrawer renders page twice

2010-03-31 Thread Maruan Sahyoun
Hi,

during my debugging of PrintPDF I saw that text is printed twice e.g. all 
strings are printed by writeFont from the top of the page to the end and then 
again. Is that by design or should I start to look into why that is happening? 
An initial debugging showed that the processing already starts repeating in 
PageDrawer.processTextPosition()

Kind regards

Maruan Sahyoun


[jira] Commented: (PDFBOX-675) Upgrade .Net build to use IKVM version 0.42 - Opinions wanted

2010-03-31 Thread Daniel Wilson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851862#action_12851862
 ] 

Daniel Wilson commented on PDFBOX-675:
--

This in from the main man behind IKVM:
"0.42 requires .NET 2.0 SP1"

Nonetheless, .Net 2.0 SP1 or later is sufficiently ubiquitous that I'm 
comfortable with the requirement.  My testing of the font fix is still in 
progress ...

> Upgrade .Net build to use IKVM version 0.42 - Opinions wanted
> -
>
> Key: PDFBOX-675
> URL: https://issues.apache.org/jira/browse/PDFBOX-675
> Project: PDFBox
> Issue Type: Improvement
> Reporter: Daniel Wilson
> Assignee: Daniel Wilson
> Priority: Minor
>
> The current .Net build script (ant build.NET) is for IKVM 0.38, released 15 
> months ago.
> Since that time, IKVM has grown to support a larger portion of the Java 
> object model.  I am currently investigating the possibility of improved font 
> support, as our IKVM-compiled version crashes if PDType1CFont.prepareAWTFont 
> is called.
> The downside of the upgrade will be loss of support for the .Net 1.1 
> Framework.  In my opinion, that is not a big deal as very few projects still 
> rely on it.
> I welcome opinions before committing any changes.




[ANNOUNCE] Apache PDFBox 1.1.0 released

2010-03-31 Thread Jukka Zitting
The Apache PDFBox community is pleased to announce the release
of Apache PDFBox version 1.1.0. The release is available for download
at:

   http://pdfbox.apache.org/download.html

See the full release notes below for details about this release.


Release Notes -- Apache PDFBox -- Version 1.1.0

Introduction


PDFBox is an open source Java library for working with PDF documents.

This is an incremental feature release based on the earlier 1.0.0 release.
Unlike previous PDFBox releases, this release also contains updated versions
of the supporting FontBox and JempBox libraries. The other notable changes
in this release include basic support for tagged PDF, various font handling
improvements and better handling of CJK character sets. For more details,
please refer to the following issues on the PDFBox issue tracker at
https://issues.apache.org/jira/browse/PDFBOX.

New Features

  [PDFBOX-7]   extract information from tagged PDF
  [PDFBOX-48]  Create a tagged PDF
  [PDFBOX-67]  Implement StructTreeRoot/StructTree classes in the PDModel
  [PDFBOX-636] Add decoded stream length to PDStream
  [PDFBOX-640] Add getter/setter for alternate field name (TU) to PDField

Improvements

  [PDFBOX-628] Too many detours in COSDictionary convenience methods
  [PDFBOX-630] Create PDDictionaryWrapper
  [PDFBOX-633] Add indexOfObject and removeObject methods with ...
  [PDFBOX-635] Fallback mechanism for broken CFF fonts
  [PDFBOX-643] Date conversion errors
  [PDFBOX-644] Move FontBox and JempBox under the same trunk with PDFBox
  [PDFBOX-646] Map the form space to user space if the optional form ...
  [PDFBOX-653] Document the missing command line tools
  [PDFBOX-654] Extracting CJK text
  [PDFBOX-655] Default character width should be used if width of a ...
  [PDFBOX-663] Ensuring non-null FontDescriptor for external TrueType fonts

Bug Fixes

  [PDFBOX-55]  Invalid character while extracting text from a chinese pdf
  [PDFBOX-116] PNG image page completely garbled
  [PDFBOX-259] support request chinese-traditional
  [PDFBOX-420] Japanese Characters are garbled.
  [PDFBOX-619] Adobe CFF/Type2 font encoding enhancements
  [PDFBOX-621] XMPSchema.getIntegerProperty does not return existing value
  [PDFBOX-624] Misplaced text
  [PDFBOX-632] Invalid page rendering while printing a PDF with an image ...
  [PDFBOX-634] CFF parsing failure
  [PDFBOX-637] problem with static code in COSInteger/COSNumber
  [PDFBOX-645] PDDocumentOutline should not have getParent()
  [PDFBOX-656] Typo: there is no DecodeParams value. The correct name is ...
  [PDFBOX-658] Fix typo in FontMapping.properties
  [PDFBOX-660] Applying FontMatrix scale factors to PDFont drawing operations
  [PDFBOX-666] Ensure the correct path direction when drawing a rectangle

Release Contents


This release consists of a single source archive packaged as a zip file.
The archive can be unpacked with the jar tool from your JDK installation.
See the README.txt file for instructions on how to build this release.

The source archive is accompanied by SHA1 and MD5 checksums and a PGP
signature that you can use to verify the authenticity of your download.
The public key used for the PGP signature can be found at
https://svn.apache.org/repos/asf/pdfbox/KEYS.
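
As a sketch of the checksum part of that verification, the following uses the JDK's java.security.MessageDigest to compute SHA-1 and MD5 digests and compare them with a published hex string. The class and method names are hypothetical. The example hashes the ASCII bytes of "abc" rather than a real archive, and it does not cover the PGP signature, which needs a separate tool.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not shipped with PDFBox.
final class ChecksumCheck {

    /** Returns the lower-case hex digest of the data for "SHA-1", "MD5", etc. */
    static String hexDigest(String algorithm, byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance(algorithm).digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(algorithm + " is not available", e);
        }
    }

    /** Compares the computed digest with a published one, ignoring case and padding. */
    static boolean matches(String algorithm, byte[] data, String publishedHex) {
        return hexDigest(algorithm, data).equalsIgnoreCase(publishedHex.trim());
    }

    public static void main(String[] args) {
        byte[] sample = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println("SHA-1: " + hexDigest("SHA-1", sample));
        System.out.println("MD5:   " + hexDigest("MD5", sample));
    }
}
```

For a downloaded archive, the byte array would come from something like java.nio.file.Files.readAllBytes and the expected value from the checksum file published next to the archive.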

About Apache PDFBox
---

Apache PDFBox is an open source Java library for working with PDF documents.
This project allows creation of new PDF documents, manipulation of existing
documents and the ability to extract content from documents. Apache PDFBox
also includes several command line utilities. Apache PDFBox is published
under the Apache License, Version 2.0.

For more information, visit http://pdfbox.apache.org/

About The Apache Software Foundation


Established in 1999, The Apache Software Foundation provides organizational,
legal, and financial support for more than 100 freely-available,
collaboratively-developed Open Source projects. The pragmatic Apache License
enables individual and commercial users to easily deploy Apache software;
the Foundation's intellectual property framework limits the legal exposure
of its 2,500+ contributors.

For more information, visit http://www.apache.org/


[jira] Updated: (PDFBOX-676) Predefined paper sizes in PDPage are slightly off

2010-03-31 Thread Maruan Sahyoun (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maruan Sahyoun updated PDFBOX-676:
--

Description: PDPage predefines several paper sizes. The A paper sizes are 
slightly off by 2 to 3 millimeters. The patch fixes that. In addition, a 
new constructor is added that allows a paper size to be specified when creating a new 
page.  (was: PDModel predefines several paper sizes. The A paper sizes are 
slightly off by 2 to 3 millimeters. The patch fixes that. In addition, a 
new constructor is added that allows a paper size to be specified when creating a new 
page.)
Summary: Predefined paper sizes in PDPage are slightly off  (was: 
Predefined paper sizes in PDModel are slightly off)

> Predefined paper sizes in PDPage are slightly off
> -
>
> Key: PDFBOX-676
> URL: https://issues.apache.org/jira/browse/PDFBOX-676
> Project: PDFBox
> Issue Type: Improvement
> Components: PDModel
> Reporter: Maruan Sahyoun
> Priority: Minor
> Attachments: PDPage.patch
>
>
> PDPage predefines several paper sizes. The A paper sizes are slightly off by 
> 2 to 3 millimeters. The patch fixes that. In addition to that a new 
> constructor is added allowing to specify a paper size when creating a new 
> page.




[jira] Updated: (PDFBOX-676) Predefined paper sizes in PDModel are slightly off

2010-03-31 Thread Maruan Sahyoun (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maruan Sahyoun updated PDFBOX-676:
--

Attachment: PDPage.patch

Patch to correct paper sizes in PDPage




[jira] Created: (PDFBOX-676) Predefined paper sizes in PDModel are slightly off

2010-03-31 Thread Maruan Sahyoun (JIRA)
Predefined paper sizes in PDModel are slightly off
--

 Key: PDFBOX-676
 URL: https://issues.apache.org/jira/browse/PDFBOX-676
 Project: PDFBox
 Issue Type: Improvement
 Components: PDModel
 Reporter: Maruan Sahyoun
 Priority: Minor


PDModel predefines several paper sizes. The A paper sizes are slightly off by 2 
to 3 millimeters. The patch fixes that. In addition, a new constructor is 
added that allows a paper size to be specified when creating a new page.
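
For context, PDF page dimensions are expressed in default user space units of 1/72 inch, so a paper size given in millimeters has to be converted with the factor 72 / 25.4. The sketch below shows that conversion together with the ISO 216 A-series sizes in whole millimeters. It is not the attached patch; the class and method names are hypothetical, and the values previously defined in PDPage are not reproduced here.

```java
// Hypothetical helper, not the attached PDPage.patch.
final class PaperSizes {
    static final double POINTS_PER_INCH = 72.0; // PDF default user space unit is 1/72 inch
    static final double MM_PER_INCH = 25.4;

    static double mmToPoints(double mm) {
        return mm * POINTS_PER_INCH / MM_PER_INCH;
    }

    static double pointsToMm(double points) {
        return points * MM_PER_INCH / POINTS_PER_INCH;
    }

    /** ISO 216 A-series size in whole millimeters as {width, height}, for n from 0 to 10. */
    static int[] aSeriesMm(int n) {
        if (n < 0 || n > 10) {
            throw new IllegalArgumentException("A-series index must be 0..10: " + n);
        }
        int width = 841;   // A0 width in mm
        int height = 1189; // A0 height in mm
        for (int i = 0; i < n; i++) {
            int halved = height / 2; // halve the long side, rounding down
            height = width;
            width = halved;
        }
        return new int[] {width, height};
    }

    public static void main(String[] args) {
        int[] a4 = aSeriesMm(4);
        System.out.println("A4: " + a4[0] + " x " + a4[1] + " mm = "
                + mmToPoints(a4[0]) + " x " + mmToPoints(a4[1]) + " pt");
    }
}
```

For A4 (210 x 297 mm) this gives roughly 595.28 x 841.89 points, so a deviation of 2 to 3 millimeters corresponds to several points in either dimension.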
