Re: PDF Text Highlight

2013-07-26 Thread Fred Hansen
Caveat: I've not tried this, nor anything like it. I am answering because 
figuring out how to do it was a challenge.

Presumably your program has variables 'page' and 'document' identifying where 
the rectangle goes, and variables llx, lly, w, and h delimiting the rectangle 
(lower-left corner, width, and height).

Here's some code that might work.  (UNTESTED)


// first construct a stream that draws a yellow rectangle
//  at the desired coordinates, but on a temporary page
PDPage tempPage = new PDPage();
PDPageContentStream tempStream = new PDPageContentStream(document, tempPage);
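tempStream.saveGraphicsState();    // so the fill color does not leak into the page's own content
tempStream.setNonStrokingColor(255, 255, 0);    // yellow; (0,255,255) would be cyan
tempStream.fillRect(llx, lly, w, h);   //  where to put rect
tempStream.restoreGraphicsState();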
tempStream.close();

// now get a handle on the stream (I hope it is not an array)
PDStream yellowStream = tempPage.getContents();

// get the contents of the page 
COSDictionary dict = page.getCOSDictionary();
COSBase pageStream = dict.getDictionaryObject("Contents");

// make sure the contents are a COSArray
COSArray pageStreamArray;
if (pageStream instanceof COSStream) {
    pageStreamArray = new COSArray();
    pageStreamArray.add(pageStream);
    dict.setItem("Contents", pageStreamArray);
}
else pageStreamArray = (COSArray)pageStream;

// now we add yellowStream at the front of page.getContents()
//   (in front so text is later drawn on top of it)

pageStreamArray.add(0, yellowStream.getStream());   // COSArray holds COSBase objects, so unwrap the PDStream
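
For reference, the stream built above should boil down to a handful of PDF content-stream operators. The following is a minimal, stdlib-only sketch of what such a stream contains; it does not use PDFBox, and the class and method names (HighlightOps, filledRect) are made up for illustration. It assumes the usual operators: q/Q to save and restore the graphics state, rg to set the non-stroking RGB color (components scaled to 0..1), re to append a rectangle, and f to fill it. Yellow is full red and green with no blue.

```java
import java.util.Locale;

// Hypothetical helper, not part of PDFBox: builds the raw operators
// for a filled rectangle so the effect of the stream is easy to inspect.
class HighlightOps {

    // r, g, b are 0..255; llx/lly is the lower-left corner in user space.
    static String filledRect(int r, int g, int b,
                             float llx, float lly, float w, float h) {
        return String.format(Locale.ROOT,
            "q\n"                          // save graphics state
            + "%.3f %.3f %.3f rg\n"        // non-stroking color, scaled to 0..1
            + "%.2f %.2f %.2f %.2f re\n"   // rectangle path
            + "f\n"                        // fill the path
            + "Q\n",                       // restore graphics state
            r / 255f, g / 255f, b / 255f, llx, lly, w, h);
    }

    public static void main(String[] args) {
        // Example values only: a 120 x 14 point box near the top of a page.
        System.out.print(filledRect(255, 255, 0, 72f, 700f, 120f, 14f));
    }
}
```

Wrapping the drawing in q ... Q matters because this stream is placed in front of the page's own content: without it, the yellow fill color would still be in effect when the original content starts drawing.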



>From: Alin Mazilu
>To: dev@pdfbox.apache.org
>Sent: Friday, July 26, 2013 12:33 PM
>Subject: PDF Text Highlight
>
>Hello all,
>
>I have a bit of a situation on my hands. Here it is: I have a bunch of PDF
>files sitting in a folder somewhere. What I have to do is search all of
>them for certain names and highlight those names with a yellow marker-like
>background and then I have to send all PDFs to a printer.
>
>I have done the searching and text extraction and the printing, but for the
>life of me, I can't figure out how to do the highlighting. What makes it
>even harder is that I have hundreds of these PDFs per day and human
>interaction is out of the question. It has to be a push of a button.
>
>Any ideas? I appreciate it.
>
>Alin Mazilu

[jira] [Commented] (PDFBOX-1511) pdfMerger App produces Garbage

2013-07-26 Thread Kirk Haines (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13721174#comment-13721174 ]

Kirk Haines commented on PDFBOX-1511:
-------------------------------------

I have also experienced this (Windows 7, Java 1.6.0_35-b10 64-bit) in PDFBox 
1.7.1 thru the current trunk.  I tried Maruan's suggestion and it resolved the 
issue, at the expense of creating unnecessary duplicate resources.  I had 
noticed that the corruption in subsequent documents resulted in those pages 
having their formatting preserved, but the text content had many letters 
substituted (all 'd' replaced by 'f', all 'y' replaced by 'd', etc.)  I also 
found that the degree of corruption depended on how similar the beginning text 
content of each input document was.  When there was a common header in the 
documents being merged, there were only a few substitutions.  When it was 
merging a document with itself, there were no errors.  When the document header 
was very different, the resulting text was undecipherable garbage.  This made 
me suspect that it may be a problem with the deflate compression being applied 
to the stream.  I thought that it might be using the (compression) dictionary 
from the first document and copying the physical bytes from the source document 
rather than reading the logical bytes and allowing the deflate filter in 
the context of the destination document to re-encode them.
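
To make the distinction between "physical" and "logical" bytes concrete, here is a small stdlib-only sketch using java.util.zip. It is not PDFBox's merge code and does not confirm the diagnosis above; the class and method names (LogicalCopy, copyLogically) are invented for illustration. It shows only that the deflated bytes differ from the decoded content, and what copying a stream "logically" (decode, then re-encode for the destination) would look like.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration, not PDFBox code.
class LogicalCopy {

    // Logical bytes -> physical (deflate-compressed) bytes.
    static byte[] deflate(byte[] logical) {
        Deflater d = new Deflater();
        d.setInput(logical);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!d.finished()) {
            out.write(buf, 0, d.deflate(buf));
        }
        d.end();
        return out.toByteArray();
    }

    // Physical bytes -> logical bytes.
    static byte[] inflate(byte[] physical) {
        Inflater i = new Inflater();
        i.setInput(physical);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!i.finished()) {
                int n = i.inflate(buf);
                if (n == 0 && (i.needsInput() || i.needsDictionary())) {
                    throw new IllegalStateException("truncated or dictionary-dependent data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not valid deflate data", e);
        } finally {
            i.end();
        }
        return out.toByteArray();
    }

    // A "logical" copy: decode the source stream, then let the
    // destination side compress the content again from scratch.
    static byte[] copyLogically(byte[] sourcePhysical) {
        return deflate(inflate(sourcePhysical));
    }

    public static void main(String[] args) {
        byte[] logical = "BT /F1 12 Tf (Hello World) Tj ET"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] physical = deflate(logical);
        System.out.println("physical differs from logical: "
                + !Arrays.equals(physical, logical));
        System.out.println("logical copy decodes to the original: "
                + Arrays.equals(inflate(copyLogically(physical)), logical));
    }
}
```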

> pdfMerger App produces Garbage
> ------------------------------
>
>                 Key: PDFBOX-1511
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1511
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 1.7.1
>         Environment: Win XP; Windows Server 2008 R2; java version "1.6.0_21",
>            Reporter: Michael Huber
>         Attachments: 1.pdf, 2.pdf, PdfRenderer.java, targetPdfMergeJava.pdf, targetPdfMergeUtilityApp.pdf
>
>
> pdfbox Utility pdfMerger produces a merged document containing garbage. All 
> merged pdf files are contained but Strings are destroyed.
> The source pdf files are created with graphviz and are readable without error 
> or disturbance both with Acrobat X and pdfbox pdfDebug Utility.
> Another astounding thing is that a handcoded merger using the 
> pdfMergerUtility class works fine when run within Eclipse Juno but creates 
> the same garbage when run from the command line (please see attached source).
> I checked everything that comes to mind to find the differences, e.g. Java 
> version, encoding/codepage issues, memory settings, and found nothing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


PDF Text Highlight

2013-07-26 Thread Alin Mazilu
Hello all,

I have a bit of a situation on my hands. Here it is: I have a bunch of PDF
files sitting in a folder somewhere. What I have to do is search all of
them for certain names and highlight those names with a yellow marker-like
background and then I have to send all PDFs to a printer.

I have done the searching and text extraction and the printing, but for the
life of me, I can't figure out how to do the highlighting. What makes it
even harder is that I have hundreds of these PDFs per day and human
interaction is out of the question. It has to be a push of a button.

Any ideas? I appreciate it.

Alin Mazilu


[jira] [Created] (PDFBOX-1674) Preflight doesn't correctly parse PDF if obj identifier not followed by line terminator

2013-07-26 Thread Johan van der Knijff (JIRA)
Johan van der Knijff created PDFBOX-1674:


             Summary: Preflight doesn't correctly parse PDF if obj identifier not followed by line terminator
                 Key: PDFBOX-1674
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1674
             Project: PDFBox
          Issue Type: Bug
          Components: Preflight
    Affects Versions: 2.0.0
         Environment: Win 7
            Reporter: Johan van der Knijff
            Priority: Minor
             Fix For: 2.0.0


For some test files on the Adobe Acrobat Engineering website, Preflight output 
looks like this:


  210
  false

    1.0
    Syntax error, Expected pattern 'obj but missed at character 'o'

    1.2.1
    Body Syntax error, Expected pattern 'obj but missed at character 'o'

    1.2.1
    Body Syntax error, Single space expected


Which suggests that Preflight doesn't correctly parse the objects. This is 
confirmed by a look at some of the offending PDFs in a hex editor, which 
reveals that the object identifiers in them are not terminated by any EOL 
character(s). AFAIK this is allowed in both PDF and PDF/A-1. More details + 
links to test files here ('Multimedia' table and below):

http://www.openplanetsfoundation.org/blogs/2013-07-25-identification-pdf-preservation-risks-sequel
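
As a rough illustration of the parsing question, here is a stdlib-only sketch of an object-header matcher that does not require an EOL after the obj keyword. It is not Preflight's parser; the class and method names (ObjHeader, parse) are hypothetical. It assumes only the general PDF tokenisation rule that a keyword ends at any white-space or delimiter character, and it does not model the additional file-structure requirements that PDF/A-1 places on top of that.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not Preflight code.
class ObjHeader {

    // PDF white-space characters: NUL, HT, LF, FF, CR, SP.
    private static final String WS = "[\\x00\\t\\n\\f\\r ]";

    // "<objnum> <gen> obj", where obj must be followed by white space,
    // a delimiter such as '<' or '[', or the end of the input.
    private static final Pattern HEADER = Pattern.compile(
        "(\\d+)" + WS + "+(\\d+)" + WS + "+obj"
        + "(?=[\\x00\\t\\n\\f\\r ()<>\\[\\]{}/%]|\\z)");

    // Returns {objectNumber, generation}, or null if s does not
    // start with an object header.
    static int[] parse(String s) {
        Matcher m = HEADER.matcher(s);
        if (!m.lookingAt()) {
            return null;
        }
        return new int[] {
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2))
        };
    }

    public static void main(String[] args) {
        // Header followed directly by a dictionary, with no EOL in between.
        System.out.println(parse("12 0 obj<</Type/Catalog>>") != null);
        // Header followed by an EOL.
        System.out.println(parse("12 0 obj\n<</Type/Catalog>>") != null);
        // "object" is a different token, so this is not a header.
        System.out.println(parse("12 0 object") != null);
    }
}
```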

