Re: PDF Quality

2011-01-14 Thread Andreas Lehmkühler
Hi,

Sent: Fri, 14 Jan 2011
From: Olivier DOREMIEUX oliv...@doremieux.org

 I wasn't too happy with the quality of the generated PDF, especially 
 because I am generating the PDF from images, so I played around a bit 
 with PDJpeg.
 
 In public PDJpeg(PDDocument doc, BufferedImage bi) throws IOException {
 
 I replaced:
 
 //ImageIO.write(bi, "jpeg", os);
 
 with:
 
  ImageWriter writer = null;
  Iterator iter = ImageIO.getImageWritersByFormatName("jpg");
  if (iter.hasNext()) {
      writer = (ImageWriter) iter.next();
  }
 
  ImageOutputStream ios = ImageIO.createImageOutputStream(os);
  writer.setOutput(ios);
 
  // Set the compression quality
  JPEGImageWriteParam iwparam = new JPEGImageWriteParam(Locale.getDefault());
  iwparam.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
  iwparam.setCompressionQuality(1.0f);
 
  // Write the image
  writer.write(null, new IIOImage(bi, null, null), iwparam);
 
  writer.dispose();
 
 
 The quality is much better. I think the default compression quality 
 is 0.75.
 The PDF file is bigger, so maybe we could have a global parameter to set 
 the desired quality and pass it to iwparam.setCompressionQuality() instead 
 of the hard-coded 1.0f.
 
 
 Hope this helps, and I wish that change could be integrated.
Looks interesting. Thanks for the contribution. Please file an issue on JIRA [1]
and attach a patch containing a diff against the current trunk. Don't forget to
check the Grant license to ASF... checkbox.

BR
Andreas Lehmkühler


[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters

2011-01-14 Thread Mel Martinez (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981815#action_12981815
 ] 

Mel Martinez commented on PDFBOX-588:
-

Wow.

That is weird.

It only takes 40 seconds to extract from PDF Ref v1.7 on my box, with PDFBox 
v1.4.0.

Do you maybe have a font file that is heavily fragmented or something like that?


 Problem extracting text in newline characters
 -

 Key: PDFBOX-588
 URL: https://issues.apache.org/jira/browse/PDFBOX-588
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 0.8.0-incubator, 1.3.1, 1.4.0
 Environment: Win XP
Reporter: Hesham
Assignee: Andreas Lehmkühler
 Attachments: Enters-sample.pdf, PDFBOX588-Enters-sample.txt, 
 PDFBOX588-Enters-sample1.png, PDFBOX588-Enters-sample1.png, 
 PDFTextStripper.patch


 Hello ,
  
 I have a PDF file with only 1 page. When I try to extract its text using:
 String pageData = stripper.getText( pdfFile );
 it ignores some of the line breaks (Enter characters) between lines, so the 
 last word of a line and the first word of the next line appear as one word 
 with no space between them.
 However, if I copy the text manually from the PDF and paste it into a text 
 editor, line breaks appear after the same lines that cause the problem 
 in PDFBox.
 Please check the attached file as a sample.
  
 Is there a way to fix this?
  
 Best regards ,

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PDFBOX-942) Image quality improvements

2011-01-14 Thread Olivier DOREMIEUX (JIRA)
Image quality improvements
--

 Key: PDFBOX-942
 URL: https://issues.apache.org/jira/browse/PDFBOX-942
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel
Affects Versions: 1.4.0
Reporter: Olivier DOREMIEUX


The quality of images inserted into a PDF document could be improved by 
changing PDJpeg.java.
In the constructor
public PDJpeg(PDDocument doc, BufferedImage bi) throws IOException

ImageIO.write(bi, "jpeg", os);

could be replaced by:

  ImageWriter writer = null;
  Iterator iter = ImageIO.getImageWritersByFormatName("jpg");
  if (iter.hasNext()) {
      writer = (ImageWriter) iter.next();
  }
 
  ImageOutputStream ios = ImageIO.createImageOutputStream(os);
  writer.setOutput(ios);
 
  // Set the compression quality
  JPEGImageWriteParam iwparam = new JPEGImageWriteParam(Locale.getDefault());
  iwparam.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
  iwparam.setCompressionQuality(1.0f);
 
  // Write the image
  writer.write(null, new IIOImage(bi, null, null), iwparam);
 
  writer.dispose();

This increases the size of the generated PDF.
By default the JPEG quality is 0.75; in the patch I use 1.0, the maximum quality.
As a suggestion, the JPEG quality could be a global variable, since it 
affects the size of the PDF.
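
The suggestion to make the JPEG quality configurable rather than hard-coded might look roughly like the sketch below. It uses only javax.imageio. The class name JpegQualityWriter and the method writeJpeg are hypothetical and not part of PDFBox, and how PDJpeg would actually receive the quality value is left open.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Iterator;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

// Hypothetical helper, not an existing PDFBox class.
public class JpegQualityWriter {

    /**
     * Encodes the image as JPEG into os.
     * quality must be between 0.0 (smallest file) and 1.0 (best quality).
     */
    public static void writeJpeg(BufferedImage bi, OutputStream os, float quality)
            throws IOException {
        Iterator<ImageWriter> iter = ImageIO.getImageWritersByFormatName("jpg");
        if (!iter.hasNext()) {
            throw new IOException("No JPEG ImageWriter available");
        }
        ImageWriter writer = iter.next();
        // Closing the ImageOutputStream flushes it but leaves os open.
        try (ImageOutputStream ios = ImageIO.createImageOutputStream(os)) {
            if (ios == null) {
                throw new IOException("Could not create ImageOutputStream");
            }
            writer.setOutput(ios);

            // Switch to explicit mode so the requested quality is honoured.
            ImageWriteParam param = writer.getDefaultWriteParam();
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionQuality(quality);

            writer.write(null, new IIOImage(bi, null, null), param);
        } finally {
            writer.dispose();
        }
    }
}
```

A caller would pass 1.0f for the behaviour of the patch, or a lower value when file size matters more than image fidelity. The image should not have an alpha channel (for example TYPE_INT_RGB), since the standard JPEG writer may reject or mis-handle images with transparency.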





[jira] Updated: (PDFBOX-942) Image quality improvements

2011-01-14 Thread Olivier DOREMIEUX (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier DOREMIEUX updated PDFBOX-942:
-

Attachment: PDJpeg.patch




[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters

2011-01-14 Thread Hesham (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981910#action_12981910
 ] 

Hesham commented on PDFBOX-588:
---

I do not know what a fragmented font is!
But I have created a sample project to test extracting text from the PDF 
reference, and it took the same time I mentioned with both PDFBox versions. I 
do not understand how it works fine for you!

Here is my code:
private void readPDFButtonActionPerformed() {
    try {
        PDDocument pdfRef = PDDocument.load( "C:\\pdf_reference_1.7.pdf" );
        PDFTextStripper stripper = new PDFTextStripper();

        for( int pageNum = 1; pageNum < pdfRef.getNumberOfPages(); pageNum++ ) {
            System.out.println( pageNum );
            stripper.setStartPage( pageNum );
            stripper.setEndPage( pageNum );
            stripper.getText( pdfRef );
        }
        System.out.println( "Done" );
    } catch (IOException e) {
        e.printStackTrace();
    }
}
