Re: PDF Quality
Hi,

Sent: Fri, 14 Jan 2011
From: Olivier DOREMIEUX oliv...@doremieux.org

I wasn't too happy with the quality of the generated PDF, especially because I am generating the PDF from images, so I played a little with PDJpeg.

In

    public PDJpeg(PDDocument doc, BufferedImage bi) throws IOException {

I replaced:

    //ImageIO.write(bi, "jpeg", os);

with:

    ImageWriter writer = null;
    Iterator iter = ImageIO.getImageWritersByFormatName("jpg");
    if (iter.hasNext()) {
        writer = (ImageWriter) iter.next();
    }
    ImageOutputStream ios = ImageIO.createImageOutputStream(os);
    writer.setOutput(ios);

    // Set the compression quality
    JPEGImageWriteParam iwparam = new JPEGImageWriteParam(Locale.getDefault());
    iwparam.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
    iwparam.setCompressionQuality(1.0f);

    // Write the image
    writer.write(null, new IIOImage(bi, null, null), iwparam);
    writer.dispose();

The quality is much better. I think the default compression quality is 0.75. The PDF file is bigger, though. So maybe we could have a global parameter to set the desired quality and use it in iwparam.setCompressionQuality(1.0f);

I hope this helps and that the change can be integrated.

Looks interesting. Thanks for the contribution. Please file an issue on JIRA [1] and attach a patch containing a diff against the current trunk. Don't forget to check the "Grant license to ASF..." checkbox.

BR
Andreas Lehmkühler
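The "global parameter" idea from the message above could look roughly like the following. This is only a sketch built on the standard javax.imageio API, not PDFBox code: the class name JpegQualityWriter and its setQuality/getQuality/write methods are hypothetical names chosen for illustration, and the starting value of 0.75 simply mirrors the default mentioned in the message.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Iterator;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

// Hypothetical helper (not part of PDFBox): JPEG encoding with one shared quality setting.
class JpegQualityWriter {

    // Assumed starting value, matching the 0.75 default mentioned in the thread.
    private static volatile float quality = 0.75f;

    // Sets the shared quality; 0.0 is strongest compression, 1.0 is best quality.
    static void setQuality(float q) {
        if (q < 0.0f || q > 1.0f) {
            throw new IllegalArgumentException("quality must be between 0.0 and 1.0");
        }
        quality = q;
    }

    static float getQuality() {
        return quality;
    }

    // Encodes the image as JPEG into os using the current shared quality.
    // The caller keeps ownership of os; it is flushed but not closed here.
    static void write(BufferedImage bi, OutputStream os) throws IOException {
        Iterator<ImageWriter> iter = ImageIO.getImageWritersByFormatName("jpeg");
        if (!iter.hasNext()) {
            throw new IOException("No JPEG ImageWriter available");
        }
        ImageWriter writer = iter.next();
        try (ImageOutputStream ios = ImageIO.createImageOutputStream(os)) {
            writer.setOutput(ios);
            ImageWriteParam param = writer.getDefaultWriteParam();
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionQuality(quality);
            writer.write(null, new IIOImage(bi, null, null), param);
        } finally {
            writer.dispose();
        }
    }
}
```

A caller would set the quality once, for example JpegQualityWriter.setQuality(1.0f), and then route every JPEG encode through write(...). Unlike the snippet in the message, this version fails with an IOException instead of a NullPointerException when no JPEG writer is registered.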
[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters
[ https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981815#action_12981815 ]

Mel Martinez commented on PDFBOX-588:
-------------------------------------

Wow. That is weird. It takes only 40 seconds to extract the text from the PDF Reference v1.7 on my box with PDFBox v1.4.0. Do you maybe have a font file that is heavily fragmented, or something like that?

Problem extracting text in newline characters
---------------------------------------------

Key: PDFBOX-588
URL: https://issues.apache.org/jira/browse/PDFBOX-588
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 0.8.0-incubator, 1.3.1, 1.4.0
Environment: Win XP
Reporter: Hesham
Assignee: Andreas Lehmkühler
Attachments: Enters-sample.pdf, PDFBOX588-Enters-sample.txt, PDFBOX588-Enters-sample1.png, PDFBOX588-Enters-sample1.png, PDFTextStripper.patch

Hello,

I have a PDF file with only one page. When I try to extract its text using:

    String pageData = stripper.getText( pdfFile );

it ignores some Enter characters between lines, so the last word of one line and the first word of the next line appear as a single word with no space between them!

If I instead copy the text manually from the PDF and paste it into a text editor, Enter characters appear after the same lines that cause the problem in PDFBox.

Please check the attached file as a sample. Is there a way to fix this?

Best regards,

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Created: (PDFBOX-942) Image quality improvements
Image quality improvements
--------------------------

Key: PDFBOX-942
URL: https://issues.apache.org/jira/browse/PDFBOX-942
Project: PDFBox
Issue Type: Bug
Components: PDModel
Affects Versions: 1.4.0
Reporter: Olivier DOREMIEUX

The quality of the images inserted into a PDF document could be improved by changing PDJpeg.java.

In the constructor

    public PDJpeg(PDDocument doc, BufferedImage bi) throws IOException

the call

    ImageIO.write(bi, "jpeg", os);

could be replaced by:

    ImageWriter writer = null;
    Iterator iter = ImageIO.getImageWritersByFormatName("jpg");
    if (iter.hasNext()) {
        writer = (ImageWriter) iter.next();
    }
    ImageOutputStream ios = ImageIO.createImageOutputStream(os);
    writer.setOutput(ios);

    // Set the compression quality
    JPEGImageWriteParam iwparam = new JPEGImageWriteParam(Locale.getDefault());
    iwparam.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
    iwparam.setCompressionQuality(1.0f);

    // Write the image
    writer.write(null, new IIOImage(bi, null, null), iwparam);
    writer.dispose();

This increases the size of the generated PDF. By default the JPEG quality is 0.75; in the patch I use 1.0, the maximum quality.

As a suggestion, the JPEG quality could be a global variable, since it affects the size of the PDF.
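The size cost described in the issue can be checked outside PDFBox with plain javax.imageio. The sketch below encodes the same image at 0.75 and at 1.0 and prints both byte counts. The class name JpegSizeComparison and its helper methods are made up for this illustration, and the random-noise test image is an assumption chosen to make the difference obvious; real photographs will show a different ratio.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Iterator;
import java.util.Random;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

// Illustrative only: compares encoded JPEG sizes at two quality settings.
class JpegSizeComparison {

    // Encodes bi as JPEG at the given quality (0.0 to 1.0) and returns the bytes.
    static byte[] encode(BufferedImage bi, float quality) throws IOException {
        Iterator<ImageWriter> iter = ImageIO.getImageWritersByFormatName("jpeg");
        if (!iter.hasNext()) {
            throw new IOException("No JPEG ImageWriter available");
        }
        ImageWriter writer = iter.next();
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (ImageOutputStream ios = ImageIO.createImageOutputStream(os)) {
            writer.setOutput(ios);
            ImageWriteParam param = writer.getDefaultWriteParam();
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionQuality(quality);
            writer.write(null, new IIOImage(bi, null, null), param);
        } finally {
            writer.dispose();
        }
        return os.toByteArray();
    }

    // Builds a reproducible noisy RGB image; noise is hard to compress,
    // so the effect of the quality setting shows up clearly.
    static BufferedImage noisyTestImage(int width, int height) {
        Random rnd = new Random(42);
        BufferedImage bi = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                bi.setRGB(x, y, rnd.nextInt(0x1000000));
            }
        }
        return bi;
    }

    public static void main(String[] args) throws IOException {
        BufferedImage bi = noisyTestImage(128, 128);
        System.out.println("quality 0.75: " + encode(bi, 0.75f).length + " bytes");
        System.out.println("quality 1.00: " + encode(bi, 1.0f).length + " bytes");
    }
}
```

Running the same comparison on the images that actually go into the PDF would give a better idea of how much larger the file gets for a given quality setting.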
[jira] Updated: (PDFBOX-942) Image quality improvements
[ https://issues.apache.org/jira/browse/PDFBOX-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olivier DOREMIEUX updated PDFBOX-942:
------------------------------------

Attachment: PDJpeg.patch

Image quality improvements
--------------------------

Key: PDFBOX-942
URL: https://issues.apache.org/jira/browse/PDFBOX-942
Project: PDFBox
Issue Type: Bug
Components: PDModel
Affects Versions: 1.4.0
Reporter: Olivier DOREMIEUX
Attachments: PDJpeg.patch
[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters
[ https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981910#action_12981910 ]

Hesham commented on PDFBOX-588:
-------------------------------

I do not know what a fragmented font is! But I have created a sample project to test extracting text from the PDF Reference, and it took the same time I mentioned with both PDFBox versions. I do not understand how it works fine for you!

Here is my code:

    private void readPDFButtonActionPerformed() {
        try {
            PDDocument pdfRef = PDDocument.load( "C:\\pdf_reference_1.7.pdf" );
            PDFTextStripper stripper = new PDFTextStripper();
            // Page numbers are 1-based, so the loop includes the last page.
            for( int pageNum = 1; pageNum <= pdfRef.getNumberOfPages(); pageNum++ ) {
                System.out.println( pageNum );
                // Extract one page at a time.
                stripper.setStartPage( pageNum );
                stripper.setEndPage( pageNum );
                stripper.getText( pdfRef );
            }
            System.out.println( "Done" );
        } catch (IOException e) {
            e.printStackTrace();
        }
    }