[jira] [Commented] (PDFBOX-2054) Remove System.out.println()
[ https://issues.apache.org/jira/browse/PDFBOX-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988598#comment-13988598 ]

Tilman Hausherr commented on PDFBOX-2054:

Committed a first round in rev 1592153 for the trunk and rev 1592155 for the 1.8 branch.

Remove System.out.println()

Key: PDFBOX-2054
URL: https://issues.apache.org/jira/browse/PDFBOX-2054
Project: PDFBox
Issue Type: Bug
Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
Reporter: Hong-Thai Nguyen
Assignee: Tilman Hausherr
Priority: Minor

For example at GlyfSimpleDescript.java
{code}
...
catch (ArrayIndexOutOfBoundsException e) {
    System.out.println("error: array index out of bounds");
}
{code}
and also 'printStackTrace' like in PageDrawer.java:
{code}
...
catch( IOException io ) {
    io.printStackTrace();
}
{code}
The code should forward the exception or stay silent.

--
This message was sent by Atlassian JIRA (v6.2#6252)
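The two alternatives named in the report (report the problem through a logger, or forward it to the caller) can be sketched as follows. This is only an illustration, not the committed change: `GlyphReader`, `readFlag` and `readFlagStrict` are hypothetical names, and `java.util.logging` is used to keep the example self-contained, whereas PDFBox's own classes may use a different logging facade.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for a class such as GlyfSimpleDescript; not PDFBox code.
class GlyphReader {
    private static final Logger LOG = Logger.getLogger(GlyphReader.class.getName());

    // Returns the value at index, or -1 after logging when the index is invalid.
    static int readFlag(int[] flags, int index) {
        try {
            return flags[index];
        } catch (ArrayIndexOutOfBoundsException e) {
            // Log with the exception attached instead of System.out.println().
            LOG.log(Level.WARNING, "array index out of bounds: " + index, e);
            return -1;
        }
    }

    // Forwards the failure to the caller instead of calling printStackTrace().
    static int readFlagStrict(int[] flags, int index) throws IOException {
        if (index < 0 || index >= flags.length) {
            throw new IOException("flag index out of range: " + index);
        }
        return flags[index];
    }
}
```

With a logger, applications embedding the library can decide where the message goes and at which level, which is not possible with direct writes to standard output.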
[jira] [Commented] (PDFBOX-2053) Issue with PDFBox position reading
[ https://issues.apache.org/jira/browse/PDFBOX-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988602#comment-13988602 ]

Tilman Hausherr commented on PDFBOX-2053:

This is very similar to PDFBOX-62, although the fix I proposed there doesn't work there, for a reason that I don't know yet.

Issue with PDFBox position reading

Key: PDFBOX-2053
URL: https://issues.apache.org/jira/browse/PDFBOX-2053
Project: PDFBox
Issue Type: Bug
Affects Versions: 1.8.3
Reporter: Orbel Mkrtchyan
Attachments: test.pdf

Using PDFBox 1.8.4,

bug #1:
    PDDocument doc = new PDDocument();
    doc.load("test-pcc7247.pdf");
    doc.save("out.pdf");
    doc.close();
The resulting file is corrupted, contains 0 pages and cannot be viewed by Acrobat Reader.

bug #2: consider the following code snippet. The code runs like this:
    Extractor extractor = new Extractor();
    extractor.writeText(pdDoc, output);
Using the code defined like this:
    public class Extractor extends PDFTextStripper {
        ...
        protected void writePage() throws IOException {
            for( int i = 0; i < charactersByArticle.size(); i++) {
                List<TextPosition> textList = charactersByArticle.get( i );
                Iterator textIter = textList.iterator();
                while( textIter.hasNext() ) {
                    TextPosition position = (TextPosition)textIter.next();
In the given piece of code, position variable correctly iterates through the letters of the first line of the provided pdf document, but its coordinates (x, y, widths, etc) are always the same. Just to be clear, 1 position always relates to 1 letter, and its widths array's length always equals 1. So we get the same coordinates for every letter in a line. Expected behaviour is either having new coordinates per letter or having widths[] contain widths for the characters of a whole line of text

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (PDFBOX-1584) Add unit test for RandomAccessFileOutputStream
[ https://issues.apache.org/jira/browse/PDFBOX-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13988605#comment-13988605 ] Ajay Bhat commented on PDFBOX-1584: --- Has this patch been commited yet? Add unit test for RandomAccessFileOutputStream -- Key: PDFBOX-1584 URL: https://issues.apache.org/jira/browse/PDFBOX-1584 Project: PDFBox Issue Type: Test Components: Writing Affects Versions: 1.8.1 Reporter: Fredrik Kjellberg Priority: Minor Attachments: TestRandomAccessFileOutputStream_diff.txt This patch includes a unit test for RandomAccessFileOutputStream -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (PDFBOX-1584) Add unit test for RandomAccessFileOutputStream
[ https://issues.apache.org/jira/browse/PDFBOX-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13988605#comment-13988605 ] Ajay Bhat edited comment on PDFBOX-1584 at 5/3/14 7:07 AM: --- Has this patch been committed? was (Author: ajay bhat): Has this patch been commited yet? Add unit test for RandomAccessFileOutputStream -- Key: PDFBOX-1584 URL: https://issues.apache.org/jira/browse/PDFBOX-1584 Project: PDFBox Issue Type: Test Components: Writing Affects Versions: 1.8.1 Reporter: Fredrik Kjellberg Priority: Minor Attachments: TestRandomAccessFileOutputStream_diff.txt This patch includes a unit test for RandomAccessFileOutputStream -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (PDFBOX-1584) Add unit test for RandomAccessFileOutputStream
[ https://issues.apache.org/jira/browse/PDFBOX-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13988643#comment-13988643 ] Tilman Hausherr commented on PDFBOX-1584: - No. This happens sometimes :-( I looked at the patch, then I looked at the timeline of the submitter ([~frkj]), this is actually part of a bigger discussion about problems with RandomAccessFileOutputStream and RandomAccessFileInputStream. I will try to understand what this is about. Be patient Add unit test for RandomAccessFileOutputStream -- Key: PDFBOX-1584 URL: https://issues.apache.org/jira/browse/PDFBOX-1584 Project: PDFBox Issue Type: Test Components: Writing Affects Versions: 1.8.1 Reporter: Fredrik Kjellberg Priority: Minor Attachments: TestRandomAccessFileOutputStream_diff.txt This patch includes a unit test for RandomAccessFileOutputStream -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (PDFBOX-62) Incorrect (zero) character widths returned in some docs
[ https://issues.apache.org/jira/browse/PDFBOX-62?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-62:

Attachment: (was: PDTrueTypeFont.diff)

Incorrect (zero) character widths returned in some docs

Key: PDFBOX-62
URL: https://issues.apache.org/jira/browse/PDFBOX-62
Project: PDFBox
Issue Type: Bug
Components: Rendering, Text extraction
Assignee: Andreas Lehmkühler
Attachments: 5542.pdf, pdfbox-2006-zerowidth.pdf-1.png, pdfbox-62-zerowidth.pdf-1.png

[imported from SourceForge]
http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1216674
Originally submitted by tamirhassan on 2005-06-07 13:42.

For certain PDF documents (such as the one attached) the character/string widths (as obtained e.g. by the PDFont.getStringWidth method) are not returned correctly, i.e. they appear to be correct for punctuation characters but are zero for alphanumeric characters.

It seems as if these alphanumeric characters are NOT within PDFont.firstChar and PDFont.lastChar in the Type 1 font. The method therefore attempts to obtain the font widths from the AFM (font metric) file, but fails (silently) with a 'resource is null' logline message. (Note that this problem doesn't seem to occur with Type 1 fonts in other documents.)

A more detailed discussion regarding this issue can be found in this link:
http://sourceforge.net/forum/forum.php?thread_id=1260349&forum_id=267205

Thanks in advance for any help that can be obtained,
Tam

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (PDFBOX-62) Incorrect (zero) character widths returned in some docs
[ https://issues.apache.org/jira/browse/PDFBOX-62?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-62:

Attachment: PDTrueTypeFont.diff

Updated the patch to handle the file from PDFBOX-2053; it maps missing Arial widths to Helvetica AFM files.

Incorrect (zero) character widths returned in some docs

Key: PDFBOX-62
URL: https://issues.apache.org/jira/browse/PDFBOX-62
Project: PDFBox
Issue Type: Bug
Components: Rendering, Text extraction
Assignee: Andreas Lehmkühler
Attachments: 5542.pdf, PDTrueTypeFont.diff, pdfbox-2006-zerowidth.pdf-1.png, pdfbox-62-zerowidth.pdf-1.png

[imported from SourceForge]
http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1216674
Originally submitted by tamirhassan on 2005-06-07 13:42.

For certain PDF documents (such as the one attached) the character/string widths (as obtained e.g. by the PDFont.getStringWidth method) are not returned correctly, i.e. they appear to be correct for punctuation characters but are zero for alphanumeric characters.

It seems as if these alphanumeric characters are NOT within PDFont.firstChar and PDFont.lastChar in the Type 1 font. The method therefore attempts to obtain the font widths from the AFM (font metric) file, but fails (silently) with a 'resource is null' logline message. (Note that this problem doesn't seem to occur with Type 1 fonts in other documents.)

A more detailed discussion regarding this issue can be found in this link:
http://sourceforge.net/forum/forum.php?thread_id=1260349&forum_id=267205

Thanks in advance for any help that can be obtained,
Tam

--
This message was sent by Atlassian JIRA (v6.2#6252)
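The lookup order described in the report and in the patch note (widths array between firstChar and lastChar first, then font metrics, with Arial borrowing Helvetica's metrics) might look roughly like the sketch below. It is not the attached PDTrueTypeFont.diff: the class `WidthLookup`, the alias table and the metrics map are hypothetical, and the single width value is only illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a width lookup with a metrics fallback; not PDFBox code.
class WidthLookup {
    // Assumed alias table: fonts without metrics of their own borrow another font's.
    private static final Map<String, String> ALIASES = new HashMap<>();
    // Assumed stand-in for AFM data: font name -> (character code -> width).
    private static final Map<String, Map<Integer, Float>> METRICS = new HashMap<>();

    static {
        ALIASES.put("Arial", "Helvetica");
        Map<Integer, Float> helvetica = new HashMap<>();
        helvetica.put((int) 'A', 667f); // illustrative value for this example
        METRICS.put("Helvetica", helvetica);
    }

    // Returns the width for a character code, or 0 when nothing is known.
    static float width(String fontName, int firstChar, int lastChar,
                       float[] widths, int code) {
        // 1. Codes inside [firstChar, lastChar] come from the font's widths array.
        if (code >= firstChar && code <= lastChar) {
            return widths[code - firstChar];
        }
        // 2. Otherwise fall back to metrics, resolving aliases such as Arial.
        String metricsName = ALIASES.getOrDefault(fontName, fontName);
        Map<Integer, Float> table = METRICS.get(metricsName);
        if (table == null || !table.containsKey(code)) {
            return 0f; // the zero-width case described in the report
        }
        return table.get(code);
    }
}
```

In this sketch a font name with no alias and no metrics still yields zero, which is the symptom the report describes for alphanumeric characters outside the widths array.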
[jira] [Commented] (PDFBOX-2053) Issue with PDFBox position reading
[ https://issues.apache.org/jira/browse/PDFBOX-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988657#comment-13988657 ]

Tilman Hausherr commented on PDFBOX-2053:

I updated my fix for PDFBOX-62 and rendering works for me. The problems that you get with your Extractor class are there because of the zero widths problem.

Re bug1: correct your code to
{code}
PDDocument doc = PDDocument.load("test-pcc7247.pdf");
doc.save("out.pdf");
doc.close();
{code}

Issue with PDFBox position reading

Key: PDFBOX-2053
URL: https://issues.apache.org/jira/browse/PDFBOX-2053
Project: PDFBox
Issue Type: Bug
Affects Versions: 1.8.3
Reporter: Orbel Mkrtchyan
Attachments: test.pdf

Using PDFBox 1.8.4,

bug #1:
    PDDocument doc = new PDDocument();
    doc.load("test-pcc7247.pdf");
    doc.save("out.pdf");
    doc.close();
The resulting file is corrupted, contains 0 pages and cannot be viewed by Acrobat Reader.

bug #2: consider the following code snippet. The code runs like this:
    Extractor extractor = new Extractor();
    extractor.writeText(pdDoc, output);
Using the code defined like this:
    public class Extractor extends PDFTextStripper {
        ...
        protected void writePage() throws IOException {
            for( int i = 0; i < charactersByArticle.size(); i++) {
                List<TextPosition> textList = charactersByArticle.get( i );
                Iterator textIter = textList.iterator();
                while( textIter.hasNext() ) {
                    TextPosition position = (TextPosition)textIter.next();
In the given piece of code, position variable correctly iterates through the letters of the first line of the provided pdf document, but its coordinates (x, y, widths, etc) are always the same. Just to be clear, 1 position always relates to 1 letter, and its widths array's length always equals 1. So we get the same coordinates for every letter in a line. Expected behaviour is either having new coordinates per letter or having widths[] contain widths for the characters of a whole line of text

--
This message was sent by Atlassian JIRA (v6.2#6252)
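The correction for bug #1 calls load on the class rather than on an instance, which suggests that load is a static factory returning a new document. Under that assumption, the reported code ends up saving the empty document it created itself. The sketch below reproduces the effect with a hypothetical `Doc` class; it does not use PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a document class with a static load() factory.
class Doc {
    private final List<String> pages = new ArrayList<>();

    // Static factory: builds and returns a NEW document; it never fills an existing one.
    static Doc load(String fileName) {
        Doc loaded = new Doc();
        loaded.pages.add("page 1 of " + fileName);
        return loaded;
    }

    int pageCount() {
        return pages.size();
    }
}

class LoadDemo {
    // Mirrors the reported code: the loaded document is thrown away.
    static int wrongWay() {
        Doc doc = new Doc();
        Doc.load("test.pdf"); // same effect as doc.load(...): the result is discarded
        return doc.pageCount();
    }

    // Mirrors the corrected code: keep the document that load() returns.
    static int rightWay() {
        Doc doc = Doc.load("test.pdf");
        return doc.pageCount();
    }
}
```

Java allows a static method to be invoked through an instance reference, so the reported code compiles; many compilers and IDEs only emit a warning for it.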
[jira] [Closed] (PDFBOX-2053) Issue with PDFBox position reading
[ https://issues.apache.org/jira/browse/PDFBOX-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-2053.

Resolution: Duplicate

Issue with PDFBox position reading

Key: PDFBOX-2053
URL: https://issues.apache.org/jira/browse/PDFBOX-2053
Project: PDFBox
Issue Type: Bug
Affects Versions: 1.8.3
Reporter: Orbel Mkrtchyan
Attachments: test.pdf

Using PDFBox 1.8.4,

bug #1:
    PDDocument doc = new PDDocument();
    doc.load("test-pcc7247.pdf");
    doc.save("out.pdf");
    doc.close();
The resulting file is corrupted, contains 0 pages and cannot be viewed by Acrobat Reader.

bug #2: consider the following code snippet. The code runs like this:
    Extractor extractor = new Extractor();
    extractor.writeText(pdDoc, output);
Using the code defined like this:
    public class Extractor extends PDFTextStripper {
        ...
        protected void writePage() throws IOException {
            for( int i = 0; i < charactersByArticle.size(); i++) {
                List<TextPosition> textList = charactersByArticle.get( i );
                Iterator textIter = textList.iterator();
                while( textIter.hasNext() ) {
                    TextPosition position = (TextPosition)textIter.next();
In the given piece of code, position variable correctly iterates through the letters of the first line of the provided pdf document, but its coordinates (x, y, widths, etc) are always the same. Just to be clear, 1 position always relates to 1 letter, and its widths array's length always equals 1. So we get the same coordinates for every letter in a line. Expected behaviour is either having new coordinates per letter or having widths[] contain widths for the characters of a whole line of text

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (PDFBOX-1122) Parsing Error, Skipping Object
[ https://issues.apache.org/jira/browse/PDFBOX-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988671#comment-13988671 ]

Rogério Pereira Araújo commented on PDFBOX-1122:

I can confirm the same error on version 1.8.4 while parsing PDFs with Tika during a Nutch parsing job.

Parsing Error, Skipping Object

Key: PDFBOX-1122
URL: https://issues.apache.org/jira/browse/PDFBOX-1122
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 1.6.0
Environment: Working with Windows 7 in eclipse.
Reporter: Raihan Jamal
Assignee: Andreas Lehmkühler
Labels: pdfbox
Original Estimate: 336h
Remaining Estimate: 336h

Parsing Error, Skipping Object
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@38011d45
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:439)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1088)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1053)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.Tika.parseToString(Tika.java:357)
    at edu.uci.ics.crawler4j.crawler.BinaryParser.parse(BinaryParser.java:37)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.handleBinary(WebCrawler.java:223)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:462)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:129)
    at java.lang.Thread.run(Thread.java:662)
Did not found XRef object at specified startxref position 0

This is the sample URL where I am facing this problem:
http://www.qualcomm.com/documents/files/rev-b-enhanced-mobile-broadband-for-all.pdf

Any suggestions why this is happening? Or is it a bug?

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (PDFBOX-1122) Parsing Error, Skipping Object
[ https://issues.apache.org/jira/browse/PDFBOX-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988685#comment-13988685 ]

Tilman Hausherr commented on PDFBOX-1122:

The old URL no longer works. What file did you use?

Parsing Error, Skipping Object

Key: PDFBOX-1122
URL: https://issues.apache.org/jira/browse/PDFBOX-1122
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 1.6.0
Environment: Working with Windows 7 in eclipse.
Reporter: Raihan Jamal
Assignee: Andreas Lehmkühler
Labels: pdfbox
Original Estimate: 336h
Remaining Estimate: 336h

Parsing Error, Skipping Object
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@38011d45
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:439)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1088)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1053)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.Tika.parseToString(Tika.java:357)
    at edu.uci.ics.crawler4j.crawler.BinaryParser.parse(BinaryParser.java:37)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.handleBinary(WebCrawler.java:223)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:462)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:129)
    at java.lang.Thread.run(Thread.java:662)
Did not found XRef object at specified startxref position 0

This is the sample URL where I am facing this problem:
http://www.qualcomm.com/documents/files/rev-b-enhanced-mobile-broadband-for-all.pdf

Any suggestions why this is happening? Or is it a bug?

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (PDFBOX-1584) Add unit test for RandomAccessFileOutputStream
[ https://issues.apache.org/jira/browse/PDFBOX-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved PDFBOX-1584. - Resolution: Fixed Fix Version/s: 2.0.0 1.8.6 Committed in rev 1592225 for the trunk and 1592226 for the 1.8 branch. Thanks! I won't commit PDFBOX-1582 for now as this doesn't seem to be complete. Add unit test for RandomAccessFileOutputStream -- Key: PDFBOX-1584 URL: https://issues.apache.org/jira/browse/PDFBOX-1584 Project: PDFBox Issue Type: Test Components: Writing Affects Versions: 1.8.1 Reporter: Fredrik Kjellberg Priority: Minor Fix For: 1.8.6, 2.0.0 Attachments: TestRandomAccessFileOutputStream_diff.txt This patch includes a unit test for RandomAccessFileOutputStream -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (PDFBOX-2054) Remove System.out.println()
[ https://issues.apache.org/jira/browse/PDFBOX-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988749#comment-13988749 ]

Tilman Hausherr commented on PDFBOX-2054:

Committed a second round in rev 1592251 for the trunk and rev 1592252 for the 1.8 branch.

Remove System.out.println()

Key: PDFBOX-2054
URL: https://issues.apache.org/jira/browse/PDFBOX-2054
Project: PDFBox
Issue Type: Bug
Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
Reporter: Hong-Thai Nguyen
Assignee: Tilman Hausherr
Priority: Minor

For example at GlyfSimpleDescript.java
{code}
...
catch (ArrayIndexOutOfBoundsException e) {
    System.out.println("error: array index out of bounds");
}
{code}
and also 'printStackTrace' like in PageDrawer.java:
{code}
...
catch( IOException io ) {
    io.printStackTrace();
}
{code}
The code should forward the exception or stay silent.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (PDFBOX-2054) Remove System.out.println()
[ https://issues.apache.org/jira/browse/PDFBOX-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr resolved PDFBOX-2054.

Resolution: Fixed
Fix Version/s: 2.0.0, 1.8.6

I'm done. I didn't touch preflight, examples and the GUI tools. Thanks for pointing us to these legacy problems.

Remove System.out.println()

Key: PDFBOX-2054
URL: https://issues.apache.org/jira/browse/PDFBOX-2054
Project: PDFBox
Issue Type: Bug
Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
Reporter: Hong-Thai Nguyen
Assignee: Tilman Hausherr
Priority: Minor
Fix For: 1.8.6, 2.0.0

For example at GlyfSimpleDescript.java
{code}
...
catch (ArrayIndexOutOfBoundsException e) {
    System.out.println("error: array index out of bounds");
}
{code}
and also 'printStackTrace' like in PageDrawer.java:
{code}
...
catch( IOException io ) {
    io.printStackTrace();
}
{code}
The code should forward the exception or stay silent.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (PDFBOX-2045) Merging PDFs has no effect
[ https://issues.apache.org/jira/browse/PDFBOX-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-2045: Component/s: AcroForm Merging PDFs has no effect -- Key: PDFBOX-2045 URL: https://issues.apache.org/jira/browse/PDFBOX-2045 Project: PDFBox Issue Type: Bug Components: AcroForm, Utilities Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0 Reporter: Gerhard Temper Attachments: specialpdf.pdf Merging attached PDF results in a PDF consisting only of the special PDF ignoring all other PDFs without any error. Command line to reproduce the problem: java -jar pdfbox-app-1.8.4.jar PDFMerger page1.pdf specialpdf.pdf result.pdf -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (PDFBOX-2045) Merging PDFs has no effect
[ https://issues.apache.org/jira/browse/PDFBOX-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-2045: Affects Version/s: 2.0.0 1.8.6 1.8.5 Merging PDFs has no effect -- Key: PDFBOX-2045 URL: https://issues.apache.org/jira/browse/PDFBOX-2045 Project: PDFBox Issue Type: Bug Components: AcroForm, Utilities Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0 Reporter: Gerhard Temper Attachments: specialpdf.pdf Merging attached PDF results in a PDF consisting only of the special PDF ignoring all other PDFs without any error. Command line to reproduce the problem: java -jar pdfbox-app-1.8.4.jar PDFMerger page1.pdf specialpdf.pdf result.pdf -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (PDFBOX-2045) Merging PDFs has no effect
[ https://issues.apache.org/jira/browse/PDFBOX-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-2045: Description: Merging attached special PDF (a form) results in a PDF consisting only of the PDF form ignoring all other PDFs without any error. Command line to reproduce the problem: java -jar pdfbox-app-1.8.4.jar PDFMerger page1.pdf specialpdf.pdf result.pdf was: Merging attached PDF results in a PDF consisting only of the special PDF ignoring all other PDFs without any error. Command line to reproduce the problem: java -jar pdfbox-app-1.8.4.jar PDFMerger page1.pdf specialpdf.pdf result.pdf Merging PDFs has no effect -- Key: PDFBOX-2045 URL: https://issues.apache.org/jira/browse/PDFBOX-2045 Project: PDFBox Issue Type: Bug Components: AcroForm, Utilities Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0 Reporter: Gerhard Temper Attachments: specialpdf.pdf Merging attached special PDF (a form) results in a PDF consisting only of the PDF form ignoring all other PDFs without any error. Command line to reproduce the problem: java -jar pdfbox-app-1.8.4.jar PDFMerger page1.pdf specialpdf.pdf result.pdf -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: 1.8.5 and JIRA
Hi,

On 03.05.2014 07:45, Tilman Hausherr wrote:
> Hello Andreas,
>
> Thanks for all your work; only one thing is missing, 1.8.5 is still listed as unreleased version in JIRA, e.g. here:
> https://issues.apache.org/jira/browse/PDFBOX/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

I wanted to wait until the release is officially announced. Thanks for the reminder.

> Tilman
>
> On 02.05.2014 09:27, Andreas Lehmkühler wrote:
>> Hi,
>>
>> due to the newest PDFBox 1.8.5 release I've closed all 1.8.5 related issues in a bulk operation. I've disabled the email notification to avoid an email flood. I've also added the all new version 1.8.6 for our next bugfix release ...
>>
>> I'll update the download page once the mirrors copied the version from our repository.
>>
>> BR
>> Andreas Lehmkühler

BR
Andreas Lehmkühler
[jira] [Commented] (PDFBOX-2045) Merging PDFs has no effect
[ https://issues.apache.org/jira/browse/PDFBOX-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988772#comment-13988772 ]

Tilman Hausherr commented on PDFBOX-2045:

I slightly clarified your text above to mention that it is a form. The weird thing is that PDFBox renders all pages, and so does GSView. Only Acrobat Viewer doesn't.

Merging PDFs has no effect

Key: PDFBOX-2045
URL: https://issues.apache.org/jira/browse/PDFBOX-2045
Project: PDFBox
Issue Type: Bug
Components: AcroForm, Utilities
Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
Reporter: Gerhard Temper
Attachments: specialpdf.pdf

Merging attached special PDF (a form) results in a PDF consisting only of the PDF form ignoring all other PDFs without any error.
Command line to reproduce the problem:
java -jar pdfbox-app-1.8.4.jar PDFMerger page1.pdf specialpdf.pdf result.pdf

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (PDFBOX-2045) Merging PDFs has no effect
[ https://issues.apache.org/jira/browse/PDFBOX-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13988774#comment-13988774 ] Tilman Hausherr commented on PDFBOX-2045: - I tried your test with the file at http://www.studentenwerk-berlin.de/wohnen/dokumente/41%20%7C%20Anmeldung.pdf which has a form on page 3 and 4. After merging, all pages can be displayed with Acrobat, but the form capability is lost. Merging PDFs has no effect -- Key: PDFBOX-2045 URL: https://issues.apache.org/jira/browse/PDFBOX-2045 Project: PDFBox Issue Type: Bug Components: AcroForm, Utilities Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0 Reporter: Gerhard Temper Attachments: specialpdf.pdf Merging attached special PDF (a form) results in a PDF consisting only of the PDF form ignoring all other PDFs without any error. Command line to reproduce the problem: java -jar pdfbox-app-1.8.4.jar PDFMerger page1.pdf specialpdf.pdf result.pdf -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (PDFBOX-2033) Narrow long pdf is printed blank
[ https://issues.apache.org/jira/browse/PDFBOX-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed PDFBOX-2033. --- Resolution: Not a Problem Thanks, closing this, in retrospect, this was rather a howto question and not a bug. Narrow long pdf is printed blank Key: PDFBOX-2033 URL: https://issues.apache.org/jira/browse/PDFBOX-2033 Project: PDFBox Issue Type: Bug Components: Rendering Affects Versions: 2.0.0 Environment: W7 Reporter: Tilman Hausherr Based on the post of Norbert Sándor to the user list: When printing a 198.425 x 1700.787 sized page (70 x 600 mm) using new PDFPrinter(pdfDocument, printJob).silentPrint() to a virtual printer (e.g. PDFCreator or CIB), then the resulting PDF has a page size of 8,26x11,69 and the content is horizontally centered on the page. I was able to reproduce the problem, and also to print on a virtual printer that creates new PDFs of that size, by using the longest constructor of PDFPrinter(), but then the output is blank. While it doesn't seem useful to print a PDF to a PDF, the problem might make sense when printing to a cash register receipt printer. -- This message was sent by Atlassian JIRA (v6.2#6252)
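The page sizes quoted in this report can be checked with a short unit conversion, assuming the usual PDF default of 72 points per inch. The class `PageUnits` is a hypothetical helper written for this note. Under that assumption, 198.425 x 1700.787 points is 70 x 600 mm, and the 8,26 x 11,69 (inch) result is close to A4 (210 x 297 mm), which would explain why the narrow receipt page ended up centred on a standard sheet.

```java
// Hypothetical helper for checking the page sizes quoted in the report.
class PageUnits {
    static final double POINTS_PER_INCH = 72.0; // default user space unit, assumed
    static final double MM_PER_INCH = 25.4;

    // Converts a length in points to millimetres.
    static double pointsToMm(double points) {
        return points / POINTS_PER_INCH * MM_PER_INCH;
    }

    // Converts a length in millimetres to inches.
    static double mmToInches(double mm) {
        return mm / MM_PER_INCH;
    }
}
```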
[jira] [Updated] (PDFBOX-1961) Page with annotations renders fine with 1.8 but not with 2.0
[ https://issues.apache.org/jira/browse/PDFBOX-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-1961: Labels: Annotations regression (was: regression) Page with annotations renders fine with 1.8 but not with 2.0 Key: PDFBOX-1961 URL: https://issues.apache.org/jira/browse/PDFBOX-1961 Project: PDFBox Issue Type: Bug Components: Rendering Affects Versions: 2.0.0 Reporter: Tilman Hausherr Labels: Annotations, regression Fix For: 2.0.0 Attachments: annots.pdf, annots.pdf-2-v18.png, annots.pdf-2-v2.png Page 2 of the attached PDF (from a ghostscript installation) renders fine with 1.8 but not with 2.0. The other pages are not rendered properly with any version. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (PDFBOX-2014) PDAnnotationLink
[ https://issues.apache.org/jira/browse/PDFBOX-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988811#comment-13988811 ]

Tilman Hausherr commented on PDFBOX-2014:

Do you have a sample PDF that has what you want?

PDAnnotationLink

Key: PDFBOX-2014
URL: https://issues.apache.org/jira/browse/PDFBOX-2014
Project: PDFBox
Issue Type: Bug
Affects Versions: 1.6.0
Reporter: WALID CHARFI

Hi,
I want to draw a text link without any hover effect and without a solid border. I tried this code but it does not work. Could you provide me with a solution please? Thank you very much.

    PDBorderStyleDictionary borderULine = new PDBorderStyleDictionary();
    borderULine.setStyle(PDBorderStyleDictionary.STYLE_INSET);
    PDAnnotationLink txtLink = new PDAnnotationLink();
    txtLink.setRectangle(position);
    PDActionURI action = new PDActionURI();
    action.setURI(pdfPara.getUri());
    txtLink.setAction(action);
    txtLink.setBorderStyle(borderULine);
    txtLink.setHighlightMode(PDAnnotationLink.HIGHLIGHT_MODE_NONE);

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (PDFBOX-167) wrong words highlighted
[ https://issues.apache.org/jira/browse/PDFBOX-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-167.

Resolution: Cannot Reproduce

In October 2013, I e-mailed both people mentioned in this issue:
{quote}
Is this still an issue? I looked at the code and it is different than the one mentioned. But I can't test the code mentioned because the links are broken.
{quote}
I never got a response. I am thus closing this issue.

wrong words highlighted

Key: PDFBOX-167
URL: https://issues.apache.org/jira/browse/PDFBOX-167
Project: PDFBox
Issue Type: Bug
Priority: Minor

[imported from SourceForge]
http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1487217
Originally submitted by nobody on 2006-05-12 01:51.

PDFBox appears to have a problem properly highlighting words from the following PDF. I am using a very simple servlet to do this, and it works fine for most PDFs. With this one, however, it highlights the wrong words. Unfortunately I am not smart enough to figure out what is going on myself, so could anybody help me with this?

The files can be found here:
http://www.impressie.nl/matthijs/PDFHighlight.java
http://www.impressie.nl/matthijs/Rectificatie%20van%20Richtlijn%20Handhaving%20van%20Intellectuele-eigendomsrechten.pdf

Matthijs Bierman
matth...@impressie.nl

[comment on SourceForge]
Originally sent by nobody.
Logged In: NO

That document is in a password-protected area, so it can't be read by anyone else!

I have a similar problem with this doc:
http://www.usc.edu/schools/business/FBE/seminars/papers/AE_4-28-06_FISMAN-parking.pdf
... but I think I've figured this one out. The second page of this document is entirely blank, and checking by hand I can see that the highlights after p1 are all in positions that would be correct if they were one page further on; it appears that the page count isn't being incremented for the blank page.

Tracing this back in the code I see this:

    PDStream contentStream = nextPage.getContents();
    if( contentStream != null )
    {
        COSStream contents = contentStream.getStream();
        processPage( nextPage, contents );
    }

(PDFTextStripper.java line 255). That's skipping the blank page and giving me the wrong page number, I think, and I guess that the problem can be resolved by moving currentPageNo++ from inside processPage to just above that test.

-- brian.ew...@gmail.com

--
This message was sent by Atlassian JIRA (v6.2#6252)
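The off-by-one described in the SourceForge comment can be reproduced without PDFBox. In the sketch below a page is modelled as a string and a blank page as null; `PageCounter` and both method names are hypothetical. The first method increments the page number only when a page has content, as the quoted snippet appears to do; the second increments it before the test, as the commenter suggests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the page numbering issue; a null entry is a blank page.
class PageCounter {
    // Reported behaviour: the counter only advances for pages with content,
    // so every page after a blank one gets a number that is too low.
    static List<Integer> numbersSkippingBlank(List<String> pages) {
        List<Integer> numbers = new ArrayList<>();
        int currentPageNo = 0;
        for (String contents : pages) {
            if (contents != null) {
                currentPageNo++; // increment happens "inside processPage"
                numbers.add(currentPageNo);
            }
        }
        return numbers;
    }

    // Suggested behaviour: advance the counter for every page, blank or not.
    static List<Integer> numbersCountingBlank(List<String> pages) {
        List<Integer> numbers = new ArrayList<>();
        int currentPageNo = 0;
        for (String contents : pages) {
            currentPageNo++; // moved above the null test
            if (contents != null) {
                numbers.add(currentPageNo);
            }
        }
        return numbers;
    }
}
```

For a three-page document whose second page is blank, the first variant numbers the processed pages 1 and 2, while the second numbers them 1 and 3.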
[jira] [Commented] (PDFBOX-1731) Converting pdf to Image
[ https://issues.apache.org/jira/browse/PDFBOX-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988814#comment-13988814 ]

Tilman Hausherr commented on PDFBOX-1731:
-----------------------------------------

[~paulocamargomello] does your problem still happen with the current (just released) version? We have lessened the memory footprint somewhat.

Converting pdf to Image
-----------------------

    Key: PDFBOX-1731
    URL: https://issues.apache.org/jira/browse/PDFBOX-1731
    Project: PDFBox
    Issue Type: Bug
    Affects Versions: 1.8.2
    Environment: Windows 8 and Linux JDK 1.7
    Reporter: Paulo R C Mello Junior
    Labels: newbie

I'm trying to convert a pdf page to image but an exception occurs:

17:28:20,652 ERROR [org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap] (Thread-69) Something went wrong ... the pixelmap doesn't contain any data.
17:28:20,654 WARN [org.apache.pdfbox.util.operator.pagedrawer.Invoke] (Thread-69) getRGBImage returned NULL
17:28:20,661 INFO [org.apache.pdfbox.util.PDFStreamEngine] (Thread-69) unsupported/disabled operation: i
17:28:36,809 ERROR [stderr] (Thread-70) Exception in thread "Thread-70" java.lang.OutOfMemoryError: Java heap space
17:28:36,811 ERROR [stderr] (Thread-70) at java.awt.image.DataBufferByte.<init>(DataBufferByte.java:92)
17:28:36,812 ERROR [stderr] (Thread-70) at java.awt.image.ComponentSampleModel.createDataBuffer(ComponentSampleModel.java:415)
17:28:36,814 ERROR [stderr] (Thread-70) at java.awt.image.Raster.createWritableRaster(Raster.java:941)
17:28:36,814 ERROR [stderr] (Thread-70) at javax.imageio.ImageTypeSpecifier.createBufferedImage(ImageTypeSpecifier.java:1073)
17:28:36,815 ERROR [stderr] (Thread-70) at javax.imageio.ImageReader.getDestination(ImageReader.java:2896)
17:28:36,816 ERROR [stderr] (Thread-70) at com.sun.imageio.plugins.jpeg.JPEGImageReader.readInternal(JPEGImageReader.java:1066)
17:28:36,817 ERROR [stderr] (Thread-70) at com.sun.imageio.plugins.jpeg.JPEGImageReader.read(JPEGImageReader.java:1034)
17:28:36,818 ERROR [stderr] (Thread-70) at javax.imageio.ImageIO.read(ImageIO.java:1448)
17:28:36,818 ERROR [stderr] (Thread-70) at javax.imageio.ImageIO.read(ImageIO.java:1352)
17:28:36,819 ERROR [stderr] (Thread-70) at org.apache.pdfbox.pdmodel.graphics.xobject.PDJpeg.getRGBImage(PDJpeg.java:264)
17:28:36,820 ERROR [stderr] (Thread-70) at org.apache.pdfbox.util.operator.pagedrawer.Invoke.process(Invoke.java:83)
17:28:36,821 ERROR [stderr] (Thread-70) at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
17:28:36,823 ERROR [stderr] (Thread-70) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
17:28:36,824 ERROR [stderr] (Thread-70) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
17:28:36,825 ERROR [stderr] (Thread-70) at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
17:28:36,826 ERROR [stderr] (Thread-70) at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:125)
17:28:36,827 ERROR [stderr] (Thread-70) at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:769)

My code:

    public static List<BufferedImage> getPdfPagesAsImages(String pdfPath) throws IOException {
        File f = new File(pdfPath);
        PDDocument pdfDocument = null;
        pdfDocument = PDDocument.loadNonSeq(f, null);
        List<BufferedImage> bImages = new ArrayList<BufferedImage>();
        try {
            System.out.println(pdfPath);
            int resolution = 185;
            if (pdfDocument != null) {
                @SuppressWarnings("unchecked")
                List<PDPage> pages = (List<PDPage>) pdfDocument
                        .getDocumentCatalog().getAllPages();
                for (PDPage p : pages) {
                    BufferedImage convertedImage = p.convertToImage(
                            BufferedImage.TYPE_INT_RGB, resolution);
                    if (isNegativeImage(convertedImage)) {
                        bImages.add(invertNegativeImage(convertedImage));
                    } else {
                        bImages.add(convertedImage);
                    }
                }
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
            e.getMessage();
            e.getCause();
        }
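As a rough aid for reasoning about the heap usage in the report above, the sketch below estimates how many bytes one rendered page occupies when it is kept as a TYPE_INT_RGB BufferedImage. It is only an approximation under stated assumptions: the page size (US Letter, 612 x 792 points) is assumed, since the report does not give it, and the class and method names are made up for illustration. It also covers only the output rasters that the reporter's code collects in a list; the trace shows the OutOfMemoryError being raised while ImageIO decodes an embedded JPEG, which needs additional buffers on top of this.

```java
/** Rough estimate of the heap needed for rendered page images (illustrative only). */
final class RasterMemoryEstimate {

    /** PDF user space uses 72 units (points) per inch. */
    private static final double POINTS_PER_INCH = 72.0;

    /** Pixels along one side of a page of the given length (in points) at the given DPI. */
    static int pixels(double lengthInPoints, int dpi) {
        return (int) Math.round(lengthInPoints * dpi / POINTS_PER_INCH);
    }

    /** Bytes for one TYPE_INT_RGB raster: one 4-byte int per pixel. */
    static long bytesPerPage(double widthPt, double heightPt, int dpi) {
        return 4L * pixels(widthPt, dpi) * pixels(heightPt, dpi);
    }

    public static void main(String[] args) {
        int dpi = 185;                       // resolution used in the report
        double widthPt = 612, heightPt = 792; // assumed page size: US Letter
        long perPage = bytesPerPage(widthPt, heightPt, dpi);
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.printf("one page: %d bytes, pages that fit in the heap: %d%n",
                perPage, maxHeap / perPage);
    }
}
```

With these assumed values one page takes 12,804,220 bytes, so a list holding every page of a long document grows quickly. Writing each image out (or otherwise processing it) before rendering the next page would keep only one raster alive at a time.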
[jira] [Closed] (PDFBOX-9) HTML - PDF
[ https://issues.apache.org/jira/browse/PDFBOX-9?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-9.
--------------------------------

    Resolution: Won't Fix

This is beyond the scope of PDFBox. You can print your HTML to a virtual printer, which will create a PDF.

HTML - PDF
----------

    Key: PDFBOX-9
    URL: https://issues.apache.org/jira/browse/PDFBOX-9
    Project: PDFBox
    Issue Type: New Feature
    Components: Utilities

[imported from SourceForge]
http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=841169
Originally submitted by nobody on 2003-11-12 20:03.

It would be really nice to take an HTML document and create a PDF from it.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (PDFBOX-6) PDF to HTML conversion
[ https://issues.apache.org/jira/browse/PDFBOX-6?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-6.
--------------------------------

    Resolution: Implemented

This was implemented long ago by John J. Barton in PDFText2HTML.java.

PDF to HTML conversion
----------------------

    Key: PDFBOX-6
    URL: https://issues.apache.org/jira/browse/PDFBOX-6
    Project: PDFBox
    Issue Type: New Feature
    Components: Utilities

[imported from SourceForge]
http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=802407
Originally submitted by winstanley_john on 2003-09-08 04:42.

PDF to HTML conversion. Preserve formatting. Check out www.sourceforge.net/projects/pdftohtml for a hack of this process.

[comment on SourceForge]
Originally sent by winstanley_john.
Logged In: YES user_id=747013

Also, conversion to XML, Word, etc. would be amazing.
[jira] [Closed] (PDFBOX-45) Support incremental save
[ https://issues.apache.org/jira/browse/PDFBOX-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-45.
---------------------------------

    Resolution: Fixed

This has been implemented in PDDocument.saveIncremental().

Support incremental save
------------------------

    Key: PDFBOX-45
    URL: https://issues.apache.org/jira/browse/PDFBOX-45
    Project: PDFBox
    Issue Type: New Feature
    Components: Writing

[imported from SourceForge]
http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1157431
Originally submitted by purplish_cat on 2005-03-05 12:28.

After opening a PDF file and changing objects in it, allow saving the changes incrementally to the same file instead of creating a completely new file.

[comment on SourceForge]
Originally sent by benlitchfield.
Logged In: YES user_id=601708

See forum thread at https://sourceforge.net/forum/message.php?msg_id=3032112
[jira] [Commented] (PDFBOX-1845) PDDocument.load() give Error: Expected a long type at offset 1633
[ https://issues.apache.org/jira/browse/PDFBOX-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988831#comment-13988831 ]

Tilman Hausherr commented on PDFBOX-1845:
-----------------------------------------

I uncompressed the first PDF with qpdf and now PDFBox can process it. If [~david.keller] wants to render this file he won't like it, because the images are compressed with JPEG2000 and there's a bug in the plugin.

PDDocument.load() give Error: Expected a long type at offset 1633
-----------------------------------------------------------------

    Key: PDFBOX-1845
    URL: https://issues.apache.org/jira/browse/PDFBOX-1845
    Project: PDFBox
    Issue Type: Bug
    Components: Parsing
    Affects Versions: 1.8.0, 2.0.0
    Environment: Windows 8.1
    Reporter: David KELLER
    Priority: Blocker
    Attachments: 14 01 2014-2.pdf, 14 01 2014.pdf

I run this simple program with the attached file (a scanned OCR document from Nuance Omnipage 18):

    public static void main(String[] args) throws Exception {
        System.out.println("Start SplitFileTest...");
        String path = "D:\\test\\batch\\scan_manual\\courrier\\david.keller\\";
        String pdfFile = path + "14 01 2014.pdf";
        FileInputStream pdfInputStream = new FileInputStream(pdfFile);
        PDDocument pdDocument = PDDocument.load(pdfInputStream);
        List<PDPage> pages = pdDocument.getDocumentCatalog().getAllPages();
        pdfInputStream.close();
    }

With version 1.8.0 I get this error:

java.io.IOException: Error: Expected an integer type, actual='12977[373'
    at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1622)
    at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:100)
    at org.apache.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:604)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1187)

I have also just built 2.0.0 from the latest source code, and I get this error:

java.io.IOException: Error: Expected a long type at offset 1633
    at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1682)
    at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:100)
    at org.apache.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:663)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1101)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1069)
[jira] [Commented] (PDFBOX-1122) Parsing Error, Skipping Object
[ https://issues.apache.org/jira/browse/PDFBOX-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988834#comment-13988834 ]

Rogério Pereira Araújo commented on PDFBOX-1122:
------------------------------------------------

I'm trying to parse several ebooks on my local filesystem using Nutch, which uses Tika and PDFBox to do the parsing. No matter which file I use, I always get the same error as described by Raihan. Anyway, I'll be attaching one of them.

Parsing Error, Skipping Object
------------------------------

    Key: PDFBOX-1122
    URL: https://issues.apache.org/jira/browse/PDFBOX-1122
    Project: PDFBox
    Issue Type: Bug
    Components: Parsing
    Affects Versions: 1.6.0
    Environment: Working with Windows 7 in eclipse.
    Reporter: Raihan Jamal
    Assignee: Andreas Lehmkühler
    Labels: pdfbox
    Original Estimate: 336h
    Remaining Estimate: 336h

Parsing Error, Skipping Object
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@38011d45
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:439)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1088)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1053)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.Tika.parseToString(Tika.java:357)
    at edu.uci.ics.crawler4j.crawler.BinaryParser.parse(BinaryParser.java:37)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.handleBinary(WebCrawler.java:223)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:462)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:129)
    at java.lang.Thread.run(Thread.java:662)
Did not found XRef object at specified startxref position 0

This is the sample URL where I am facing this problem:
http://www.qualcomm.com/documents/files/rev-b-enhanced-mobile-broadband-for-all.pdf

Any suggestions as to why this is happening? Or is it a bug?
[jira] [Commented] (PDFBOX-1122) Parsing Error, Skipping Object
[ https://issues.apache.org/jira/browse/PDFBOX-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988836#comment-13988836 ]

Rogério Pereira Araújo commented on PDFBOX-1122:
------------------------------------------------

I couldn't attach the PDF to the ticket, but here's the link:
https://dl.dropboxusercontent.com/u/13175227/c-programming-a-modern-approach-2nd-edition.9780393979503.52279.pdf
[jira] [Reopened] (PDFBOX-45) Support incremental save
[ https://issues.apache.org/jira/browse/PDFBOX-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Chojecki reopened PDFBOX-45:
-----------------------------------

The saveIncremental(...) method was a first attempt at this, but unfortunately it only works for signatures. The recursive writer makes this feature hard to implement, because it always starts with the catalog and, if for example a new page was added, all elements on the way to this new page need to be written again.

So we need an extra writer for this task that does not work recursively and instead uses, perhaps, a set or a map of the objects that need to be written. Alternatively, the writer could always iterate over all objects and write only the new ones. I would prefer a new writer, because it seems cleaner to write objects from a collection than to iterate through all objects looking for new ones and possibly ending up in loops.

So it's a bigger task for version 2.0.
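To make the "write from a collection" idea in the comment above more concrete, here is a minimal sketch. It is not PDFBox code: Obj, markDirty and writeIncrement are hypothetical names invented for this illustration, and "writing" is reduced to recording object numbers. The point is only that a writer driven by a set of changed objects never has to walk the object graph from the catalog, so reference cycles cannot send it into a loop.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch of a non-recursive incremental writer; all names are hypothetical. */
final class IncrementalWriteSketch {

    /** Stand-in for a PDF object; not a PDFBox class. */
    static final class Obj {
        final int number;
        final List<Obj> children = new ArrayList<>();
        Obj(int number) { this.number = number; }
    }

    /** Objects queued for the incremental section: insertion order, no duplicates. */
    private final Set<Obj> toWrite = new LinkedHashSet<>();

    /** Called whenever an object is created or changed. */
    void markDirty(Obj o) {
        toWrite.add(o);
    }

    /** Writes only the queued objects; the graph itself is never traversed. */
    List<Integer> writeIncrement() {
        List<Integer> written = new ArrayList<>();
        for (Obj o : toWrite) {
            written.add(o.number);   // a real writer would serialize the object here
        }
        toWrite.clear();
        return written;
    }

    public static void main(String[] args) {
        Obj catalog = new Obj(1), pages = new Obj(2), page1 = new Obj(3), newPage = new Obj(4);
        catalog.children.add(pages);
        pages.children.add(page1);
        page1.children.add(catalog);   // a cycle, as object graphs can contain back references

        IncrementalWriteSketch writer = new IncrementalWriteSketch();
        pages.children.add(newPage);   // adding a page changes its parent node too
        writer.markDirty(newPage);
        writer.markDirty(pages);
        writer.markDirty(pages);       // marking twice has no further effect

        System.out.println(writer.writeIncrement());
    }
}
```

In this sketch the catalog and the untouched page are never visited. Whether a set or a map (for example keyed by object number) is the better container is left open, as in the comment.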
Re: [VOTE] Release Apache PDFBox 1.8.5
+1

-- John

On 28 Apr 2014, at 10:57, Andreas Lehmkuehler andr...@lehmi.de wrote:

Hi,

a candidate for the PDFBox 1.8.5 release is available at:

    http://people.apache.org/~lehmi/pdfbox/1.8.5/

The release candidate is a zip archive of the sources in:

    http://svn.apache.org/repos/asf/pdfbox/tags/1.8.5/

The SHA1 checksum of the archive is fc01acc1e2575ff1f40e44e949a862fcae076029.

Please vote on releasing this package as Apache PDFBox 1.8.5. The vote is open for the next 72 hours and passes if a majority of at least three +1 PDFBox PMC votes are cast.

    [ ] +1 Release this package as Apache PDFBox 1.8.5
    [ ] -1 Do not release this package because...

Here is my +1

BR
Andreas Lehmkühler
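For anyone who wants to check the SHA1 checksum quoted in the vote mail before voting, a small sketch using only the JDK follows. The class name and the command-line arguments are assumptions made for this example; the vote mail does not prescribe any particular tool, and a command-line checksum utility would do the same job.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Computes a SHA-1 checksum and compares it with an expected value (illustrative). */
final class Sha1Check {

    /** Lower-case hex SHA-1 digest of the given bytes. */
    static String sha1Hex(byte[] data) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a standard algorithm name every Java platform must provide
            throw new IllegalStateException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded archive (assumed to fit in memory)
        // args[1]: expected checksum, e.g. the one from the vote mail
        byte[] archive = Files.readAllBytes(Paths.get(args[0]));
        String actual = sha1Hex(archive);
        System.out.println(actual.equalsIgnoreCase(args[1])
                ? "checksum OK"
                : "MISMATCH: " + actual);
    }
}
```

The archive is read fully into memory here for brevity; for very large files, feeding the digest from a stream in chunks would be the gentler choice.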
[jira] [Commented] (PDFBOX-1122) Parsing Error, Skipping Object
[ https://issues.apache.org/jira/browse/PDFBOX-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988924#comment-13988924 ]

Tilman Hausherr commented on PDFBOX-1122:
-----------------------------------------

I was able to parse that one with my own application in the 2.0 version, both with load() and loadNonSeq(), by setting -Xmx3g. With the current 1.8 version, I could do it without modifications. Then I downloaded the 1.8.4 app and used the PDFReader command, and it also worked.

How do you know that Apache Nutch is using 1.8.4? A look at their readme shows this:

    https://www.apache.org/dist/nutch/2.2.1/CHANGES-2.2.1.txt
    "Upgrade to PDFBox 0.7.3."

And in NUTCH-1770, you write that it fails on all PDFs.
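One way to answer the "which version is actually in use?" question from inside the running application is to ask the JVM where a class was loaded from and what version its jar declares. The sketch below uses only standard JDK calls; the class and method names are made up for this example, the PDFBox class name is merely a default argument, and the version is only available if the jar's manifest contains an Implementation-Version entry, which is an assumption here rather than something the thread confirms.

```java
import java.security.CodeSource;
import java.security.ProtectionDomain;

/** Reports the declared version and origin of a class on the classpath (illustrative). */
final class LibraryVersion {

    /** Implementation-Version from the manifest of the class's package, or "unknown". */
    static String versionOf(Class<?> type) {
        Package pkg = type.getPackage();
        String version = (pkg == null) ? null : pkg.getImplementationVersion();
        return (version == null) ? "unknown" : version;
    }

    /** Location (usually a jar) the class was loaded from, or "unknown". */
    static String locationOf(Class<?> type) {
        ProtectionDomain domain = type.getProtectionDomain();
        CodeSource source = (domain == null) ? null : domain.getCodeSource();
        return (source == null || source.getLocation() == null)
                ? "unknown"
                : source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Default class name is an assumption; any class from the library in question works.
        String name = args.length > 0 ? args[0] : "org.apache.pdfbox.pdmodel.PDDocument";
        Class<?> type = Class.forName(name);
        System.out.println(versionOf(type) + " from " + locationOf(type));
    }
}
```

Run with the same classpath as the crawler, the printed jar location often settles the question even when the manifest carries no version.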