[jira] [Commented] (PDFBOX-2054) Remove System.out.println()

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988598#comment-13988598
 ] 

Tilman Hausherr commented on PDFBOX-2054:
-

Committed a first round in rev 1592153 for the trunk and rev 1592155 for the 
1.8 branch.

 Remove System.out.println()
 ---

 Key: PDFBOX-2054
 URL: https://issues.apache.org/jira/browse/PDFBOX-2054
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
Reporter: Hong-Thai Nguyen
Assignee: Tilman Hausherr
Priority: Minor

 For example, in GlyfSimpleDescript.java:
 {code}
 ...
 catch (ArrayIndexOutOfBoundsException e)
 {
     System.out.println("error: array index out of bounds");
 }
 {code}
 and also 'printStackTrace', as in PageDrawer.java:
 {code}
 ...
 catch( IOException io )
 {
     io.printStackTrace();
 }
 {code}
 The code should forward the exception or stay silent.
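One way to act on this report, as a minimal sketch only: route the message through a logger instead of the console, or forward the exception to the caller. The class and method names below (`GlyphReader`, `readCoordinate`, `readCoordinateStrict`) are hypothetical and are not taken from the PDFBox sources; `java.util.logging` is used here only to keep the example free of external dependencies.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for code such as GlyfSimpleDescript; not PDFBox source.
final class GlyphReader {
    private static final Logger LOG = Logger.getLogger(GlyphReader.class.getName());

    // Option 1: report through a logger instead of System.out, then fall back to a default.
    static int readCoordinate(int[] coordinates, int index) {
        try {
            return coordinates[index];
        } catch (ArrayIndexOutOfBoundsException e) {
            LOG.log(Level.WARNING, "array index out of bounds: " + index, e);
            return 0;
        }
    }

    // Option 2: forward the exception to the caller with some context attached.
    static int readCoordinateStrict(int[] coordinates, int index) {
        try {
            return coordinates[index];
        } catch (ArrayIndexOutOfBoundsException e) {
            throw new IllegalStateException("glyph coordinate " + index + " is missing", e);
        }
    }
}
```

Either variant lets the application decide where diagnostics go, which console output and `printStackTrace()` do not.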



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (PDFBOX-2053) Issue with PDFBox position reading

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988602#comment-13988602
 ] 

Tilman Hausherr commented on PDFBOX-2053:
-

This is very similar to PDFBOX-62, although the fix I proposed there doesn't 
work here, for a reason that I don't know yet.

 Issue with PDFBox position reading
 --

 Key: PDFBOX-2053
 URL: https://issues.apache.org/jira/browse/PDFBOX-2053
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 1.8.3
Reporter: Orbel Mkrtchyan
 Attachments: test.pdf


 Using PDFBox 1.8.4,
 bug #1:
   PDDocument doc = new PDDocument();
   doc.load("test-pcc7247.pdf");
   doc.save("out.pdf");
   doc.close();
 The resulting file is corrupted, contains 0 pages and cannot be viewed by 
 Acrobat Reader.
 bug #2: consider the following code snippet. The code runs like this:
   Extractor extractor = new Extractor();
   extractor.writeText(pdDoc, output);
 Using the code defined like this:
 public class Extractor extends PDFTextStripper {
 ...
 protected void writePage() throws IOException
 {
     for( int i = 0; i < charactersByArticle.size(); i++)
     {
         List<TextPosition> textList = charactersByArticle.get( i );
         Iterator<TextPosition> textIter = textList.iterator();
         while( textIter.hasNext() )
         {
             TextPosition position = (TextPosition)textIter.next();
 In the given piece of code, the position variable correctly iterates through 
 the letters of the first line of the provided PDF document, but its coordinates 
 (x, y, widths, etc.) are always the same. Just to be clear, 1 position always 
 relates to 1 letter, and the length of its widths array always equals 1. So we 
 get the same coordinates for every letter in a line. The expected behaviour is 
 either new coordinates per letter or a widths[] array that contains the widths 
 for the characters of a whole line of text.





[jira] [Commented] (PDFBOX-1584) Add unit test for RandomAccessFileOutputStream

2014-05-03 Thread Ajay Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988605#comment-13988605
 ] 

Ajay Bhat commented on PDFBOX-1584:
---

Has this patch been committed yet?

 Add unit test for RandomAccessFileOutputStream
 --

 Key: PDFBOX-1584
 URL: https://issues.apache.org/jira/browse/PDFBOX-1584
 Project: PDFBox
  Issue Type: Test
  Components: Writing
Affects Versions: 1.8.1
Reporter: Fredrik Kjellberg
Priority: Minor
 Attachments: TestRandomAccessFileOutputStream_diff.txt


 This patch includes a unit test for RandomAccessFileOutputStream





[jira] [Comment Edited] (PDFBOX-1584) Add unit test for RandomAccessFileOutputStream

2014-05-03 Thread Ajay Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988605#comment-13988605
 ] 

Ajay Bhat edited comment on PDFBOX-1584 at 5/3/14 7:07 AM:
---

Has this patch been committed?


was (Author: ajay bhat):
Has this patch been commited yet?

 Add unit test for RandomAccessFileOutputStream
 --

 Key: PDFBOX-1584
 URL: https://issues.apache.org/jira/browse/PDFBOX-1584
 Project: PDFBox
  Issue Type: Test
  Components: Writing
Affects Versions: 1.8.1
Reporter: Fredrik Kjellberg
Priority: Minor
 Attachments: TestRandomAccessFileOutputStream_diff.txt




[jira] [Commented] (PDFBOX-1584) Add unit test for RandomAccessFileOutputStream

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988643#comment-13988643
 ] 

Tilman Hausherr commented on PDFBOX-1584:
-

No. This happens sometimes :-(

I looked at the patch, then at the timeline of the submitter ([~frkj]). This 
is actually part of a bigger discussion about problems with 
RandomAccessFileOutputStream and RandomAccessFileInputStream. I will try to 
understand what this is about. Please be patient.

 Add unit test for RandomAccessFileOutputStream
 --

 Key: PDFBOX-1584
 URL: https://issues.apache.org/jira/browse/PDFBOX-1584
 Project: PDFBox
  Issue Type: Test
  Components: Writing
Affects Versions: 1.8.1
Reporter: Fredrik Kjellberg
Priority: Minor
 Attachments: TestRandomAccessFileOutputStream_diff.txt




[jira] [Updated] (PDFBOX-62) Incorrect (zero) character widths returned in some docs

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-62?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-62:
--

Attachment: (was: PDTrueTypeFont.diff)

 Incorrect (zero) character widths returned in some docs
 ---

 Key: PDFBOX-62
 URL: https://issues.apache.org/jira/browse/PDFBOX-62
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering, Text extraction
Assignee: Andreas Lehmkühler
 Attachments: 5542.pdf, pdfbox-2006-zerowidth.pdf-1.png, 
 pdfbox-62-zerowidth.pdf-1.png


 [imported from SourceForge]
 http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1216674
 Originally submitted by tamirhassan on 2005-06-07 13:42.
 For certain PDF documents (such as the one attached) 
 the character/string widths (as obtained e.g. by the 
 PDFont.getStringWidth method) are not returned 
 correctly, i.e. they appear to be correct for punctuation 
 characters but are zero for alphanumeric characters.  
 It seems as if these alphanumeric characters are NOT 
 within PDFont.firstChar and PDFont.lastChar in the 
 Type 1 font.  The method therefore attempts to obtain 
 the font widths from the AFM (font metric) file, but fails 
 (silently) with a 'resource is null' logline message.
 (Note that this problem doesn't seem to occur with Type 
 1 fonts in other documents.)
 A more detailed discussion regarding this issue can be 
 found in this link:
 http://sourceforge.net/forum/forum.php?thread_id=1260349&forum_id=267205
 Thanks in advance for any help that can be obtained,
 Tam
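The lookup described in this report can be sketched roughly as follows, assuming a font that stores explicit widths only for the codes between its first and last character and otherwise consults a metrics table. `WidthLookup` and its fields are hypothetical names, not the PDFont API; the sketch only shows how a missing metrics resource ends up as a silent zero width.

```java
import java.util.Map;

// Hypothetical model of the width lookup described above; not the PDFont implementation.
final class WidthLookup {
    private final int firstChar;
    private final float[] widths;              // widths for codes firstChar .. firstChar + widths.length - 1
    private final Map<Integer, Float> metrics; // stands in for the AFM data; null if the resource was not found

    WidthLookup(int firstChar, float[] widths, Map<Integer, Float> metrics) {
        this.firstChar = firstChar;
        this.widths = widths;
        this.metrics = metrics;
    }

    float getWidth(int code) {
        int index = code - firstChar;
        if (index >= 0 && index < widths.length) {
            return widths[index]; // code lies inside the explicit widths range
        }
        if (metrics == null) {
            return 0f; // the silent failure: no metrics resource, so the width is reported as zero
        }
        Float width = metrics.get(code);
        return width != null ? width : 0f;
    }
}
```

With a widths array that covers only punctuation codes and no metrics table, every letter falls through to the zero branch, which matches the symptom reported here.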





[jira] [Updated] (PDFBOX-62) Incorrect (zero) character widths returned in some docs

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-62?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-62:
--

Attachment: PDTrueTypeFont.diff

Updated the patch to handle the file from PDFBOX-2053; it maps missing Arial 
widths to Helvetica AFM files.
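As a rough illustration of that idea (not the contents of PDTrueTypeFont.diff, which is not shown in this thread), a fallback table can redirect a font that has no metrics of its own to a standard font that does. The class name, the alias entry and the sample width value are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical fallback table; names and values are illustrative only.
final class MetricsFallback {
    private static final Map<String, String> ALIASES = new HashMap<>();
    private static final Map<String, Map<Character, Float>> AFM = new HashMap<>();

    static {
        ALIASES.put("Arial", "Helvetica"); // font without metrics -> font with an AFM table
        Map<Character, Float> helvetica = new HashMap<>();
        helvetica.put('A', 667f);          // sample entry standing in for a full AFM table
        AFM.put("Helvetica", helvetica);
    }

    // Returns the width from the font's own table, then from its alias, else 0.
    static float widthOf(String fontName, char c) {
        Map<Character, Float> table = AFM.get(fontName);
        if (table == null) {
            String alias = ALIASES.get(fontName);
            table = alias != null ? AFM.get(alias) : null;
        }
        if (table == null) {
            return 0f;
        }
        Float width = table.get(c);
        return width != null ? width : 0f;
    }
}
```

Here `MetricsFallback.widthOf("Arial", 'A')` resolves through the alias, while a font with neither a table nor an alias still yields zero.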

 Incorrect (zero) character widths returned in some docs
 ---

 Key: PDFBOX-62
 URL: https://issues.apache.org/jira/browse/PDFBOX-62
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering, Text extraction
Assignee: Andreas Lehmkühler
 Attachments: 5542.pdf, PDTrueTypeFont.diff, 
 pdfbox-2006-zerowidth.pdf-1.png, pdfbox-62-zerowidth.pdf-1.png




[jira] [Commented] (PDFBOX-2053) Issue with PDFBox position reading

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988657#comment-13988657
 ] 

Tilman Hausherr commented on PDFBOX-2053:
-

I updated my fix for PDFBOX-62 and rendering works for me. The problems that 
you get with your Extractor class are caused by the zero-widths problem.

Re bug #1: correct your code to

{code}
PDDocument doc = PDDocument.load("test-pcc7247.pdf");
doc.save("out.pdf");
doc.close();
{code}
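Why the original snippet wrote an empty file, as a hedged illustration: `load` is a static factory that returns a new document, so calling it through an existing instance and discarding the result leaves that instance empty. The `Doc` class below is a hypothetical stand-in for `PDDocument`, written only so that the example runs without PDFBox on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for PDDocument; only the static-factory shape is mirrored.
final class Doc {
    private final List<String> pages = new ArrayList<>();

    // Static factory: builds and returns a NEW document and never touches an existing one.
    static Doc load(String fileName) {
        Doc loaded = new Doc();
        loaded.pages.add("page 1 of " + fileName);
        return loaded;
    }

    int getNumberOfPages() {
        return pages.size();
    }

    public static void main(String[] args) {
        Doc wrong = new Doc();
        Doc.load("test-pcc7247.pdf"); // what wrong.load(...) resolves to; the result is discarded
        Doc right = Doc.load("test-pcc7247.pdf");
        System.out.println(wrong.getNumberOfPages() + " vs " + right.getNumberOfPages());
    }
}
```

Saving `wrong` would write a document without pages, while `right` holds the loaded content.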


 Issue with PDFBox position reading
 --

 Key: PDFBOX-2053
 URL: https://issues.apache.org/jira/browse/PDFBOX-2053
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 1.8.3
Reporter: Orbel Mkrtchyan
 Attachments: test.pdf




[jira] [Closed] (PDFBOX-2053) Issue with PDFBox position reading

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-2053.
---

Resolution: Duplicate

 Issue with PDFBox position reading
 --

 Key: PDFBOX-2053
 URL: https://issues.apache.org/jira/browse/PDFBOX-2053
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 1.8.3
Reporter: Orbel Mkrtchyan
 Attachments: test.pdf




[jira] [Commented] (PDFBOX-1122) Parsing Error, Skipping Object

2014-05-03 Thread Rogério Pereira Araújo (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988671#comment-13988671
 ] 

Rogério Pereira Araújo commented on PDFBOX-1122:


I can confirm the same error with version 1.8.4 while parsing PDFs with Tika 
during a Nutch parsing job.

 Parsing Error, Skipping Object
 --

 Key: PDFBOX-1122
 URL: https://issues.apache.org/jira/browse/PDFBOX-1122
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 1.6.0
 Environment: Working with Windows 7 in eclipse.
Reporter: Raihan Jamal
Assignee: Andreas Lehmkühler
  Labels: pdfbox
   Original Estimate: 336h
  Remaining Estimate: 336h

 Parsing Error, Skipping Object
 java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@38011d45
   at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:439)
   at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552)
   at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184)
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1088)
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1053)
   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
   at org.apache.tika.Tika.parseToString(Tika.java:357)
   at edu.uci.ics.crawler4j.crawler.BinaryParser.parse(BinaryParser.java:37)
   at edu.uci.ics.crawler4j.crawler.WebCrawler.handleBinary(WebCrawler.java:223)
   at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:462)
   at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:129)
   at java.lang.Thread.run(Thread.java:662)
 Did not found XRef object at specified startxref position 0
 This is the sample URL where I am facing this problem:
 http://www.qualcomm.com/documents/files/rev-b-enhanced-mobile-broadband-for-all.pdf
 Any suggestions as to why this is happening? Or is it a bug?





[jira] [Commented] (PDFBOX-1122) Parsing Error, Skipping Object

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988685#comment-13988685
 ] 

Tilman Hausherr commented on PDFBOX-1122:
-

The old URL no longer works. What file did you use?

 Parsing Error, Skipping Object
 --

 Key: PDFBOX-1122
 URL: https://issues.apache.org/jira/browse/PDFBOX-1122
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 1.6.0
 Environment: Working with Windows 7 in eclipse.
Reporter: Raihan Jamal
Assignee: Andreas Lehmkühler
  Labels: pdfbox
   Original Estimate: 336h
  Remaining Estimate: 336h



[jira] [Resolved] (PDFBOX-1584) Add unit test for RandomAccessFileOutputStream

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-1584.
-

   Resolution: Fixed
Fix Version/s: 2.0.0
   1.8.6

Committed in rev 1592225 for the trunk and 1592226 for the 1.8 branch. Thanks! 
I won't commit PDFBOX-1582 for now as this doesn't seem to be complete.

 Add unit test for RandomAccessFileOutputStream
 --

 Key: PDFBOX-1584
 URL: https://issues.apache.org/jira/browse/PDFBOX-1584
 Project: PDFBox
  Issue Type: Test
  Components: Writing
Affects Versions: 1.8.1
Reporter: Fredrik Kjellberg
Priority: Minor
 Fix For: 1.8.6, 2.0.0

 Attachments: TestRandomAccessFileOutputStream_diff.txt




[jira] [Commented] (PDFBOX-2054) Remove System.out.println()

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988749#comment-13988749
 ] 

Tilman Hausherr commented on PDFBOX-2054:
-

Committed a second round in rev 1592251 for the trunk and rev 1592252 for the 
1.8 branch.

 Remove System.out.println()
 ---

 Key: PDFBOX-2054
 URL: https://issues.apache.org/jira/browse/PDFBOX-2054
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
Reporter: Hong-Thai Nguyen
Assignee: Tilman Hausherr
Priority: Minor



[jira] [Resolved] (PDFBOX-2054) Remove System.out.println()

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-2054.
-

   Resolution: Fixed
Fix Version/s: 2.0.0
   1.8.6

I'm done. I didn't touch preflight, examples and the GUI tools. Thanks for 
pointing us to these legacy problems.

 Remove System.out.println()
 ---

 Key: PDFBOX-2054
 URL: https://issues.apache.org/jira/browse/PDFBOX-2054
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
Reporter: Hong-Thai Nguyen
Assignee: Tilman Hausherr
Priority: Minor
 Fix For: 1.8.6, 2.0.0




[jira] [Updated] (PDFBOX-2045) Merging PDFs has no effect

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2045:


Component/s: AcroForm

 Merging PDFs has no effect
 --

 Key: PDFBOX-2045
 URL: https://issues.apache.org/jira/browse/PDFBOX-2045
 Project: PDFBox
  Issue Type: Bug
  Components: AcroForm, Utilities
Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
Reporter: Gerhard Temper
 Attachments: specialpdf.pdf


 Merging attached PDF results in a PDF consisting only of the special PDF 
 ignoring all other PDFs without any error.
 Command line to reproduce the problem:
 java -jar pdfbox-app-1.8.4.jar PDFMerger page1.pdf specialpdf.pdf result.pdf





[jira] [Updated] (PDFBOX-2045) Merging PDFs has no effect

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2045:


Affects Version/s: 2.0.0
   1.8.6
   1.8.5

 Merging PDFs has no effect
 --

 Key: PDFBOX-2045
 URL: https://issues.apache.org/jira/browse/PDFBOX-2045
 Project: PDFBox
  Issue Type: Bug
  Components: AcroForm, Utilities
Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
Reporter: Gerhard Temper
 Attachments: specialpdf.pdf




[jira] [Updated] (PDFBOX-2045) Merging PDFs has no effect

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2045:


Description: 
Merging attached special PDF (a form) results in a PDF consisting only of the 
PDF form ignoring all other PDFs without any error.

Command line to reproduce the problem:
java -jar pdfbox-app-1.8.4.jar PDFMerger page1.pdf specialpdf.pdf result.pdf

  was:
Merging attached PDF results in a PDF consisting only of the special PDF 
ignoring all other PDFs without any error.

Command line to reproduce the problem:
java -jar pdfbox-app-1.8.4.jar PDFMerger page1.pdf specialpdf.pdf result.pdf


 Merging PDFs has no effect
 --

 Key: PDFBOX-2045
 URL: https://issues.apache.org/jira/browse/PDFBOX-2045
 Project: PDFBox
  Issue Type: Bug
  Components: AcroForm, Utilities
Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
Reporter: Gerhard Temper
 Attachments: specialpdf.pdf




Re: 1.8.5 and JIRA

2014-05-03 Thread Andreas Lehmkuehler

Hi,

On 03.05.2014 07:45, Tilman Hausherr wrote:

Hello Andreas,

Thanks for all your work; only one thing is missing: 1.8.5 is still listed as
an unreleased version in JIRA, e.g. here:
https://issues.apache.org/jira/browse/PDFBOX/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

I wanted to wait until the release is officially announced. Thanks for the 
reminder.


Tilman

On 02.05.2014 09:27, Andreas Lehmkühler wrote:

Hi,

due to the newest PDFBox 1.8.5 release I've closed all 1.8.5-related issues
in a bulk operation. I've disabled the email notification to avoid an email
flood.
I've also added the all-new version 1.8.6 for our next bugfix release ...

I'll update the download page once the mirrors have copied the version from our
repository.

BR
Andreas Lehmkühler




BR
Andreas Lehmkühler


[jira] [Commented] (PDFBOX-2045) Merging PDFs has no effect

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988772#comment-13988772
 ] 

Tilman Hausherr commented on PDFBOX-2045:
-

I slightly clarified your text above to mention that it is a form.

The weird thing is that PDFBox renders all pages, and so does GSView. Only 
Acrobat Viewer doesn't.

 Merging PDFs has no effect
 --

 Key: PDFBOX-2045
 URL: https://issues.apache.org/jira/browse/PDFBOX-2045
 Project: PDFBox
  Issue Type: Bug
  Components: AcroForm, Utilities
Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
Reporter: Gerhard Temper
 Attachments: specialpdf.pdf




[jira] [Commented] (PDFBOX-2045) Merging PDFs has no effect

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988774#comment-13988774
 ] 

Tilman Hausherr commented on PDFBOX-2045:
-

I tried your test with the file at
http://www.studentenwerk-berlin.de/wohnen/dokumente/41%20%7C%20Anmeldung.pdf
which has a form on pages 3 and 4. After merging, all pages can be displayed 
with Acrobat, but the form capability is lost.

 Merging PDFs has no effect
 --

 Key: PDFBOX-2045
 URL: https://issues.apache.org/jira/browse/PDFBOX-2045
 Project: PDFBox
  Issue Type: Bug
  Components: AcroForm, Utilities
Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
Reporter: Gerhard Temper
 Attachments: specialpdf.pdf




[jira] [Closed] (PDFBOX-2033) Narrow long pdf is printed blank

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-2033.
---

Resolution: Not a Problem

Thanks, closing this. In retrospect, this was more of a how-to question than a 
bug.

 Narrow long pdf is printed blank
 

 Key: PDFBOX-2033
 URL: https://issues.apache.org/jira/browse/PDFBOX-2033
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 2.0.0
 Environment: W7
Reporter: Tilman Hausherr

 Based on the post of Norbert Sándor to the user list:
 When printing a 198.425 x 1700.787 sized page (70 x 600 mm) using new 
 PDFPrinter(pdfDocument, printJob).silentPrint() to a virtual printer (e.g. 
 PDFCreator or CIB), then the resulting PDF has a page size of 8,26x11,69 and 
 the content is horizontally centered on the page. I was able to reproduce the 
 problem, and also to print on a virtual printer that creates new PDFs of that 
 size, by using the longest constructor of PDFPrinter(), but then the output 
 is blank.
 While it doesn't seem useful to print a PDF to a PDF, the problem might make 
 sense when printing to a cash register receipt printer.
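The page dimensions quoted above follow from the usual point-to-millimetre conversion, sketched here under the assumption of the default PDF user unit of 1/72 inch. `PageUnits` and its methods are hypothetical helpers and not part of PDFBox.

```java
// Hypothetical unit helpers; assumes the default user unit of 1/72 inch.
final class PageUnits {
    private static final double POINTS_PER_INCH = 72.0;
    private static final double MM_PER_INCH = 25.4;

    // Millimetres to PDF points.
    static double mmToPoints(double mm) {
        return mm * POINTS_PER_INCH / MM_PER_INCH;
    }

    // PDF points to inches.
    static double pointsToInches(double points) {
        return points / POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        System.out.printf("%.3f x %.3f pt%n", mmToPoints(70), mmToPoints(600));
    }
}
```

Converting 70 mm and 600 mm this way gives roughly 198.425 and 1700.787 points, the size named in the report. The 8,26 x 11,69 output size is close to A4 (210 x 297 mm) expressed in inches, which would be consistent with the virtual printer falling back to its default paper, though the thread does not confirm this.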





[jira] [Updated] (PDFBOX-1961) Page with annotations renders fine with 1.8 but not with 2.0

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-1961:


Labels: Annotations regression  (was: regression)

 Page with annotations renders fine with 1.8 but not with 2.0
 

 Key: PDFBOX-1961
 URL: https://issues.apache.org/jira/browse/PDFBOX-1961
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
  Labels: Annotations, regression
 Fix For: 2.0.0

 Attachments: annots.pdf, annots.pdf-2-v18.png, annots.pdf-2-v2.png


 Page 2 of the attached PDF (from a ghostscript installation) renders fine 
 with 1.8 but not with 2.0. The other pages are not rendered properly with any 
 version.





[jira] [Commented] (PDFBOX-2014) PDAnnotationLink

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988811#comment-13988811
 ] 

Tilman Hausherr commented on PDFBOX-2014:
-

Do you have a sample PDF that has what you want?

 PDAnnotationLink 
 -

 Key: PDFBOX-2014
 URL: https://issues.apache.org/jira/browse/PDFBOX-2014
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 1.6.0
Reporter: WALID CHARFI

 Hi,
 I want to draw a text link without any hover effect or solid border.
 I tried this code but it does not work.
 Could you provide me with a solution, please? Thank you very much.
 PDBorderStyleDictionary borderULine = new PDBorderStyleDictionary();
 borderULine.setStyle(PDBorderStyleDictionary.STYLE_INSET); 
 PDAnnotationLink txtLink = new PDAnnotationLink();
 txtLink.setRectangle(position);
 PDActionURI action = new PDActionURI();
 action.setURI(pdfPara.getUri());
 txtLink.setAction(action);
 txtLink.setBorderStyle(borderULine);
 txtLink.setHighlightMode(PDAnnotationLink.HIGHLIGHT_MODE_NONE);





[jira] [Closed] (PDFBOX-167) wrong words highlighted

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-167.
--

Resolution: Cannot Reproduce

In October 2013, I e-mailed both people mentioned in this issue:
{quote}
Is this still an issue? I looked at the code and it is different than the one 
mentioned. But I can't test the code mentioned because the links are broken.
{quote}
I never got a response. I am thus closing this issue.

 wrong words highlighted
 ---

 Key: PDFBOX-167
 URL: https://issues.apache.org/jira/browse/PDFBOX-167
 Project: PDFBox
  Issue Type: Bug
Priority: Minor

 [imported from SourceForge]
 http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1487217
 Originally submitted by nobody on 2006-05-12 01:51.
 PDFBox appears to have a problem properly highlighting
 words from the following PDF. I am using a very simple
 servlet to do this, and it works fine for most PDFs.
 With this one, however, it highlights the wrong words.
 Unfortunately I am not smart enough to figure out what
 is going on myself, so could anybody help me with this?
 The files can be found here:
 http://www.impressie.nl/matthijs/PDFHighlight.java
 http://www.impressie.nl/matthijs/Rectificatie%20van%20Richtlijn%20Handhaving%20van%20Intellectuele-eigendomsrechten.pdf
 Matthijs Bierman
 matth...@impressie.nl
 [comment on SourceForge]
 Originally sent by nobody.
 Logged In: NO 
 That document is in a password-protected area, so it can't be read by anyone 
 else! I have a similar problem with this doc:
 http://www.usc.edu/schools/business/FBE/seminars/papers/AE_4-28-06_FISMAN-parking.pdf
 ... but I think I've figured this one out. The second page of this document 
 is entirely blank, and checking by hand I can see that the highlights after 
 p1 are all in positions that would be correct if they were one page further 
 on; it appears that the page count isn't being incremented for the blank 
 page. Tracing this back in the code I see this:
 PDStream contentStream = nextPage.getContents();
 if( contentStream != null )
 {
 COSStream contents = contentStream.getStream();
 processPage( nextPage, contents );
 }
 (PDFTextStripper.java line 255). That's skipping the blank page and giving me 
 the wrong page no, I think - and I guess that the problem can be resolved by 
 moving currentPageNo++ from inside processPage to just above that test.
 -- brian.ew...@gmail.com
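The off-by-one described in this comment can be illustrated with a small, self-contained sketch. It does not use PDFBox: the `Page` class and both methods are hypothetical stand-ins, and a `null` content models the blank page. It only shows why incrementing the counter inside the content check shifts every later page number, and how moving the increment above the check avoids that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PageCounterSketch {
    /** Hypothetical stand-in for a PDF page; null content models a blank page. */
    static final class Page {
        final String content;
        Page(String content) { this.content = content; }
    }

    /** Counter advances only for pages with content, so blank pages are not counted. */
    static List<Integer> pageNumbersBuggy(List<Page> pages) {
        List<Integer> processed = new ArrayList<>();
        int currentPageNo = 0;
        for (Page p : pages) {
            if (p.content != null) {
                currentPageNo++;              // increment happens inside the check
                processed.add(currentPageNo);
            }
        }
        return processed;
    }

    /** Counter advances for every page, before the content check. */
    static List<Integer> pageNumbersFixed(List<Page> pages) {
        List<Integer> processed = new ArrayList<>();
        int currentPageNo = 0;
        for (Page p : pages) {
            currentPageNo++;                  // increment moved above the check
            if (p.content != null) {
                processed.add(currentPageNo);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        // Three pages, the second one blank.
        List<Page> doc = Arrays.asList(new Page("p1"), new Page(null), new Page("p3"));
        System.out.println(pageNumbersBuggy(doc)); // [1, 2]
        System.out.println(pageNumbersFixed(doc)); // [1, 3]
    }
}
```

In the buggy variant the third page is reported as page 2, which matches the symptom of highlights landing one page too early after a blank page.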





[jira] [Commented] (PDFBOX-1731) Converting pdf to Image

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988814#comment-13988814
 ] 

Tilman Hausherr commented on PDFBOX-1731:
-

[~paulocamargomello] does your problem still happen with the current (just 
released) version? We have lessened the memory footprint somewhat.

 Converting pdf to Image
 ---

 Key: PDFBOX-1731
 URL: https://issues.apache.org/jira/browse/PDFBOX-1731
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 1.8.2
 Environment: Windows 8  and Linux 
 JDK 1.7
Reporter: Paulo R C Mello Junior
  Labels: newbie

 I'm trying to convert a pdf page to image but an exception occurs:
 17:28:20,652 ERROR [org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap] 
 (Thread-69) Something went wrong ... the pixelmap doesn't contain any data.
 17:28:20,654 WARN  [org.apache.pdfbox.util.operator.pagedrawer.Invoke] 
 (Thread-69) getRGBImage returned NULL
 17:28:20,661 INFO  [org.apache.pdfbox.util.PDFStreamEngine] (Thread-69) 
 unsupported/disabled operation: i
 17:28:36,809 ERROR [stderr] (Thread-70) Exception in thread "Thread-70" 
 java.lang.OutOfMemoryError: Java heap space
 17:28:36,811 ERROR [stderr] (Thread-70)   at 
 java.awt.image.DataBufferByte.<init>(DataBufferByte.java:92)
 17:28:36,812 ERROR [stderr] (Thread-70)   at 
 java.awt.image.ComponentSampleModel.createDataBuffer(ComponentSampleModel.java:415)
 17:28:36,814 ERROR [stderr] (Thread-70)   at 
 java.awt.image.Raster.createWritableRaster(Raster.java:941)
 17:28:36,814 ERROR [stderr] (Thread-70)   at 
 javax.imageio.ImageTypeSpecifier.createBufferedImage(ImageTypeSpecifier.java:1073)
 17:28:36,815 ERROR [stderr] (Thread-70)   at 
 javax.imageio.ImageReader.getDestination(ImageReader.java:2896)
 17:28:36,816 ERROR [stderr] (Thread-70)   at 
 com.sun.imageio.plugins.jpeg.JPEGImageReader.readInternal(JPEGImageReader.java:1066)
 17:28:36,817 ERROR [stderr] (Thread-70)   at 
 com.sun.imageio.plugins.jpeg.JPEGImageReader.read(JPEGImageReader.java:1034)
 17:28:36,818 ERROR [stderr] (Thread-70)   at 
 javax.imageio.ImageIO.read(ImageIO.java:1448)
 17:28:36,818 ERROR [stderr] (Thread-70)   at 
 javax.imageio.ImageIO.read(ImageIO.java:1352)
 17:28:36,819 ERROR [stderr] (Thread-70)   at 
 org.apache.pdfbox.pdmodel.graphics.xobject.PDJpeg.getRGBImage(PDJpeg.java:264)
 17:28:36,820 ERROR [stderr] (Thread-70)   at 
 org.apache.pdfbox.util.operator.pagedrawer.Invoke.process(Invoke.java:83)
 17:28:36,821 ERROR [stderr] (Thread-70)   at 
 org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
 17:28:36,823 ERROR [stderr] (Thread-70)   at 
 org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
 17:28:36,824 ERROR [stderr] (Thread-70)   at 
 org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
 17:28:36,825 ERROR [stderr] (Thread-70)   at 
 org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
 17:28:36,826 ERROR [stderr] (Thread-70)   at 
 org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:125)
 17:28:36,827 ERROR [stderr] (Thread-70)   at 
 org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:769)
 My code:
 public static List<BufferedImage> getPdfPagesAsImages(String pdfPath)
         throws IOException {
     File f = new File(pdfPath);
     PDDocument pdfDocument = null;
     pdfDocument = PDDocument.loadNonSeq(f, null);
     List<BufferedImage> bImages = new ArrayList<BufferedImage>();
     try {
         System.out.println(pdfPath);
         int resolution = 185;
         if (pdfDocument != null) {
             @SuppressWarnings("unchecked")
             List<PDPage> pages = (List<PDPage>) pdfDocument
                     .getDocumentCatalog().getAllPages();
             for (PDPage p : pages) {
                 BufferedImage convertedImage = p.convertToImage(
                         BufferedImage.TYPE_INT_RGB, resolution);
                 if (isNegativeImage(convertedImage)) {
                     bImages.add(invertNegativeImage(convertedImage));
                 } else {
                     bImages.add(convertedImage);
                 }
             }
         }
     } catch (FileNotFoundException e) {
         e.printStackTrace();
         e.getMessage();
         e.getCause();
     }

[jira] [Closed] (PDFBOX-9) HTML - PDF

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-9?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-9.


Resolution: Won't Fix

This is beyond the scope of PDFBox. You can print your HTML to a virtual 
printer that will create a PDF.

 HTML - PDF
 ---

 Key: PDFBOX-9
 URL: https://issues.apache.org/jira/browse/PDFBOX-9
 Project: PDFBox
  Issue Type: New Feature
  Components: Utilities

 [imported from SourceForge]
 http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=841169
 Originally submitted by nobody on 2003-11-12 20:03.
 It would be really nice to take a html and create a PDF 
 from it.





[jira] [Closed] (PDFBOX-6) PDF to HTML conversion

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-6?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-6.


Resolution: Implemented

This was implemented long ago by John J. Barton in PDFText2HTML.java.

 PDF to HTML conversion
 --

 Key: PDFBOX-6
 URL: https://issues.apache.org/jira/browse/PDFBOX-6
 Project: PDFBox
  Issue Type: New Feature
  Components: Utilities

 [imported from SourceForge]
 http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=802407
 Originally submitted by winstanley_john on 2003-09-08 04:42.
 PDF to HTML conversion. 
 Preserve formatting.
 check out www.sourceforge.net/projects/pdftohtml 
 for a hack of this process.
 [comment on SourceForge]
 Originally sent by winstanley_john.
 Logged In: YES 
 user_id=747013
 Also conversion to xml or word etc would be amazing.





[jira] [Closed] (PDFBOX-45) Support incremental save

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-45.
-

Resolution: Fixed

This has been implemented in PDDocument.saveIncremental().

 Support incremental save
 

 Key: PDFBOX-45
 URL: https://issues.apache.org/jira/browse/PDFBOX-45
 Project: PDFBox
  Issue Type: New Feature
  Components: Writing

 [imported from SourceForge]
 http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1157431
 Originally submitted by purplish_cat on 2005-03-05 12:28.
 After opening a PDF file and changing objects in it, 
 allow saving the changes incrementally to the same file 
 instead of creating a completely new file.
 [comment on SourceForge]
 Originally sent by benlitchfield.
 Logged In: YES 
 user_id=601708
 See forum thread at
 https://sourceforge.net/forum/message.php?msg_id=3032112





[jira] [Commented] (PDFBOX-1845) PDDocument.load() give Error: Expected a long type at offset 1633

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988831#comment-13988831
 ] 

Tilman Hausherr commented on PDFBOX-1845:
-

I uncompressed the first PDF with qpdf and now PDFBox can process it. If 
[~david.keller] wants to render this file he won't like it, because the images 
are compressed with JPEG2000 and there's a bug in the plugin.

 PDDocument.load() give Error: Expected a long type at offset 1633
 -

 Key: PDFBOX-1845
 URL: https://issues.apache.org/jira/browse/PDFBOX-1845
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 1.8.0, 2.0.0
 Environment: Windows 8.1
Reporter: David KELLER
Priority: Blocker
 Attachments: 14 01 2014-2.pdf, 14 01 2014.pdf


 I ran this simple program with the attached file (a scanned OCR document 
 from Nuance Omnipage 18):
   public static void main(String[] args)
           throws Exception {
       System.out.println("Start SplitFileTest...");
       String path = "D:\\test\\batch\\scan_manual\\courrier\\david.keller\\";
       String pdfFile = path + "14 01 2014.pdf";

       FileInputStream pdfInputStream = new FileInputStream(pdfFile);

       PDDocument pdDocument = PDDocument.load(pdfInputStream);
       List<PDPage> pages = pdDocument.getDocumentCatalog().getAllPages();

       pdfInputStream.close();
   }
 With version 1.8.0 I get this error:
 java.io.IOException: Error: Expected an integer type, actual='12977[373'
 at 
 org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1622)
 at 
 org.apache.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:100)
 at 
 org.apache.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:604)
 at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
 at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1187)
 I have also just built 2.0.0 from the latest source code, and I get this 
 error:
  java.io.IOException: Error: Expected a long type at offset 1633
   at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1682)
   at 
 org.apache.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:100)
   at 
 org.apache.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:663)
   at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1101)
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1069)





[jira] [Commented] (PDFBOX-1122) Parsing Error, Skipping Object

2014-05-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988834#comment-13988834
 ] 

Rogério Pereira Araújo commented on PDFBOX-1122:


I'm trying to parse several ebooks on my local filesystem using Nutch, which 
uses Tika and PDFBox to do the parsing.

No matter which file I use, I always get the same error as described by 
Raihan.

Anyway, I'll attach one of them.

 Parsing Error, Skipping Object
 --

 Key: PDFBOX-1122
 URL: https://issues.apache.org/jira/browse/PDFBOX-1122
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 1.6.0
 Environment: Working with Windows 7 in eclipse.
Reporter: Raihan Jamal
Assignee: Andreas Lehmkühler
  Labels: pdfbox
   Original Estimate: 336h
  Remaining Estimate: 336h

 Parsing Error, Skipping Object
 java.io.IOException: expected='endstream' actual='' 
 org.apache.pdfbox.io.PushBackInputStream@38011d45
   at 
 org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:439)
   at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552)
   at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184)
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1088)
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1053)
   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
   at org.apache.tika.Tika.parseToString(Tika.java:357)
   at 
 edu.uci.ics.crawler4j.crawler.BinaryParser.parse(BinaryParser.java:37)
   at 
 edu.uci.ics.crawler4j.crawler.WebCrawler.handleBinary(WebCrawler.java:223)
   at 
 edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:462)
   at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:129)
   at java.lang.Thread.run(Thread.java:662)
 Did not found XRef object at specified startxref position 0
 This is the sample URL where I am facing this problem:-
 http://www.qualcomm.com/documents/files/rev-b-enhanced-mobile-broadband-for-all.pdf
 Any suggestions as to why this is happening? Or is it a bug?





[jira] [Commented] (PDFBOX-1122) Parsing Error, Skipping Object

2014-05-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988836#comment-13988836
 ] 

Rogério Pereira Araújo commented on PDFBOX-1122:


I couldn't attach the PDF to the ticket, but here's the link:

https://dl.dropboxusercontent.com/u/13175227/c-programming-a-modern-approach-2nd-edition.9780393979503.52279.pdf



[jira] [Reopened] (PDFBOX-45) Support incremental save

2014-05-03 Thread Thomas Chojecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Chojecki reopened PDFBOX-45:
---


The saveIncremental(...) method was a first attempt at this, but unfortunately 
it only works for signatures. The recursive writer makes this feature hard to 
implement, because it always starts with the catalog; if, for example, a new 
page was added, all elements on the way to this new page need to be written 
again.

So we need an extra writer for this task that does not work recursively and 
instead uses, say, a set or a map of the objects that need to be written. 
Alternatively, the writer could always iterate over all objects and write only 
the new ones.

I would prefer a new writer, because it seems cleaner to write objects from a 
collection than to iterate through all objects looking for new ones and 
possibly end up in loops.

So it's a bigger task for version 2.0.
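
The collection-based writer preferred above could look roughly like the following sketch. It is not PDFBox code: `Obj`, `markChanged` and `writeIncrement` are hypothetical names, and "writing" is reduced to returning object ids. The point is only that a set of changed objects is written as-is, without walking the object graph from the catalog, so back-references (a page pointing to its parent) cannot cause loops.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class IncrementalWriterSketch {
    /** Hypothetical stand-in for an indirect PDF object. */
    static final class Obj {
        final int id;
        final List<Obj> refs = new ArrayList<>();
        Obj(int id) { this.id = id; }
    }

    // Objects that are new or modified since the last save, in insertion order.
    private final Set<Obj> pending = new LinkedHashSet<>();

    /** Registers an object as new or modified; duplicates are ignored. */
    void markChanged(Obj o) {
        pending.add(o);
    }

    /** "Writes" only the pending objects and returns their ids in write order. */
    List<Integer> writeIncrement() {
        List<Integer> written = new ArrayList<>();
        for (Obj o : pending) {
            written.add(o.id);   // references in o.refs are not followed
        }
        pending.clear();
        return written;
    }

    public static void main(String[] args) {
        Obj catalog = new Obj(1);
        Obj pages = new Obj(2);
        Obj page = new Obj(3);
        catalog.refs.add(pages);
        pages.refs.add(page);
        page.refs.add(pages);        // back-reference to the parent: a cycle

        IncrementalWriterSketch writer = new IncrementalWriterSketch();
        Obj newPage = new Obj(4);    // a page added after loading
        pages.refs.add(newPage);
        newPage.refs.add(pages);
        writer.markChanged(newPage);
        writer.markChanged(pages);   // the page tree node changed as well

        System.out.println(writer.writeIncrement()); // [4, 2]
        System.out.println(writer.writeIncrement()); // []
    }
}
```

The catalog and the untouched page are never written, which is the behaviour an incremental save needs; the cost is that every modification has to be registered explicitly.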



Re: [VOTE] Release Apache PDFBox 1.8.5

2014-05-03 Thread John Hewson
+1

-- John

On 28 Apr 2014, at 10:57, Andreas Lehmkuehler andr...@lehmi.de wrote:

 Hi,
 
 a candidate for the PDFBox 1.8.5 release is available at:
 
http://people.apache.org/~lehmi/pdfbox/1.8.5/
 
 The release candidate is a zip archive of the sources in:
 
http://svn.apache.org/repos/asf/pdfbox/tags/1.8.5/
 
 The SHA1 checksum of the archive is fc01acc1e2575ff1f40e44e949a862fcae076029.
 
 Please vote on releasing this package as Apache PDFBox 1.8.5.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 PDFBox PMC votes are cast.
 
[ ] +1 Release this package as Apache PDFBox 1.8.5
[ ] -1 Do not release this package because...
 
 
 Here is my +1
 
 BR
 Andreas Lehmkühler



[jira] [Commented] (PDFBOX-1122) Parsing Error, Skipping Object

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988924#comment-13988924
 ] 

Tilman Hausherr commented on PDFBOX-1122:
-

I was able to parse that one with my own application, both with load() and 
loadNonSeq(), by setting -Xmx3g in the 2.0 version. With the current 1.8 
version, I could do it without modifications. Then I downloaded the 1.8.4 app 
and used the PDFReader command, and it also worked.
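
When a file only parses with a larger heap, it can help to confirm that the -Xmx setting actually reached the JVM that runs the parser, for example when PDFBox is embedded in another tool. A minimal sketch, independent of PDFBox; the class name `HeapCheckSketch` is made up for illustration:

```java
final class HeapCheckSketch {
    /** Returns the maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Start the JVM with the same options as the real application,
        // e.g. -Xmx3g, and compare the printed value with what was requested.
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

The reported value is usually slightly below the requested size, so compare orders of magnitude rather than exact numbers.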

How do you know that Apache Nutch is using 1.8.4? A look at their readme shows 
this:
https://www.apache.org/dist/nutch/2.2.1/CHANGES-2.2.1.txt
Upgrade to PDFBox 0.7.3.
And in NUTCH-1770, you write that it fails on all PDFs.
