[jira] [Commented] (PDFBOX-1122) Parsing Error, Skipping Object

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988925#comment-13988925
 ] 

Tilman Hausherr commented on PDFBOX-1122:
-

Please also attach the full stack trace. It can't be exactly the same as the 
first one in this issue, because it isn't the same version and there have been 
changes.

> Parsing Error, Skipping Object
> --
>
> Key: PDFBOX-1122
> URL: https://issues.apache.org/jira/browse/PDFBOX-1122
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.6.0
> Environment: Working with Windows 7 in eclipse.
> Reporter: Raihan Jamal
> Assignee: Andreas Lehmkühler
> Labels: pdfbox
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Parsing Error, Skipping Object
> java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@38011d45
>   at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:439)
>   at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552)
>   at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184)
>   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1088)
>   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1053)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>   at org.apache.tika.Tika.parseToString(Tika.java:357)
>   at edu.uci.ics.crawler4j.crawler.BinaryParser.parse(BinaryParser.java:37)
>   at edu.uci.ics.crawler4j.crawler.WebCrawler.handleBinary(WebCrawler.java:223)
>   at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:462)
>   at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:129)
>   at java.lang.Thread.run(Thread.java:662)
> Did not found XRef object at specified startxref position 0
> This is the sample URL where I am facing this problem:
> http://www.qualcomm.com/documents/files/rev-b-enhanced-mobile-broadband-for-all.pdf
> Any suggestions as to why this is happening? Or is it a bug?
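
The last log line in the report above says that no XRef object was found at
startxref position 0. As a rough illustration only, and not PDFBox's own
parsing code, the sketch below reads the startxref value from the tail of a
PDF held in memory. The class and method names are made up for this example.
An offset of 0 cannot point at a cross-reference section in a well-formed
file, because byte 0 is where the %PDF- header starts, so a value of 0
suggests that the pointer itself is wrong or missing.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of PDFBox.
class StartXrefProbe {
    /**
     * Returns the number that follows the last "startxref" keyword,
     * or -1 if the keyword or the number is missing.
     */
    static long findStartXref(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so indices stay byte offsets.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int i = keyword + "startxref".length();
        // Skip the end-of-line characters between the keyword and the offset.
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            i++;
        }
        if (i == start) {
            return -1; // keyword present, but no digits after it
        }
        return Long.parseLong(text.substring(start, i));
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n...\nstartxref\n0\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("startxref = " + findStartXref(sample));
    }
}
```

Running this against the bytes of a downloaded file (for example via
Files.readAllBytes) shows whether the file on disk already carries a bad
pointer before any PDFBox code is involved.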



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (PDFBOX-1122) Parsing Error, Skipping Object

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988924#comment-13988924
 ] 

Tilman Hausherr commented on PDFBOX-1122:
-

I was able to parse that one with my own application, both with load() and 
loadNonSeq(), by setting -Xmx3g in the 2.0 version. With the current 1.8 
version, I could do it without modifications. Then I downloaded the 1.8.4 app 
and used the PDFReader command, and it also worked.

How do you know that Apache Nutch is using 1.8.4? A look at their readme shows 
this:
https://www.apache.org/dist/nutch/2.2.1/CHANGES-2.2.1.txt
"Upgrade to PDFBox 0.7.3".
And in NUTCH-1770, you write that it fails on all PDFs.
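
One way to settle which PDFBox build an application such as Nutch actually
loads is to ask the JVM where the class came from. The sketch below uses only
JDK reflection. LibraryVersionProbe is a made-up helper name, and the reported
implementation version may be null if the jar's manifest does not carry one;
the jar location is usually the more telling part. It has to run on the same
classpath as the application being checked.

```java
import java.security.CodeSource;

// Hypothetical helper for illustration; not part of PDFBox, Tika or Nutch.
class LibraryVersionProbe {
    /** Describes the manifest version and the jar location of the named class. */
    static String describe(String className) {
        try {
            // 'false' avoids running static initializers; we only need the metadata.
            Class<?> c = Class.forName(className, false,
                    LibraryVersionProbe.class.getClassLoader());
            Package p = c.getPackage();
            String version = (p == null) ? null : p.getImplementationVersion();
            CodeSource src = c.getProtectionDomain().getCodeSource();
            String location = (src == null || src.getLocation() == null)
                    ? "unknown location"
                    : src.getLocation().toString();
            return className + ": version=" + version + ", loaded from " + location;
        } catch (ClassNotFoundException e) {
            return className + ": not on the classpath";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("org.apache.pdfbox.pdmodel.PDDocument"));
    }
}
```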



Re: [VOTE] Release Apache PDFBox 1.8.5

2014-05-03 Thread John Hewson
+1

-- John

On 28 Apr 2014, at 10:57, Andreas Lehmkuehler  wrote:

> Hi,
> 
> a candidate for the PDFBox 1.8.5 release is available at:
> 
>http://people.apache.org/~lehmi/pdfbox/1.8.5/
> 
> The release candidate is a zip archive of the sources in:
> 
>http://svn.apache.org/repos/asf/pdfbox/tags/1.8.5/
> 
> The SHA1 checksum of the archive is fc01acc1e2575ff1f40e44e949a862fcae076029.
> 
> Please vote on releasing this package as Apache PDFBox 1.8.5.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 PDFBox PMC votes are cast.
> 
>[ ] +1 Release this package as Apache PDFBox 1.8.5
>[ ] -1 Do not release this package because...
> 
> 
> Here is my +1
> 
> BR
> Andreas Lehmkühler
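
A voter who wants to check a downloaded archive against the SHA1 checksum
quoted above could do so with the JDK alone. This is a minimal sketch;
Sha1Check is a made-up class name, and the archive path is passed on the
command line because the mail does not name the file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration.
class Sha1Check {
    /** Returns the SHA-1 digest of the data as lowercase hex. */
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Compares the file's SHA-1 with the expected hex string, ignoring case. */
    static boolean matches(String file, String expectedHex) {
        try {
            byte[] content = Files.readAllBytes(Paths.get(file));
            return sha1Hex(content).equalsIgnoreCase(expectedHex.trim());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.out.println("usage: Sha1Check <archive> <expected-sha1>");
            return;
        }
        System.out.println(matches(args[0], args[1]) ? "checksum OK" : "checksum MISMATCH");
    }
}
```

The second argument would be the checksum from the vote mail,
fc01acc1e2575ff1f40e44e949a862fcae076029.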



[jira] [Reopened] (PDFBOX-45) Support incremental save

2014-05-03 Thread Thomas Chojecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Chojecki reopened PDFBOX-45:
---


The saveIncremental(...) method was a first attempt at this, but unfortunately 
it only works for signatures. The recursive writer makes it hard to implement 
this feature, because it always starts with the catalog, and if, for example, 
a new page was added, all elements on the way to this new page need to be 
written again.

So we need an extra writer for this task that does not work recursively and 
instead uses, say, a set or a map of the objects that need to be written. 
Alternatively, the writer could always iterate over all objects and write only 
the new ones.

I would prefer a new writer, because it seems cleaner to write objects from a 
collection than to iterate through all objects looking for new ones and 
possibly end up in loops.

So it's a bigger task for version 2.0.
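
A minimal sketch of the collection-based writer preferred above, with
hypothetical names throughout: ChangeTrackingWriter, markChanged and
writeIncrement are not PDFBox classes or methods. Changed objects are
registered in a map as they are modified, and the writer emits only those
objects without walking the tree from the catalog. A real incremental update
would also append a new cross-reference section and a trailer whose /Prev
entry points at the previous one; that part is omitted here.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; object bodies are plain strings to keep it self-contained.
class ChangeTrackingWriter {
    // Object number -> serialized body of every object that must be (re)written.
    private final Map<Integer, String> changed = new TreeMap<>();

    /** Registers a new or modified object; a later call for the same number wins. */
    void markChanged(int objectNumber, String body) {
        changed.put(objectNumber, body);
    }

    /** Writes only the registered objects, in object-number order, with no recursion. */
    String writeIncrement() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Integer, String> e : changed.entrySet()) {
            // Generation number 0 is assumed for every object in this sketch.
            out.append(e.getKey()).append(" 0 obj\n")
               .append(e.getValue()).append("\nendobj\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        ChangeTrackingWriter writer = new ChangeTrackingWriter();
        // Adding a page touches two objects: the new page and its parent page tree node.
        writer.markChanged(7, "<< /Type /Page /Parent 2 0 R >>");
        writer.markChanged(2, "<< /Type /Pages /Kids [3 0 R 7 0 R] /Count 2 >>");
        System.out.print(writer.writeIncrement());
    }
}
```

Because the map is the only thing the writer iterates over, there is no
traversal that could loop, which is the concern raised about iterating
through all objects.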

> Support incremental save
> 
>
> Key: PDFBOX-45
> URL: https://issues.apache.org/jira/browse/PDFBOX-45
> Project: PDFBox
> Issue Type: New Feature
> Components: Writing
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1157431
> Originally submitted by purplish_cat on 2005-03-05 12:28.
> After opening a PDF file and changing objects in it, 
> allow saving the changes incrementally to the same file 
> instead of creating a completely new file.
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> See forum thread at
> https://sourceforge.net/forum/message.php?msg_id=3032112





[jira] [Commented] (PDFBOX-1122) Parsing Error, Skipping Object

2014-05-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988836#comment-13988836
 ] 

Rogério Pereira Araújo commented on PDFBOX-1122:


I couldn't attach the PDF to the ticket, but here's the link:

https://dl.dropboxusercontent.com/u/13175227/c-programming-a-modern-approach-2nd-edition.9780393979503.52279.pdf



[jira] [Commented] (PDFBOX-1122) Parsing Error, Skipping Object

2014-05-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988834#comment-13988834
 ] 

Rogério Pereira Araújo commented on PDFBOX-1122:


I'm trying to parse several ebooks in my local filesystem using nutch, which 
makes use of tika and pdfbox to do the parsing.

No matter which file I use, I'm always getting the same error as described by 
Raihan.

Anyway, I'll be attaching one of them.



[jira] [Commented] (PDFBOX-1845) PDDocument.load() give Error: Expected a long type at offset 1633

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988831#comment-13988831
 ] 

Tilman Hausherr commented on PDFBOX-1845:
-

I uncompressed the first PDF with qpdf and now PDFBox can process it. If 
[~david.keller] wants to render this file he won't like it, because the images 
are compressed with JPEG2000 and there's a bug in the plugin.
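
The comment does not say which qpdf options were used. One plausible
invocation, shown here as an assumption, combines qpdf's
--stream-data=uncompress and --object-streams=disable options. The second one
rewrites objects held in object streams as ordinary objects, which is relevant
because both stack traces in the report fail inside PDFObjectStreamParser. The
Java wrapper and the file names are hypothetical, and qpdf must be on the PATH.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper for illustration; the qpdf options are an assumption.
class QpdfUncompress {
    /** Builds the qpdf command line that rewrites input as an uncompressed output file. */
    static List<String> command(String input, String output) {
        return Arrays.asList(
                "qpdf",
                "--stream-data=uncompress",   // store stream data uncompressed
                "--object-streams=disable",   // write objects outside object streams
                input,
                output);
    }

    /** Runs qpdf and returns its exit code. */
    static int run(String input, String output) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(input, output)).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("usage: QpdfUncompress <input.pdf> <output.pdf>");
            return;
        }
        System.out.println("qpdf exit code: " + run(args[0], args[1]));
    }
}
```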

> PDDocument.load() give Error: Expected a long type at offset 1633
> -
>
> Key: PDFBOX-1845
> URL: https://issues.apache.org/jira/browse/PDFBOX-1845
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.8.0, 2.0.0
> Environment: Windows 8.1
> Reporter: David KELLER
> Priority: Blocker
> Attachments: 14 01 2014-2.pdf, 14 01 2014.pdf
>
>
> I run this simple program with the attached file (a scanned OCR document 
> from Nuance Omnipage 18):
>   public static void main(String[] args) throws Exception {
>       System.out.println("Start SplitFileTest...");
>       String path = "D:\\test\\batch\\scan_manual\\courrier\\david.keller\\";
>       String pdfFile = path + "14 01 2014.pdf";
>
>       FileInputStream pdfInputStream = new FileInputStream(pdfFile);
>
>       PDDocument pdDocument = PDDocument.load(pdfInputStream);
>       List pages = pdDocument.getDocumentCatalog().getAllPages();
>
>       pdfInputStream.close();
>   }
> With version 1.8.0 I get this error:
> java.io.IOException: Error: Expected an integer type, actual='12977[373'
>   at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1622)
>   at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:100)
>   at org.apache.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:604)
>   at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
>   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1187)
> I have also just built 2.0.0 from the latest source code, and I get this 
> error:
> java.io.IOException: Error: Expected a long type at offset 1633
>   at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1682)
>   at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:100)
>   at org.apache.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:663)
>   at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
>   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1101)
>   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1069)





[jira] [Closed] (PDFBOX-45) Support incremental save

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-45.
-

Resolution: Fixed

This has been implemented in PDDocument.saveIncremental().



[jira] [Closed] (PDFBOX-6) PDF to HTML conversion

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-6?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-6.


Resolution: Implemented

This has been implemented long ago by John J. Barton in PDFText2HTML.java.

> PDF to HTML conversion
> --
>
> Key: PDFBOX-6
> URL: https://issues.apache.org/jira/browse/PDFBOX-6
> Project: PDFBox
> Issue Type: New Feature
> Components: Utilities
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=802407
> Originally submitted by winstanley_john on 2003-09-08 04:42.
> PDF to HTML conversion. 
> Conserve formatting.
> check out www.sourceforge.net/projects/pdftohtml 
> for a hack of this process.
> [comment on SourceForge]
> Originally sent by winstanley_john.
> Logged In: YES 
> user_id=747013
> Also conversion to xml or word etc would be amazing.





[jira] [Closed] (PDFBOX-9) HTML -> PDF

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-9?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-9.


Resolution: Won't Fix

This is beyond the scope of PDFBox. You can print your HTML to a virtual 
printer that will create a PDF.

> HTML -> PDF
> ---
>
> Key: PDFBOX-9
> URL: https://issues.apache.org/jira/browse/PDFBOX-9
> Project: PDFBox
> Issue Type: New Feature
> Components: Utilities
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=841169
> Originally submitted by nobody on 2003-11-12 20:03.
> It would be really nice to take an HTML file and create a PDF 
> from it.





[jira] [Commented] (PDFBOX-1731) Converting pdf to Image

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988814#comment-13988814
 ] 

Tilman Hausherr commented on PDFBOX-1731:
-

[~paulocamargomello] does your problem still happen with the current (just 
released) version? We have lessened the memory footprint somewhat.
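
For context on the heap usage in the report quoted below: the reporter's code
renders every page at 185 dpi as TYPE_INT_RGB and keeps all results in a list.
The arithmetic can be sketched as follows, assuming an A4 page of 595 x 842
points, which the report does not state. RenderMemoryEstimate is a made-up
helper, and the rounding may differ from what PDFBox does internally. Note
that the OutOfMemoryError in the trace is thrown while ImageIO decodes an
embedded JPEG, whose buffer size depends on the image's own pixel dimensions
rather than on the render resolution, so this estimate covers only the page
rasters that the code retains.

```java
// Hypothetical helper for illustration; a back-of-the-envelope estimate only.
class RenderMemoryEstimate {
    /** Pixels needed for a length given in PDF points (1/72 inch) at the given dpi. */
    static int pixels(double points, int dpi) {
        return (int) Math.ceil(points * dpi / 72.0);
    }

    /** Approximate raster size of a TYPE_INT_RGB image: one 4-byte int per pixel. */
    static long intRgbBytes(double widthPt, double heightPt, int dpi) {
        return (long) pixels(widthPt, dpi) * pixels(heightPt, dpi) * 4L;
    }

    public static void main(String[] args) {
        long perPage = intRgbBytes(595, 842, 185); // assumed A4 page
        System.out.println("bytes per page: " + perPage);
        System.out.println("bytes for 100 retained pages: " + perPage * 100);
    }
}
```

Under these assumptions each retained page costs roughly 13 MB, so a list
holding 100 pages needs well over 1 GB of heap before any embedded image is
decoded. Processing and releasing one page at a time would avoid that growth.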

> Converting pdf to Image
> ---
>
> Key: PDFBOX-1731
> URL: https://issues.apache.org/jira/browse/PDFBOX-1731
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 1.8.2
> Environment: Windows 8 and Linux, JDK 1.7
> Reporter: Paulo R C Mello Junior
> Labels: newbie
>
> I'm trying to convert a pdf page to image but an exception occurs:
> 17:28:20,652 ERROR [org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap] (Thread-69) Something went wrong ... the pixelmap doesn't contain any data.
> 17:28:20,654 WARN  [org.apache.pdfbox.util.operator.pagedrawer.Invoke] (Thread-69) getRGBImage returned NULL
> 17:28:20,661 INFO  [org.apache.pdfbox.util.PDFStreamEngine] (Thread-69) unsupported/disabled operation: i
> 17:28:36,809 ERROR [stderr] (Thread-70) Exception in thread "Thread-70" java.lang.OutOfMemoryError: Java heap space
> 17:28:36,811 ERROR [stderr] (Thread-70)   at java.awt.image.DataBufferByte.<init>(DataBufferByte.java:92)
> 17:28:36,812 ERROR [stderr] (Thread-70)   at java.awt.image.ComponentSampleModel.createDataBuffer(ComponentSampleModel.java:415)
> 17:28:36,814 ERROR [stderr] (Thread-70)   at java.awt.image.Raster.createWritableRaster(Raster.java:941)
> 17:28:36,814 ERROR [stderr] (Thread-70)   at javax.imageio.ImageTypeSpecifier.createBufferedImage(ImageTypeSpecifier.java:1073)
> 17:28:36,815 ERROR [stderr] (Thread-70)   at javax.imageio.ImageReader.getDestination(ImageReader.java:2896)
> 17:28:36,816 ERROR [stderr] (Thread-70)   at com.sun.imageio.plugins.jpeg.JPEGImageReader.readInternal(JPEGImageReader.java:1066)
> 17:28:36,817 ERROR [stderr] (Thread-70)   at com.sun.imageio.plugins.jpeg.JPEGImageReader.read(JPEGImageReader.java:1034)
> 17:28:36,818 ERROR [stderr] (Thread-70)   at javax.imageio.ImageIO.read(ImageIO.java:1448)
> 17:28:36,818 ERROR [stderr] (Thread-70)   at javax.imageio.ImageIO.read(ImageIO.java:1352)
> 17:28:36,819 ERROR [stderr] (Thread-70)   at org.apache.pdfbox.pdmodel.graphics.xobject.PDJpeg.getRGBImage(PDJpeg.java:264)
> 17:28:36,820 ERROR [stderr] (Thread-70)   at org.apache.pdfbox.util.operator.pagedrawer.Invoke.process(Invoke.java:83)
> 17:28:36,821 ERROR [stderr] (Thread-70)   at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
> 17:28:36,823 ERROR [stderr] (Thread-70)   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
> 17:28:36,824 ERROR [stderr] (Thread-70)   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
> 17:28:36,825 ERROR [stderr] (Thread-70)   at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
> 17:28:36,826 ERROR [stderr] (Thread-70)   at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:125)
> 17:28:36,827 ERROR [stderr] (Thread-70)   at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:769)
> My code:
> public static List getPdfPagesAsImages(String pdfPath)
>         throws IOException {
>     File f = new File(pdfPath);
>     PDDocument pdfDocument = null;
>     pdfDocument = PDDocument.loadNonSeq(f, null);
>     List bImages = new ArrayList();
>     try {
>         System.out.println(pdfPath);
>         int resolution = 185;
>         if (pdfDocument != null) {
>             @SuppressWarnings("unchecked")
>             List<PDPage> pages = (List<PDPage>) pdfDocument
>                     .getDocumentCatalog().getAllPages();
>             for (PDPage p : pages) {
>                 BufferedImage convertedImage = p.convertToImage(
>                         BufferedImage.TYPE_INT_RGB, resolution);
>                 if (isNegativeImage(convertedImage)) {
>                     bImages.add(invertNegativeImage(convertedImage));
>                 } else {
>                     bImages.add(convertedImage);
>                 }
>             }
>         }
>     } catch (FileNotFoundException e) {
>         e.printStackTrace();
>         e.getMessage();
> 

[jira] [Closed] (PDFBOX-167) wrong words highlighted

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-167.
--

Resolution: Cannot Reproduce

In October 2013, I e-mailed both people mentioned in this issue:
{quote}
Is this still an issue? I looked at the code and it is different than the one 
mentioned. But I can't test the code mentioned because the links are broken.
{quote}
I never got a response. I am thus closing this issue.

> wrong words highlighted
> ---
>
> Key: PDFBOX-167
> URL: https://issues.apache.org/jira/browse/PDFBOX-167
> Project: PDFBox
> Issue Type: Bug
> Priority: Minor
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1487217
> Originally submitted by nobody on 2006-05-12 01:51.
> PDFBox appears to have a problem properly highlighting
> words from the following PDF. I am using a very simple
> servlet to do this, and it works fine for most PDFs.
> With this one, however, it highlights the wrong words.
> Unfortunately I am not smart enough to figure out what
> is going on myself, so could anybody help me with this?
> The files can be found here:
> http://www.impressie.nl/matthijs/PDFHighlight.java
> http://www.impressie.nl/matthijs/Rectificatie%20van%20Richtlijn%20Handhaving%20van%20Intellectuele-eigendomsrechten.pdf
> Matthijs Bierman
> matth...@impressie.nl
> [comment on SourceForge]
> Originally sent by nobody.
> Logged In: NO 
> That document is in a password-protected area, so it can't be read by anyone 
> else! I have a similar problem with this doc:
> http://www.usc.edu/schools/business/FBE/seminars/papers/AE_4-28-06_FISMAN-parking.pdf
> ... but I think I've figured this one out. The second page of this document 
> is entirely blank, and checking by hand I can see that the highlights after 
> p1 are all in positions that would be correct if they were one page further 
> on; it appears that the page count isn't being incremented for the blank 
> page. Tracing this back in the code I see this:
> PDStream contentStream = nextPage.getContents();
> if( contentStream != null )
> {
> COSStream contents = contentStream.getStream();
> processPage( nextPage, contents );
> }
> (PDFTextStripper.java line 255). That's skipping the blank page and giving me 
> the wrong page no, I think - and I guess that the problem can be resolved by 
> moving currentPageNo++ from inside processPage to just above that test.
> -- brian.ew...@gmail.com
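
The fix proposed in the SourceForge comment above can be illustrated without
PDFBox. In the sketch below, which uses made-up names and plain strings in
place of content streams, a null entry stands for a blank page without a
content stream. With the counter incremented only while processing a page,
every page after the blank one is reported one too low; incrementing before
the null test keeps the numbers aligned with the document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; it models only the counting logic, not PDFTextStripper.
class PageCounterSketch {
    /**
     * Returns the page number reported for each page that has content.
     * A null entry marks a page without a content stream.
     */
    static List<Integer> pageNumbers(List<String> contents, boolean countBlankPages) {
        List<Integer> reported = new ArrayList<>();
        int currentPageNo = 0;
        for (String content : contents) {
            if (countBlankPages) {
                currentPageNo++;            // proposed: increment before the null test
            }
            if (content != null) {
                if (!countBlankPages) {
                    currentPageNo++;        // described behaviour: increment only when processing
                }
                reported.add(currentPageNo);
            }
        }
        return reported;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("page 1", null, "page 3");
        System.out.println("described behaviour: " + pageNumbers(pages, false));
        System.out.println("proposed fix:        " + pageNumbers(pages, true));
    }
}
```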





[jira] [Commented] (PDFBOX-2014) PDAnnotationLink

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988811#comment-13988811
 ] 

Tilman Hausherr commented on PDFBOX-2014:
-

Do you have a sample PDF that has what you want?

> PDAnnotationLink 
> -
>
> Key: PDFBOX-2014
> URL: https://issues.apache.org/jira/browse/PDFBOX-2014
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 1.6.0
> Reporter: WALID CHARFI
>
> Hi,
> I want to draw a text link without any hover effect or solid border.
> I tried this code but it does not work.
> Could you provide me with a solution please? Thank you very much.
> PDBorderStyleDictionary borderULine = new PDBorderStyleDictionary();
> borderULine.setStyle(PDBorderStyleDictionary.STYLE_INSET); 
> PDAnnotationLink txtLink = new PDAnnotationLink();
> txtLink.setRectangle(position);
> PDActionURI action = new PDActionURI();
> action.setURI(pdfPara.getUri());
> txtLink.setAction(action);
> txtLink.setBorderStyle(borderULine);
> txtLink.setHighlightMode(PDAnnotationLink.HIGHLIGHT_MODE_NONE);





[jira] [Updated] (PDFBOX-1961) Page with annotations renders fine with 1.8 but not with 2.0

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-1961:


Labels: Annotations regression  (was: regression)

> Page with annotations renders fine with 1.8 but not with 2.0
> 
>
> Key: PDFBOX-1961
> URL: https://issues.apache.org/jira/browse/PDFBOX-1961
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.0
> Reporter: Tilman Hausherr
> Labels: Annotations, regression
> Fix For: 2.0.0
>
> Attachments: annots.pdf, annots.pdf-2-v18.png, annots.pdf-2-v2.png
>
>
> Page 2 of the attached PDF (from a ghostscript installation) renders fine 
> with 1.8 but not with 2.0. The other pages are not rendered properly with any 
> version.





[jira] [Closed] (PDFBOX-2033) Narrow long pdf is printed blank

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-2033.
---

Resolution: Not a Problem

Thanks, closing this. In retrospect, this was a how-to question rather than a 
bug.

> Narrow long pdf is printed blank
> 
>
> Key: PDFBOX-2033
> URL: https://issues.apache.org/jira/browse/PDFBOX-2033
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.0
> Environment: W7
> Reporter: Tilman Hausherr
>
> Based on the post of Norbert Sándor to the user list:
> When printing a page of 198.425 x 1700.787 points (70 x 600 mm) using new 
> PDFPrinter(pdfDocument, printJob).silentPrint() to a virtual printer (e.g. 
> PDFCreator or CIB), the resulting PDF has a page size of 8.26 x 11.69 inches 
> and the content is horizontally centered on the page. I was able to reproduce 
> the problem. I was also able to print to a virtual printer that creates new 
> PDFs of that size by using the longest constructor of PDFPrinter(), but then 
> the output is blank.
> While it doesn't seem useful to print a PDF to a PDF, the problem might make 
> sense when printing to a cash register receipt printer.
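
The page sizes in the description above mix three units. A small sketch of the
conversions, using a made-up PageUnits helper, shows that 70 x 600 mm does
correspond to the quoted size in points (1 point = 1/72 inch), and that the
page size of the resulting PDF matches A4 (210 x 297 mm) expressed in inches,
up to rounding.

```java
// Hypothetical helper for illustration.
class PageUnits {
    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    /** Converts millimetres to PDF points. */
    static double mmToPoints(double mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    /** Converts millimetres to inches. */
    static double mmToInches(double mm) {
        return mm / MM_PER_INCH;
    }

    public static void main(String[] args) {
        System.out.println("70 mm in points:  " + mmToPoints(70));
        System.out.println("600 mm in points: " + mmToPoints(600));
        System.out.println("A4 width in inches:  " + mmToInches(210));
        System.out.println("A4 height in inches: " + mmToInches(297));
    }
}
```

This supports the reading that the virtual printer fell back to its default
A4 paper instead of using the narrow page size.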





[jira] [Commented] (PDFBOX-2045) Merging PDFs has no effect

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988774#comment-13988774
 ] 

Tilman Hausherr commented on PDFBOX-2045:
-

I tried your test with the file at
http://www.studentenwerk-berlin.de/wohnen/dokumente/41%20%7C%20Anmeldung.pdf
which has a form on page 3 and 4. After merging, all pages can be displayed 
with Acrobat, but the form capability is lost.

> Merging PDFs has no effect
> --
>
> Key: PDFBOX-2045
> URL: https://issues.apache.org/jira/browse/PDFBOX-2045
> Project: PDFBox
> Issue Type: Bug
> Components: AcroForm, Utilities
> Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
> Reporter: Gerhard Temper
> Attachments: specialpdf.pdf
>
>
> Merging attached special PDF (a form) results in a PDF consisting only of the 
> PDF form ignoring all other PDFs without any error.
> Command line to reproduce the problem:
> java -jar pdfbox-app-1.8.4.jar PDFMerger page1.pdf specialpdf.pdf result.pdf





[jira] [Commented] (PDFBOX-2045) Merging PDFs has no effect

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988772#comment-13988772
 ] 

Tilman Hausherr commented on PDFBOX-2045:
-

I slightly clarified your text above to mention that it is a form.

The weird thing is that PDFBox renders all pages, and so does GSView. Only 
Acrobat Viewer doesn't.



Re: 1.8.5 and JIRA

2014-05-03 Thread Andreas Lehmkuehler

Hi,

On 03.05.2014 07:45, Tilman Hausherr wrote:

Hello Andreas,

Thanks for all your work; only one thing is missing, 1.8.5 is still listed as
"unreleased version" in JIRA, e.g. here:
https://issues.apache.org/jira/browse/PDFBOX/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

I wanted to wait until the release is officially announced. Thanks for the 
reminder.


Tilman

On 02.05.2014 09:27, Andreas Lehmkühler wrote:

Hi,

due to the newest PDFBox 1.8.5 release I've closed all 1.8.5 related issues
in a bulk operation. I've disabled the email notification to avoid an email
flood.
I've also added the new version 1.8.6 for our next bugfix release ...

I'll update the download page once the mirrors have copied the version from our
repository.

BR
Andreas Lehmkühler




BR
Andreas Lehmkühler


[jira] [Updated] (PDFBOX-2045) Merging PDFs has no effect

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2045:


Description: 
Merging attached special PDF (a form) results in a PDF consisting only of the 
PDF form ignoring all other PDFs without any error.

Command line to reproduce the problem:
java -jar pdfbox-app-1.8.4.jar PDFMerger page1.pdf specialpdf.pdf result.pdf

  was:
Merging attached PDF results in a PDF consisting only of the special PDF 
ignoring all other PDFs without any error.

Command line to reproduce the problem:
java -jar pdfbox-app-1.8.4.jar PDFMerger page1.pdf specialpdf.pdf result.pdf


> Merging PDFs has no effect
> --
>
> Key: PDFBOX-2045
> URL: https://issues.apache.org/jira/browse/PDFBOX-2045
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm, Utilities
>Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
>Reporter: Gerhard Temper
> Attachments: specialpdf.pdf
>
>
> Merging attached special PDF (a form) results in a PDF consisting only of the 
> PDF form ignoring all other PDFs without any error.
> Command line to reproduce the problem:
> java -jar pdfbox-app-1.8.4.jar PDFMerger page1.pdf specialpdf.pdf result.pdf





[jira] [Updated] (PDFBOX-2045) Merging PDFs has no effect

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2045:


Affects Version/s: 2.0.0
   1.8.6
   1.8.5

> Merging PDFs has no effect
> --
>
> Key: PDFBOX-2045
> URL: https://issues.apache.org/jira/browse/PDFBOX-2045
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm, Utilities
>Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
>Reporter: Gerhard Temper
> Attachments: specialpdf.pdf
>
>
> Merging attached PDF results in a PDF consisting only of the special PDF 
> ignoring all other PDFs without any error.
> Command line to reproduce the problem:
> java -jar pdfbox-app-1.8.4.jar PDFMerger page1.pdf specialpdf.pdf result.pdf





[jira] [Updated] (PDFBOX-2045) Merging PDFs has no effect

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2045:


Component/s: AcroForm

> Merging PDFs has no effect
> --
>
> Key: PDFBOX-2045
> URL: https://issues.apache.org/jira/browse/PDFBOX-2045
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm, Utilities
>Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
>Reporter: Gerhard Temper
> Attachments: specialpdf.pdf
>
>
> Merging attached PDF results in a PDF consisting only of the special PDF 
> ignoring all other PDFs without any error.
> Command line to reproduce the problem:
> java -jar pdfbox-app-1.8.4.jar PDFMerger page1.pdf specialpdf.pdf result.pdf





[jira] [Resolved] (PDFBOX-2054) Remove System.out.println()

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-2054.
-

   Resolution: Fixed
Fix Version/s: 2.0.0
   1.8.6

I'm done. I didn't touch preflight, examples and the GUI tools. Thanks for 
pointing us to these "legacy" problems.

> Remove System.out.println()
> ---
>
> Key: PDFBOX-2054
> URL: https://issues.apache.org/jira/browse/PDFBOX-2054
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
>Reporter: Hong-Thai Nguyen
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 1.8.6, 2.0.0
>
>
> For example at GlyfSimpleDescript.java
> {code}
> ...
> catch (ArrayIndexOutOfBoundsException e)
> {
> System.out.println("error: array index out of bounds");
> }
> {code}
> and also 'printStackTrace' like in PageDrawer.java:
> {code}
> ...
> catch( IOException io )
> {
> io.printStackTrace();
> }
> {code}
> The code should either forward the exception or stay silent.
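
The two options named in the description (forward the exception, or handle it without writing to the console) could look roughly like the sketch below. It is only an illustration: `GlyphReader`, `flagAt`, and `requireFlagAt` are hypothetical names, the fallback value 0 is an assumption, and `java.util.logging` is used only to keep the example self-contained, not because it is what the project uses.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical class, not part of PDFBox; it only illustrates the two options.
class GlyphReader {
    private static final Logger LOG = Logger.getLogger(GlyphReader.class.getName());

    // Option 1: recover locally, but report through a logger instead of System.out.
    static int flagAt(int[] flags, int index) {
        try {
            return flags[index];
        } catch (ArrayIndexOutOfBoundsException e) {
            LOG.log(Level.WARNING, "flag index " + index + " out of bounds", e);
            return 0; // assumed fallback value for this sketch
        }
    }

    // Option 2: forward the failure to the caller and keep the original cause.
    static int requireFlagAt(int[] flags, int index) throws IOException {
        try {
            return flags[index];
        } catch (ArrayIndexOutOfBoundsException e) {
            throw new IOException("flag index " + index + " out of bounds", e);
        }
    }
}
```

With a logger the caller can decide where messages go (or silence them), which is not possible with `System.out.println()` or `printStackTrace()`.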





[jira] [Comment Edited] (PDFBOX-2054) Remove System.out.println()

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988749#comment-13988749
 ] 

Tilman Hausherr edited comment on PDFBOX-2054 at 5/3/14 5:28 PM:
-

Committed a second round in rev 1592251 for the trunk and rev 1592252 & 1592254 
for the 1.8 branch.


was (Author: tilman):
Committed a second round in rev 1592251 for the trunk and rev 1592252 for the 
1.8 branch.

> Remove System.out.println()
> ---
>
> Key: PDFBOX-2054
> URL: https://issues.apache.org/jira/browse/PDFBOX-2054
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
>Reporter: Hong-Thai Nguyen
>Assignee: Tilman Hausherr
>Priority: Minor
>
> For example at GlyfSimpleDescript.java
> {code}
> ...
> catch (ArrayIndexOutOfBoundsException e)
> {
> System.out.println("error: array index out of bounds");
> }
> {code}
> and also 'printStackTrace' like in PageDrawer.java:
> {code}
> ...
> catch( IOException io )
> {
> io.printStackTrace();
> }
> {code}
> The code should either forward the exception or stay silent.





[jira] [Commented] (PDFBOX-2054) Remove System.out.println()

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988749#comment-13988749
 ] 

Tilman Hausherr commented on PDFBOX-2054:
-

Committed a second round in rev 1592251 for the trunk and rev 1592252 for the 
1.8 branch.

> Remove System.out.println()
> ---
>
> Key: PDFBOX-2054
> URL: https://issues.apache.org/jira/browse/PDFBOX-2054
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 1.8.4, 1.8.5, 1.8.6, 2.0.0
>Reporter: Hong-Thai Nguyen
>Assignee: Tilman Hausherr
>Priority: Minor
>
> For example at GlyfSimpleDescript.java
> {code}
> ...
> catch (ArrayIndexOutOfBoundsException e)
> {
> System.out.println("error: array index out of bounds");
> }
> {code}
> and also 'printStackTrace' like in PageDrawer.java:
> {code}
> ...
> catch( IOException io )
> {
> io.printStackTrace();
> }
> {code}
> The code should either forward the exception or stay silent.





[jira] [Resolved] (PDFBOX-1584) Add unit test for RandomAccessFileOutputStream

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-1584.
-

   Resolution: Fixed
Fix Version/s: 2.0.0
   1.8.6

Committed in rev 1592225 for the trunk and 1592226 for the 1.8 branch. Thanks! 
I won't commit PDFBOX-1582 for now as this doesn't seem to be complete.

> Add unit test for RandomAccessFileOutputStream
> --
>
> Key: PDFBOX-1584
> URL: https://issues.apache.org/jira/browse/PDFBOX-1584
> Project: PDFBox
>  Issue Type: Test
>  Components: Writing
>Affects Versions: 1.8.1
>Reporter: Fredrik Kjellberg
>Priority: Minor
> Fix For: 1.8.6, 2.0.0
>
> Attachments: TestRandomAccessFileOutputStream_diff.txt
>
>
> This patch includes a unit test for RandomAccessFileOutputStream





[jira] [Commented] (PDFBOX-1122) Parsing Error, Skipping Object

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988685#comment-13988685
 ] 

Tilman Hausherr commented on PDFBOX-1122:
-

The old URL no longer works. What file did you use?

> Parsing Error, Skipping Object
> --
>
> Key: PDFBOX-1122
> URL: https://issues.apache.org/jira/browse/PDFBOX-1122
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 1.6.0
> Environment: Working with Windows 7 in eclipse.
>Reporter: Raihan Jamal
>Assignee: Andreas Lehmkühler
>  Labels: pdfbox
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Parsing Error, Skipping Object
> java.io.IOException: expected='endstream' actual='' 
> org.apache.pdfbox.io.PushBackInputStream@38011d45
>   at 
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:439)
>   at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552)
>   at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184)
>   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1088)
>   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1053)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>   at org.apache.tika.Tika.parseToString(Tika.java:357)
>   at 
> edu.uci.ics.crawler4j.crawler.BinaryParser.parse(BinaryParser.java:37)
>   at 
> edu.uci.ics.crawler4j.crawler.WebCrawler.handleBinary(WebCrawler.java:223)
>   at 
> edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:462)
>   at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:129)
>   at java.lang.Thread.run(Thread.java:662)
> Did not found XRef object at specified startxref position 0
> This is the sample URL where I am facing this problem:-
> http://www.qualcomm.com/documents/files/rev-b-enhanced-mobile-broadband-for-all.pdf
> Any suggestions as to why this is happening? Or is it a bug?





[jira] [Commented] (PDFBOX-1122) Parsing Error, Skipping Object

2014-05-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988671#comment-13988671
 ] 

Rogério Pereira Araújo commented on PDFBOX-1122:


I can confirm the same error with version 1.8.4 while parsing PDFs with Tika 
during a Nutch parsing job.

> Parsing Error, Skipping Object
> --
>
> Key: PDFBOX-1122
> URL: https://issues.apache.org/jira/browse/PDFBOX-1122
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 1.6.0
> Environment: Working with Windows 7 in eclipse.
>Reporter: Raihan Jamal
>Assignee: Andreas Lehmkühler
>  Labels: pdfbox
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Parsing Error, Skipping Object
> java.io.IOException: expected='endstream' actual='' 
> org.apache.pdfbox.io.PushBackInputStream@38011d45
>   at 
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:439)
>   at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552)
>   at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184)
>   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1088)
>   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1053)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>   at org.apache.tika.Tika.parseToString(Tika.java:357)
>   at 
> edu.uci.ics.crawler4j.crawler.BinaryParser.parse(BinaryParser.java:37)
>   at 
> edu.uci.ics.crawler4j.crawler.WebCrawler.handleBinary(WebCrawler.java:223)
>   at 
> edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:462)
>   at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:129)
>   at java.lang.Thread.run(Thread.java:662)
> Did not found XRef object at specified startxref position 0
> This is the sample URL where I am facing this problem:-
> http://www.qualcomm.com/documents/files/rev-b-enhanced-mobile-broadband-for-all.pdf
> Any suggestions as to why this is happening? Or is it a bug?





[jira] [Closed] (PDFBOX-2053) Issue with PDFBox position reading

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-2053.
---

Resolution: Duplicate

> Issue with PDFBox position reading
> --
>
> Key: PDFBOX-2053
> URL: https://issues.apache.org/jira/browse/PDFBOX-2053
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 1.8.3
>Reporter: Orbel Mkrtchyan
> Attachments: test.pdf
>
>
> Using PDFBox 1.8.4,
> bug #1:
>   PDDocument doc = new PDDocument();
>   doc.load("test-pcc7247.pdf");
>   doc.save("out.pdf");
>   doc.close();
> The resulting file is corrupted, contains 0 pages and cannot be viewed by 
> Acrobat Reader.
> bug #2: consider the following code snippet. The code runs like this:
>   Extractor extractor = new Extractor();
>   extractor.writeText(pdDoc, output);
> Using the code defined like this:
> public class Extractor extends PDFTextStripper {
>     ...
>     protected void writePage() throws IOException
>     {
>         for( int i = 0; i < charactersByArticle.size(); i++)
>         {
>             List textList = charactersByArticle.get( i );
>             Iterator textIter = textList.iterator();
>             while( textIter.hasNext() )
>             {
>                 TextPosition position = (TextPosition)textIter.next();
> In the given piece of code, position variable correctly iterates through the 
> letters of the first line of the provided pdf document, but its coordinates 
> (x, y, widths, etc) are always the same. Just to be clear, 1 position always 
> relates to 1 letter, and its widths array's length always equals 1. So we get 
> the same coordinates for every letter in a line. Expected behaviour is either 
> having new coordinates per letter or having widths[] contain widths for the 
> characters of a whole line of text





[jira] [Commented] (PDFBOX-2053) Issue with PDFBox position reading

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988657#comment-13988657
 ] 

Tilman Hausherr commented on PDFBOX-2053:
-

I updated my fix for PDFBOX-62, and rendering now works for me. The problems 
you see with your Extractor class are caused by the zero-width problem.

Re bug #1: correct your code to:

{code}
PDDocument doc = PDDocument.load("test-pcc7247.pdf");
doc.save("out.pdf");
doc.close();
{code}
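
Why the correction matters can be shown with a small stand-alone sketch. It assumes, as the corrected call suggests, that `load` is a static factory that returns a new document rather than filling the instance it is called on. `Doc` is a hypothetical stand-in class, not the PDFBox API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a document class with a static factory; not the PDFBox API.
class Doc {
    private final List<String> pages = new ArrayList<>();

    // Static factory: builds and returns a NEW document, it never touches an existing one.
    static Doc load(String name) {
        Doc loaded = new Doc();
        loaded.pages.add("page 1 of " + name);
        return loaded;
    }

    int pageCount() {
        return pages.size();
    }

    public static void main(String[] args) {
        Doc wrong = new Doc();
        wrong.load("test.pdf");           // compiles, but the returned document is discarded
        Doc right = Doc.load("test.pdf"); // keeps the document that was actually loaded
        System.out.println(wrong.pageCount() + " vs " + right.pageCount()); // prints "0 vs 1"
    }
}
```

Under that assumption, saving the `wrong` document writes an empty file, which would match the reported "0 pages" result.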


> Issue with PDFBox position reading
> --
>
> Key: PDFBOX-2053
> URL: https://issues.apache.org/jira/browse/PDFBOX-2053
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 1.8.3
>Reporter: Orbel Mkrtchyan
> Attachments: test.pdf
>
>
> Using PDFBox 1.8.4,
> bug #1:
>   PDDocument doc = new PDDocument();
>   doc.load("test-pcc7247.pdf");
>   doc.save("out.pdf");
>   doc.close();
> The resulting file is corrupted, contains 0 pages and cannot be viewed by 
> Acrobat Reader.
> bug #2: consider the following code snippet. The code runs like this:
>   Extractor extractor = new Extractor();
>   extractor.writeText(pdDoc, output);
> Using the code defined like this:
> public class Extractor extends PDFTextStripper {
> ...
> protected void writePage() throws IOException
> {
> for( int i = 0; i < charactersByArticle.size(); i++)
> {
> List textList = charactersByArticle.get( i );
> Iterator textIter = textList.iterator();
> while( textIter.hasNext() )
> {
> TextPosition position = (TextPosition)textIter.next();
> In the given piece of code, position variable correctly iterates through the 
> letters of the first line of the provided pdf document, but its coordinates 
> (x, y, widths, etc) are always the same. Just to be clear, 1 position always 
> relates to 1 letter, and its widths array's length always equals 1. So we get 
> the same coordinates for every letter in a line. Expected behaviour is either 
> having new coordinates per letter or having widths[] contain widths for the 
> characters of a whole line of text





[jira] [Updated] (PDFBOX-62) Incorrect (zero) character widths returned in some docs

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-62?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-62:
--

Attachment: PDTrueTypeFont.diff

Updated the patch to handle the file from PDFBOX-2053; it maps missing Arial 
widths to the Helvetica AFM files.
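
As a rough illustration of the fallback idea (not the contents of the attached diff), a width lookup could first consult the font's own widths array and only then fall back to AFM metrics of a substitute font. Everything in this sketch is hypothetical: the class `WidthLookup`, the substitution table, and the tiny in-memory "AFM" map that stands in for parsed metric files.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; class, method, and table contents are illustrative, not PDFBox code.
class WidthLookup {
    // Assumed substitution table: font without bundled metrics -> metric-compatible base font.
    private static final Map<String, String> SUBSTITUTES = new HashMap<>();
    // Assumed stand-in for parsed AFM data: font name -> (character code -> width).
    private static final Map<String, Map<Integer, Float>> AFM = new HashMap<>();

    static {
        SUBSTITUTES.put("Arial", "Helvetica");
        Map<Integer, Float> helvetica = new HashMap<>();
        helvetica.put((int) 'A', 667f); // sample values only
        helvetica.put((int) 'a', 556f);
        AFM.put("Helvetica", helvetica);
    }

    // Returns the width for a character code, or 0 if no source knows the character.
    static float width(String font, int firstChar, float[] widths, int code) {
        int idx = code - firstChar;
        if (idx >= 0 && idx < widths.length) {
            return widths[idx]; // the font's own widths array covers this code
        }
        String metricsFont = SUBSTITUTES.getOrDefault(font, font);
        Map<Integer, Float> table = AFM.get(metricsFont);
        Float w = (table == null) ? null : table.get(code);
        return (w == null) ? 0f : w;
    }
}
```

Without the substitution step, a code outside the widths array of an "Arial" font would fall through to 0 in this sketch, which is the zero-width symptom described in the issue.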

> Incorrect (zero) character widths returned in some docs
> ---
>
> Key: PDFBOX-62
> URL: https://issues.apache.org/jira/browse/PDFBOX-62
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering, Text extraction
>Assignee: Andreas Lehmkühler
> Attachments: 5542.pdf, PDTrueTypeFont.diff, 
> pdfbox-2006-zerowidth.pdf-1.png, pdfbox-62-zerowidth.pdf-1.png
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1216674
> Originally submitted by tamirhassan on 2005-06-07 13:42.
> For certain PDF documents (such as the one attached) 
> the character/string widths (as obtained e.g. by the 
> PDFont.getStringWidth method) are not returned 
> correctly, i.e. they appear to be correct for punctuation 
> characters but are zero for alphanumeric characters.  
> It seems as if these alphanumeric characters are NOT 
> within PDFont.firstChar and PDFont.lastChar in the 
> Type 1 font.  The method therefore attempts to obtain 
> the font widths from the AFM (font metric) file, but fails 
> (silently) with a 'resource is null' logline message.
> (Note that this problem doesn't seem to occur with Type 
> 1 fonts in other documents.)
> A more detailed discussion regarding this issue can be 
> found in this link:
> http://sourceforge.net/forum/forum.php?thread_id=1260349&forum_id=267205
> Thanks in advance for any help that can be obtained,
> Tam





[jira] [Updated] (PDFBOX-62) Incorrect (zero) character widths returned in some docs

2014-05-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-62?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-62:
--

Attachment: (was: PDTrueTypeFont.diff)

> Incorrect (zero) character widths returned in some docs
> ---
>
> Key: PDFBOX-62
> URL: https://issues.apache.org/jira/browse/PDFBOX-62
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering, Text extraction
>Assignee: Andreas Lehmkühler
> Attachments: 5542.pdf, pdfbox-2006-zerowidth.pdf-1.png, 
> pdfbox-62-zerowidth.pdf-1.png
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1216674
> Originally submitted by tamirhassan on 2005-06-07 13:42.
> For certain PDF documents (such as the one attached) 
> the character/string widths (as obtained e.g. by the 
> PDFont.getStringWidth method) are not returned 
> correctly, i.e. they appear to be correct for punctuation 
> characters but are zero for alphanumeric characters.  
> It seems as if these alphanumeric characters are NOT 
> within PDFont.firstChar and PDFont.lastChar in the 
> Type 1 font.  The method therefore attempts to obtain 
> the font widths from the AFM (font metric) file, but fails 
> (silently) with a 'resource is null' logline message.
> (Note that this problem doesn't seem to occur with Type 
> 1 fonts in other documents.)
> A more detailed discussion regarding this issue can be 
> found in this link:
> http://sourceforge.net/forum/forum.php?thread_id=1260349&forum_id=267205
> Thanks in advance for any help that can be obtained,
> Tam





[jira] [Commented] (PDFBOX-1584) Add unit test for RandomAccessFileOutputStream

2014-05-03 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988643#comment-13988643
 ] 

Tilman Hausherr commented on PDFBOX-1584:
-

No. This happens sometimes :-(

I looked at the patch and then at the timeline of the submitter 
([~frkj]). This is actually part of a bigger discussion about problems with 
RandomAccessFileOutputStream and RandomAccessFileInputStream. I will try to 
understand what this is about. Please be patient.

> Add unit test for RandomAccessFileOutputStream
> --
>
> Key: PDFBOX-1584
> URL: https://issues.apache.org/jira/browse/PDFBOX-1584
> Project: PDFBox
>  Issue Type: Test
>  Components: Writing
>Affects Versions: 1.8.1
>Reporter: Fredrik Kjellberg
>Priority: Minor
> Attachments: TestRandomAccessFileOutputStream_diff.txt
>
>
> This patch includes a unit test for RandomAccessFileOutputStream





[jira] [Commented] (PDFBOX-1584) Add unit test for RandomAccessFileOutputStream

2014-05-03 Thread Ajay Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988605#comment-13988605
 ] 

Ajay Bhat commented on PDFBOX-1584:
---

Has this patch been commited yet?

> Add unit test for RandomAccessFileOutputStream
> --
>
> Key: PDFBOX-1584
> URL: https://issues.apache.org/jira/browse/PDFBOX-1584
> Project: PDFBox
>  Issue Type: Test
>  Components: Writing
>Affects Versions: 1.8.1
>Reporter: Fredrik Kjellberg
>Priority: Minor
> Attachments: TestRandomAccessFileOutputStream_diff.txt
>
>
> This patch includes a unit test for RandomAccessFileOutputStream





[jira] [Comment Edited] (PDFBOX-1584) Add unit test for RandomAccessFileOutputStream

2014-05-03 Thread Ajay Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988605#comment-13988605
 ] 

Ajay Bhat edited comment on PDFBOX-1584 at 5/3/14 7:07 AM:
---

Has this patch been committed?


was (Author: ajay bhat):
Has this patch been commited yet?

> Add unit test for RandomAccessFileOutputStream
> --
>
> Key: PDFBOX-1584
> URL: https://issues.apache.org/jira/browse/PDFBOX-1584
> Project: PDFBox
>  Issue Type: Test
>  Components: Writing
>Affects Versions: 1.8.1
>Reporter: Fredrik Kjellberg
>Priority: Minor
> Attachments: TestRandomAccessFileOutputStream_diff.txt
>
>
> This patch includes a unit test for RandomAccessFileOutputStream


