[jira] [Commented] (PDFBOX-3732) IllegalArgumentException when refreshing an appearance and no font resources are defined

2017-05-10 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005940#comment-16005940
 ] 

ASF subversion and git services commented on PDFBOX-3732:
-

Commit 1794785 from [~msahyoun] in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1794785 ]

PDFBOX-3732: ensure default entries for /DA and /DR when accessing an AcroForm 
and the form doesn't contain these
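
The idea in this commit message can be sketched without PDFBox, using a plain map as a stand-in for the AcroForm dictionary. This is not the code from r1794785: the class name, the helper and the fallback values are assumptions chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an AcroForm dictionary; this is NOT the PDFBox API.
final class AcroFormDefaults {
    // Assumed placeholder default appearance string, for illustration only.
    static final String FALLBACK_DA = "/Helv 0 Tf 0 g";

    // Adds /DA and /DR only when they are absent, so existing entries are kept.
    static Map<String, Object> ensureDefaults(Map<String, Object> form) {
        if (form.get("DA") == null) {
            form.put("DA", FALLBACK_DA);
        }
        if (form.get("DR") == null) {
            Map<String, Object> fonts = new HashMap<>();
            fonts.put("Helv", "Helvetica"); // assumed fallback font entry
            Map<String, Object> resources = new HashMap<>();
            resources.put("Font", fonts);
            form.put("DR", resources);
        }
        return form;
    }

    public static void main(String[] args) {
        Map<String, Object> form = new HashMap<>(); // form with neither /DA nor /DR
        ensureDefaults(form);
        System.out.println(form.containsKey("DA") && form.containsKey("DR"));
    }
}
```

With defaults in place before the appearance is refreshed, a check such as "/DR is a required entry" no longer has a missing entry to reject.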

> IllegalArgumentException when refreshing an appearance and no font resources 
> are defined
> 
>
> Key: PDFBOX-3732
> URL: https://issues.apache.org/jira/browse/PDFBOX-3732
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 2.0.5
>Reporter: simon steiner
> Fix For: 2.0.6, 3.0.0
>
> Attachments: out.pdf, out-reader.pdf, PDFBOX3732-minimal.pdf, 
> PDFBOX3732-minimal-reader.pdf, refreshAppearances.patch
>
>
> PDDocument doc = PDDocument.load(new File("out.pdf"));
> doc.getDocumentCatalog().getAcroForm().setNeedAppearances(false);
> doc.getDocumentCatalog().getAcroForm().refreshAppearances();
> doc.save("pdfbox.pdf");
> doc.close();
> Exception in thread "main" java.lang.IllegalArgumentException: /DR is a 
> required entry
>   at org.apache.pdfbox.pdmodel.interactive.form.PDDefaultAppearanceString.<init>(PDDefaultAppearanceString.java:82)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-3788) java.lang.RuntimeException: java.io.IOException: Catalog cannot be found

2017-05-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005924#comment-16005924
 ] 

Andreas Lehmkühler commented on PDFBOX-3788:


I already had the feeling when I went to bed yesterday that my changes might 
be a bad idea. In any case, I've reverted them.

Thanks [~tilman] for the pointer.

> java.lang.RuntimeException: java.io.IOException: Catalog cannot be found
> 
>
> Key: PDFBOX-3788
> URL: https://issues.apache.org/jira/browse/PDFBOX-3788
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.6
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>  Labels: regression
> Fix For: 2.0.6, 3.0.0
>
> Attachments: genko_oc_shiryo1.pdf, 
> YVFDWHF767TEYTT7IVFSLUIJTDF3YP57.pdf
>
>
> This file was parsed in 2.0.5 but no longer now:
> {code}
> Caused by: java.io.IOException: Catalog cannot be found
> org.apache.pdfbox.cos.COSDocument.getCatalog(COSDocument.java:373)
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:238)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:310)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1000)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:938)
> 
> org.apache.pdfbox.debugger.PDFDebugger.parseDocument(PDFDebugger.java:1288)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1209)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1194)
> {code}






[jira] [Commented] (PDFBOX-3732) IllegalArgumentException when refreshing an appearance and no font resources are defined

2017-05-10 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005921#comment-16005921
 ] 

ASF subversion and git services commented on PDFBOX-3732:
-

Commit 1794784 from [~msahyoun] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1794784 ]

PDFBOX-3732: ensure default entries for /DA and /DR when accessing an AcroForm 
and the form doesn't contain these

> IllegalArgumentException when refreshing an appearance and no font resources 
> are defined
> 
>
> Key: PDFBOX-3732
> URL: https://issues.apache.org/jira/browse/PDFBOX-3732
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 2.0.5
>Reporter: simon steiner
> Fix For: 2.0.6, 3.0.0
>
> Attachments: out.pdf, out-reader.pdf, PDFBOX3732-minimal.pdf, 
> PDFBOX3732-minimal-reader.pdf, refreshAppearances.patch
>
>
> PDDocument doc = PDDocument.load(new File("out.pdf"));
> doc.getDocumentCatalog().getAcroForm().setNeedAppearances(false);
> doc.getDocumentCatalog().getAcroForm().refreshAppearances();
> doc.save("pdfbox.pdf");
> doc.close();
> Exception in thread "main" java.lang.IllegalArgumentException: /DR is a 
> required entry
>   at org.apache.pdfbox.pdmodel.interactive.form.PDDefaultAppearanceString.<init>(PDDefaultAppearanceString.java:82)






[jira] [Commented] (PDFBOX-3788) java.lang.RuntimeException: java.io.IOException: Catalog cannot be found

2017-05-10 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005919#comment-16005919
 ] 

ASF subversion and git services commented on PDFBOX-3788:
-

Commit 1794783 from [~lehmi] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1794783 ]

PDFBOX-3788: revert former changes due to a regression

> java.lang.RuntimeException: java.io.IOException: Catalog cannot be found
> 
>
> Key: PDFBOX-3788
> URL: https://issues.apache.org/jira/browse/PDFBOX-3788
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.6
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>  Labels: regression
> Fix For: 2.0.6, 3.0.0
>
> Attachments: genko_oc_shiryo1.pdf, 
> YVFDWHF767TEYTT7IVFSLUIJTDF3YP57.pdf
>
>
> This file was parsed in 2.0.5 but no longer now:
> {code}
> Caused by: java.io.IOException: Catalog cannot be found
> org.apache.pdfbox.cos.COSDocument.getCatalog(COSDocument.java:373)
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:238)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:310)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1000)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:938)
> 
> org.apache.pdfbox.debugger.PDFDebugger.parseDocument(PDFDebugger.java:1288)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1209)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1194)
> {code}






[jira] [Commented] (PDFBOX-3788) java.lang.RuntimeException: java.io.IOException: Catalog cannot be found

2017-05-10 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005918#comment-16005918
 ] 

ASF subversion and git services commented on PDFBOX-3788:
-

Commit 1794782 from [~lehmi] in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1794782 ]

PDFBOX-3788: revert former changes due to a regression

> java.lang.RuntimeException: java.io.IOException: Catalog cannot be found
> 
>
> Key: PDFBOX-3788
> URL: https://issues.apache.org/jira/browse/PDFBOX-3788
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.6
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>  Labels: regression
> Fix For: 2.0.6, 3.0.0
>
> Attachments: genko_oc_shiryo1.pdf, 
> YVFDWHF767TEYTT7IVFSLUIJTDF3YP57.pdf
>
>
> This file was parsed in 2.0.5 but no longer now:
> {code}
> Caused by: java.io.IOException: Catalog cannot be found
> org.apache.pdfbox.cos.COSDocument.getCatalog(COSDocument.java:373)
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:238)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:310)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1000)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:938)
> 
> org.apache.pdfbox.debugger.PDFDebugger.parseDocument(PDFDebugger.java:1288)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1209)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1194)
> {code}






[jira] [Resolved] (PDFBOX-3789) Some text missing in rendering

2017-05-10 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-3789.
-
   Resolution: Fixed
Fix Version/s: 3.0.0, 2.0.6

> Some text missing in rendering
> --
>
> Key: PDFBOX-3789
> URL: https://issues.apache.org/jira/browse/PDFBOX-3789
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>  Labels: regression
> Fix For: 2.0.6, 3.0.0
>
> Attachments: PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V.pdf, 
> PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V_unc.pdf
>
>
> The text in the table is missing, it was there in 2.0.5. I suspect it is due 
> to the missing width (Adobe mentions it). The file is truncated but is 
> parsed; the error happens also when saving the parsed file and rendering that 
> one.






[jira] [Commented] (PDFBOX-3789) Some text missing in rendering

2017-05-10 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005514#comment-16005514
 ] 

Tilman Hausherr commented on PDFBOX-3789:
-

I could also have rewritten containsKey() to return false if the entry is null, 
but that isn't the same.
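
The distinction in this comment can be illustrated with an ordinary java.util.Map, used here as a stand-in for a font dictionary. The class and method names are hypothetical and this is not the PDFBox implementation; it only shows why "key present" and "value usable" are different questions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; the names here are not part of PDFBox.
final class WidthsCheck {
    // True only when the key is present AND maps to a non-null value.
    static boolean hasUsableWidths(Map<String, Object> fontDict) {
        return fontDict.get("Widths") != null;
    }

    public static void main(String[] args) {
        Map<String, Object> fontDict = new HashMap<>();
        fontDict.put("Widths", null); // key is present, but its value is null

        System.out.println(fontDict.containsKey("Widths")); // true
        System.out.println(hasUsableWidths(fontDict));      // false
    }
}
```

Changing what containsKey() reports would affect every caller, whereas a null check at the one place that needs the widths leaves the general contract alone.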

> Some text missing in rendering
> --
>
> Key: PDFBOX-3789
> URL: https://issues.apache.org/jira/browse/PDFBOX-3789
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>  Labels: regression
> Attachments: PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V.pdf, 
> PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V_unc.pdf
>
>
> The text in the table is missing, it was there in 2.0.5. I suspect it is due 
> to the missing width (Adobe mentions it). The file is truncated but is 
> parsed; the error happens also when saving the parsed file and rendering that 
> one.






[jira] [Updated] (PDFBOX-3789) Some text missing in rendering

2017-05-10 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3789:

Affects Version/s: (was: 2.0.6)

> Some text missing in rendering
> --
>
> Key: PDFBOX-3789
> URL: https://issues.apache.org/jira/browse/PDFBOX-3789
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>  Labels: regression
> Attachments: PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V.pdf, 
> PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V_unc.pdf
>
>
> The text in the table is missing, it was there in 2.0.5. I suspect it is due 
> to the missing width (Adobe mentions it). The file is truncated but is 
> parsed; the error happens also when saving the parsed file and rendering that 
> one.






[jira] [Commented] (PDFBOX-3789) Some text missing in rendering

2017-05-10 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005512#comment-16005512
 ] 

ASF subversion and git services commented on PDFBOX-3789:
-

Commit 1794767 from [~tilman] in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1794767 ]

PDFBOX-3789: treat /WIDTHS with null entry as if /WIDTHS was missing

> Some text missing in rendering
> --
>
> Key: PDFBOX-3789
> URL: https://issues.apache.org/jira/browse/PDFBOX-3789
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.6
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>  Labels: regression
> Attachments: PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V.pdf, 
> PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V_unc.pdf
>
>
> The text in the table is missing, it was there in 2.0.5. I suspect it is due 
> to the missing width (Adobe mentions it). The file is truncated but is 
> parsed; the error happens also when saving the parsed file and rendering that 
> one.






[jira] [Commented] (PDFBOX-3789) Some text missing in rendering

2017-05-10 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005511#comment-16005511
 ] 

ASF subversion and git services commented on PDFBOX-3789:
-

Commit 1794766 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1794766 ]

PDFBOX-3789: treat /WIDTHS with null entry as if /WIDTHS was missing

> Some text missing in rendering
> --
>
> Key: PDFBOX-3789
> URL: https://issues.apache.org/jira/browse/PDFBOX-3789
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.6
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>  Labels: regression
> Attachments: PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V.pdf, 
> PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V_unc.pdf
>
>
> The text in the table is missing, it was there in 2.0.5. I suspect it is due 
> to the missing width (Adobe mentions it). The file is truncated but is 
> parsed; the error happens also when saving the parsed file and rendering that 
> one.






[jira] [Commented] (PDFBOX-3788) java.lang.RuntimeException: java.io.IOException: Catalog cannot be found

2017-05-10 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005498#comment-16005498
 ] 

Tilman Hausherr commented on PDFBOX-3788:
-

More problems:
- PDFBOX-3714-2.pdf: the signature can no longer be seen
- PDFBOX-2990 and PDFBOX-3369: same exception


> java.lang.RuntimeException: java.io.IOException: Catalog cannot be found
> 
>
> Key: PDFBOX-3788
> URL: https://issues.apache.org/jira/browse/PDFBOX-3788
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.6
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>  Labels: regression
> Fix For: 2.0.6, 3.0.0
>
> Attachments: genko_oc_shiryo1.pdf, 
> YVFDWHF767TEYTT7IVFSLUIJTDF3YP57.pdf
>
>
> This file was parsed in 2.0.5 but no longer now:
> {code}
> Caused by: java.io.IOException: Catalog cannot be found
> org.apache.pdfbox.cos.COSDocument.getCatalog(COSDocument.java:373)
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:238)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:310)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1000)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:938)
> 
> org.apache.pdfbox.debugger.PDFDebugger.parseDocument(PDFDebugger.java:1288)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1209)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1194)
> {code}






[jira] [Updated] (PDFBOX-3788) java.lang.RuntimeException: java.io.IOException: Catalog cannot be found

2017-05-10 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3788:

Attachment: genko_oc_shiryo1.pdf

> java.lang.RuntimeException: java.io.IOException: Catalog cannot be found
> 
>
> Key: PDFBOX-3788
> URL: https://issues.apache.org/jira/browse/PDFBOX-3788
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.6
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>  Labels: regression
> Fix For: 2.0.6, 3.0.0
>
> Attachments: genko_oc_shiryo1.pdf, 
> YVFDWHF767TEYTT7IVFSLUIJTDF3YP57.pdf
>
>
> This file was parsed in 2.0.5 but no longer now:
> {code}
> Caused by: java.io.IOException: Catalog cannot be found
> org.apache.pdfbox.cos.COSDocument.getCatalog(COSDocument.java:373)
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:238)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:310)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1000)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:938)
> 
> org.apache.pdfbox.debugger.PDFDebugger.parseDocument(PDFDebugger.java:1288)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1209)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1194)
> {code}






[jira] [Reopened] (PDFBOX-3788) java.lang.RuntimeException: java.io.IOException: Catalog cannot be found

2017-05-10 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reopened PDFBOX-3788:
-

{code}
java.io.IOException: Missing root object specification in trailer.

org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2128)
org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:227)
org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:276)
org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1005)
{code}
with the attached file

> java.lang.RuntimeException: java.io.IOException: Catalog cannot be found
> 
>
> Key: PDFBOX-3788
> URL: https://issues.apache.org/jira/browse/PDFBOX-3788
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.6
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>  Labels: regression
> Fix For: 2.0.6, 3.0.0
>
> Attachments: YVFDWHF767TEYTT7IVFSLUIJTDF3YP57.pdf
>
>
> This file was parsed in 2.0.5 but no longer now:
> {code}
> Caused by: java.io.IOException: Catalog cannot be found
> org.apache.pdfbox.cos.COSDocument.getCatalog(COSDocument.java:373)
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:238)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:310)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1000)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:938)
> 
> org.apache.pdfbox.debugger.PDFDebugger.parseDocument(PDFDebugger.java:1288)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1209)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1194)
> {code}






[jira] [Commented] (PDFBOX-3783) java.io.IOException: Expected root dictionary, but got this: COSNull{}

2017-05-10 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005462#comment-16005462
 ] 

ASF subversion and git services commented on PDFBOX-3783:
-

Commit 1794764 from [~lehmi] in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1794764 ]

PDFBOX-3783: removed misleading comment

> java.io.IOException: Expected root dictionary, but got this: COSNull{}
> --
>
> Key: PDFBOX-3783
> URL: https://issues.apache.org/jira/browse/PDFBOX-3783
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.6
>Reporter: Tilman Hausherr
>Assignee: Andreas Lehmkühler
>  Labels: regression
> Fix For: 2.0.6, 3.0.0
>
> Attachments: PDFBOX-3783-72GLBIGUC6LB46ELZFBARRJTLN4RBSQM.pdf
>
>
> This file was parsed in 2.0.5 but no longer now:
> {code}
> java.io.IOException: Expected root dictionary, but got this: COSNull{}
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:230)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:276)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1005)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:943)
> 
> org.apache.pdfbox.debugger.PDFDebugger.parseDocument(PDFDebugger.java:1375)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1293)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1276)
> org.apache.pdfbox.debugger.PDFDebugger.main(PDFDebugger.java:262)
> org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:85)
> {code}






[jira] [Commented] (PDFBOX-3783) java.io.IOException: Expected root dictionary, but got this: COSNull{}

2017-05-10 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005443#comment-16005443
 ] 

ASF subversion and git services commented on PDFBOX-3783:
-

Commit 1794762 from [~lehmi] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1794762 ]

PDFBOX-3783: removed misleading comment

> java.io.IOException: Expected root dictionary, but got this: COSNull{}
> --
>
> Key: PDFBOX-3783
> URL: https://issues.apache.org/jira/browse/PDFBOX-3783
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.6
>Reporter: Tilman Hausherr
>Assignee: Andreas Lehmkühler
>  Labels: regression
> Fix For: 2.0.6, 3.0.0
>
> Attachments: PDFBOX-3783-72GLBIGUC6LB46ELZFBARRJTLN4RBSQM.pdf
>
>
> This file was parsed in 2.0.5 but no longer now:
> {code}
> java.io.IOException: Expected root dictionary, but got this: COSNull{}
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:230)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:276)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1005)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:943)
> 
> org.apache.pdfbox.debugger.PDFDebugger.parseDocument(PDFDebugger.java:1375)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1293)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1276)
> org.apache.pdfbox.debugger.PDFDebugger.main(PDFDebugger.java:262)
> org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:85)
> {code}






[jira] [Updated] (PDFBOX-3789) Some text missing in rendering

2017-05-10 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3789:

Component/s: PDModel

> Some text missing in rendering
> --
>
> Key: PDFBOX-3789
> URL: https://issues.apache.org/jira/browse/PDFBOX-3789
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.6
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>  Labels: regression
> Attachments: PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V.pdf, 
> PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V_unc.pdf
>
>
> The text in the table is missing, it was there in 2.0.5. I suspect it is due 
> to the missing width (Adobe mentions it). The file is truncated but is 
> parsed; the error happens also when saving the parsed file and rendering that 
> one.






[jira] [Updated] (PDFBOX-3789) Some text missing in rendering

2017-05-10 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3789:

Description: The text in the table is missing, it was there in 2.0.5. I 
suspect it is due to the missing width (Adobe mentions it). The file is 
truncated but is parsed; the error happens also when saving the parsed file and 
rendering that one.  (was: The text in the table is missing. I suspect it is 
due to the missing width (Adobe mentions it). The file is truncated but is 
parsed; the error happens also when saving the parsed file and rendering that 
one.)

> Some text missing in rendering
> --
>
> Key: PDFBOX-3789
> URL: https://issues.apache.org/jira/browse/PDFBOX-3789
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 2.0.6
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>  Labels: regression
> Attachments: PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V.pdf, 
> PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V_unc.pdf
>
>
> The text in the table is missing, it was there in 2.0.5. I suspect it is due 
> to the missing width (Adobe mentions it). The file is truncated but is 
> parsed; the error happens also when saving the parsed file and rendering that 
> one.






[jira] [Updated] (PDFBOX-3789) Some text missing in rendering

2017-05-10 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3789:

Labels: regression  (was: )

> Some text missing in rendering
> --
>
> Key: PDFBOX-3789
> URL: https://issues.apache.org/jira/browse/PDFBOX-3789
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 2.0.6
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>  Labels: regression
> Attachments: PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V.pdf, 
> PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V_unc.pdf
>
>
> The text in the table is missing. I suspect it is due to the missing width 
> (Adobe mentions it). The file is truncated but is parsed; the error happens 
> also when saving the parsed file and rendering that one.






[jira] [Updated] (PDFBOX-3789) Some text missing in rendering

2017-05-10 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3789:

Attachment: PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V.pdf
PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V_unc.pdf

> Some text missing in rendering
> --
>
> Key: PDFBOX-3789
> URL: https://issues.apache.org/jira/browse/PDFBOX-3789
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 2.0.6
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
> Attachments: PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V.pdf, 
> PDFBOX-3789-4KBI7ITHG6MSXR7DOTKZX6DQZJ5UF64V_unc.pdf
>
>
> The text in the table is missing. I suspect it is due to the missing width 
> (Adobe mentions it). The file is truncated but is parsed; the error happens 
> also when saving the parsed file and rendering that one.






[jira] [Created] (PDFBOX-3789) Some text missing in rendering

2017-05-10 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created PDFBOX-3789:
---

 Summary: Some text missing in rendering
 Key: PDFBOX-3789
 URL: https://issues.apache.org/jira/browse/PDFBOX-3789
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 2.0.6
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr


The text in the table is missing. I suspect it is due to the missing width 
(Adobe mentions it). The file is truncated but is parsed; the error happens 
also when saving the parsed file and rendering that one.






[jira] [Resolved] (PDFBOX-3788) java.lang.RuntimeException: java.io.IOException: Catalog cannot be found

2017-05-10 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-3788.

Resolution: Fixed

I've removed the repair mechanism that was triggered while parsing the xref 
information. Now an IOException is thrown and the rebuildTrailer mechanism is 
called instead.
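
The control flow described in this comment (fail with an IOException, then fall back to rebuilding) can be sketched as follows. This is a generic illustration of the pattern and not the PDFBox parser code; TrailerFallback, TrailerSource and loadTrailer are hypothetical names, and the strings stand in for a parsed trailer.

```java
import java.io.IOException;
import java.util.function.Supplier;

// Hypothetical sketch of "throw, then rebuild"; not taken from PDFParser.
final class TrailerFallback {
    // A trailer reader that may fail with an IOException.
    interface TrailerSource {
        String read() throws IOException;
    }

    // Tries the regular xref-based path first; on IOException uses the rebuild path.
    static String loadTrailer(TrailerSource xrefPath, Supplier<String> rebuildPath) {
        try {
            return xrefPath.read();
        } catch (IOException e) {
            // No in-place repair here: the failure is handed to the fallback.
            return rebuildPath.get();
        }
    }

    public static void main(String[] args) {
        String trailer = loadTrailer(
                () -> { throw new IOException("Catalog cannot be found"); },
                () -> "rebuilt trailer");
        System.out.println(trailer);
    }
}
```

Keeping the repair out of the primary path means there is a single recovery route to test when a regression such as this one shows up.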

> java.lang.RuntimeException: java.io.IOException: Catalog cannot be found
> 
>
> Key: PDFBOX-3788
> URL: https://issues.apache.org/jira/browse/PDFBOX-3788
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.6
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>  Labels: regression
> Fix For: 2.0.6, 3.0.0
>
> Attachments: YVFDWHF767TEYTT7IVFSLUIJTDF3YP57.pdf
>
>
> This file was parsed in 2.0.5 but no longer now:
> {code}
> Caused by: java.io.IOException: Catalog cannot be found
> org.apache.pdfbox.cos.COSDocument.getCatalog(COSDocument.java:373)
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:238)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:310)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1000)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:938)
> 
> org.apache.pdfbox.debugger.PDFDebugger.parseDocument(PDFDebugger.java:1288)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1209)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1194)
> {code}






[jira] [Commented] (PDFBOX-3788) java.lang.RuntimeException: java.io.IOException: Catalog cannot be found

2017-05-10 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005305#comment-16005305
 ] 

ASF subversion and git services commented on PDFBOX-3788:
-

Commit 1794754 from [~lehmi] in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1794754 ]

PDFBOX-3788: optimized debug message

> java.lang.RuntimeException: java.io.IOException: Catalog cannot be found
> 
>
> Key: PDFBOX-3788
> URL: https://issues.apache.org/jira/browse/PDFBOX-3788
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.6
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>  Labels: regression
> Fix For: 2.0.6, 3.0.0
>
> Attachments: YVFDWHF767TEYTT7IVFSLUIJTDF3YP57.pdf
>
>
> This file was parsed in 2.0.5 but no longer now:
> {code}
> Caused by: java.io.IOException: Catalog cannot be found
> org.apache.pdfbox.cos.COSDocument.getCatalog(COSDocument.java:373)
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:238)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:310)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1000)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:938)
> 
> org.apache.pdfbox.debugger.PDFDebugger.parseDocument(PDFDebugger.java:1288)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1209)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1194)
> {code}






[jira] [Commented] (PDFBOX-3788) java.lang.RuntimeException: java.io.IOException: Catalog cannot be found

2017-05-10 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005304#comment-16005304
 ] 

ASF subversion and git services commented on PDFBOX-3788:
-

Commit 1794753 from [~lehmi] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1794753 ]

PDFBOX-3788: optimized debug message

> java.lang.RuntimeException: java.io.IOException: Catalog cannot be found
> 
>
> Key: PDFBOX-3788
> URL: https://issues.apache.org/jira/browse/PDFBOX-3788
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.6
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>  Labels: regression
> Fix For: 2.0.6, 3.0.0
>
> Attachments: YVFDWHF767TEYTT7IVFSLUIJTDF3YP57.pdf
>
>
> This file was parsed in 2.0.5 but no longer now:
> {code}
> Caused by: java.io.IOException: Catalog cannot be found
> org.apache.pdfbox.cos.COSDocument.getCatalog(COSDocument.java:373)
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:238)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:310)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1000)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:938)
> 
> org.apache.pdfbox.debugger.PDFDebugger.parseDocument(PDFDebugger.java:1288)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1209)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1194)
> {code}






[jira] [Commented] (PDFBOX-3788) java.lang.RuntimeException: java.io.IOException: Catalog cannot be found

2017-05-10 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005294#comment-16005294
 ] 

ASF subversion and git services commented on PDFBOX-3788:
-

Commit 1794751 from [~lehmi] in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1794751 ]

PDFBOX-3788: remove repair mechanism when parsing the xref information, trigger 
rebuilding the trailer instead

> java.lang.RuntimeException: java.io.IOException: Catalog cannot be found
> 
>
> Key: PDFBOX-3788
> URL: https://issues.apache.org/jira/browse/PDFBOX-3788
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.6
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>  Labels: regression
> Fix For: 2.0.6, 3.0.0
>
> Attachments: YVFDWHF767TEYTT7IVFSLUIJTDF3YP57.pdf
>
>
> This file was parsed in 2.0.5 but no longer now:
> {code}
> Caused by: java.io.IOException: Catalog cannot be found
> org.apache.pdfbox.cos.COSDocument.getCatalog(COSDocument.java:373)
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:238)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:310)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1000)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:938)
> 
> org.apache.pdfbox.debugger.PDFDebugger.parseDocument(PDFDebugger.java:1288)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1209)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1194)
> {code}






[jira] [Updated] (PDFBOX-3788) java.lang.RuntimeException: java.io.IOException: Catalog cannot be found

2017-05-10 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-3788:
---
Fix Version/s: 3.0.0
   2.0.6

> java.lang.RuntimeException: java.io.IOException: Catalog cannot be found
> 
>
> Key: PDFBOX-3788
> URL: https://issues.apache.org/jira/browse/PDFBOX-3788
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.6
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>  Labels: regression
> Fix For: 2.0.6, 3.0.0
>
> Attachments: YVFDWHF767TEYTT7IVFSLUIJTDF3YP57.pdf
>
>
> This file was parsed in 2.0.5 but no longer now:
> {code}
> Caused by: java.io.IOException: Catalog cannot be found
> org.apache.pdfbox.cos.COSDocument.getCatalog(COSDocument.java:373)
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:238)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:310)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1000)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:938)
> 
> org.apache.pdfbox.debugger.PDFDebugger.parseDocument(PDFDebugger.java:1288)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1209)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1194)
> {code}






[jira] [Commented] (PDFBOX-3788) java.lang.RuntimeException: java.io.IOException: Catalog cannot be found

2017-05-10 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005284#comment-16005284
 ] 

ASF subversion and git services commented on PDFBOX-3788:
-

Commit 1794750 from [~lehmi] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1794750 ]

PDFBOX-3788: remove repair mechanism when parsing the xref information, trigger 
rebuilding the trailer instead

> java.lang.RuntimeException: java.io.IOException: Catalog cannot be found
> 
>
> Key: PDFBOX-3788
> URL: https://issues.apache.org/jira/browse/PDFBOX-3788
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.6
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>  Labels: regression
> Attachments: YVFDWHF767TEYTT7IVFSLUIJTDF3YP57.pdf
>
>
> This file was parsed in 2.0.5 but no longer now:
> {code}
> Caused by: java.io.IOException: Catalog cannot be found
> org.apache.pdfbox.cos.COSDocument.getCatalog(COSDocument.java:373)
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:238)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:310)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1000)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:938)
> 
> org.apache.pdfbox.debugger.PDFDebugger.parseDocument(PDFDebugger.java:1288)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1209)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1194)
> {code}






Re: 2.0.6 release ?

2017-05-10 Thread Andreas Lehmkuehler

On 10.05.2017 at 17:12, Tilman Hausherr wrote:
> Thanks for the test... the sum is still negative, but if we'd ignore the 
> truncated files I bet we'd be positive.
>
> I have downloaded a few of the regressions but won't create issues this time, as 
> yesterday's turned out to be duplicates. I'll wait for Andreas' next commit and 
> will create issues only if these aren't solved.

I guess the new exceptions aren't related. I've already created an issue for the 
first one, PDFBOX-3788.
I haven't had a chance to look at the second file. I just tested my fix for the 
first one and it still fails.

> @Andreas - ping me if you didn't keep the "secret" URL.

It isn't that secret, as Tim posted it somewhere in this thread ...

> Some misc thoughts...
>
> 039800.pdf: "refinery's" is a different token than refinery. Shouldn't 
> "refinery's" be three tokens? I mention this because refinery is probably in a 
> dictionary.
>
> Some differences are because of a different treatment of the space in bad fonts. 
> Some were improved, and some now look like this "C I T I E S W I T H O U T D R U 
> G S". There is an open issue about these. It is tricky because if we treat these 
> like 1 word, we'd also lose spaces where we don't want.
>
> commoncrawl2/5N/5NSKV4CTVY4KT7R2FGY4XJDIK4PRLA4Z I can't find. I used 
> http://XXX.XXX.XXX.XXX/docs/commoncrawl2/5N/5NSKV4CTVY4KT7R2FGY4XJDIK4PRLA4Z
>
> Tilman
>
> On 10.05.2017 at 11:42, Allison, Timothy B. wrote:
> > Haven't had a chance to look. Reports are here:
> > http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz




[jira] [Commented] (PDFBOX-3788) java.lang.RuntimeException: java.io.IOException: Catalog cannot be found

2017-05-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005178#comment-16005178
 ] 

Andreas Lehmkühler commented on PDFBOX-3788:


I've already found the cause and a possible solution, but some of the Isartor 
tests fail.

> java.lang.RuntimeException: java.io.IOException: Catalog cannot be found
> 
>
> Key: PDFBOX-3788
> URL: https://issues.apache.org/jira/browse/PDFBOX-3788
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.6
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>  Labels: regression
> Attachments: YVFDWHF767TEYTT7IVFSLUIJTDF3YP57.pdf
>
>
> This file was parsed in 2.0.5 but no longer now:
> {code}
> Caused by: java.io.IOException: Catalog cannot be found
> org.apache.pdfbox.cos.COSDocument.getCatalog(COSDocument.java:373)
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:238)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:310)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1000)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:938)
> 
> org.apache.pdfbox.debugger.PDFDebugger.parseDocument(PDFDebugger.java:1288)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1209)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1194)
> {code}






[jira] [Created] (PDFBOX-3788) java.lang.RuntimeException: java.io.IOException: Catalog cannot be found

2017-05-10 Thread JIRA
Andreas Lehmkühler created PDFBOX-3788:
--

 Summary: java.lang.RuntimeException: java.io.IOException: Catalog 
cannot be found
 Key: PDFBOX-3788
 URL: https://issues.apache.org/jira/browse/PDFBOX-3788
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 2.0.6
Reporter: Andreas Lehmkühler
Assignee: Andreas Lehmkühler
 Attachments: YVFDWHF767TEYTT7IVFSLUIJTDF3YP57.pdf

This file was parsed in 2.0.5 but no longer now:
{code}
Caused by: java.io.IOException: Catalog cannot be found
org.apache.pdfbox.cos.COSDocument.getCatalog(COSDocument.java:373)
org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:238)
org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:310)
org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1000)
org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:938)
org.apache.pdfbox.debugger.PDFDebugger.parseDocument(PDFDebugger.java:1288)
org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1209)
org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1194)
{code}






[jira] [Commented] (PDFBOX-3782) Text extraction loses whitespace

2017-05-10 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004913#comment-16004913
 ] 

Tilman Hausherr commented on PDFBOX-3782:
-

That problem also occurs with Adobe Reader:
{code}
such as“BC/AD”,“a.m./p.m.”,“FBI”, and“CD”
{code}
The spaces are missing because some of the glyphs have a larger width than what 
is black. You can see this by selecting the quote mark before FBI.

In theory, we could calculate our own widths from the font paths instead of 
trusting the fonts. But this might bring some new surprises (and it would be 
slower).
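
To make the "width versus what is black" distinction concrete, here is a minimal sketch that assumes the glyph outline is already available as a java.awt.geom.Path2D in glyph units. It only compares the advance width stored in the font with the width of the drawn outline; the class and method names are hypothetical and this is not how PDFBox computes widths.

```java
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;

public class InkWidthSketch {
    /** Width of the drawn ("black") part of a glyph outline, in glyph units. */
    public static double inkWidth(Path2D outline) {
        Rectangle2D bounds = outline.getBounds2D();
        return bounds.getWidth();
    }

    /** Part of the advance width that is not covered by ink (both side bearings together). */
    public static double emptyAdvance(Path2D outline, double advanceWidth) {
        return advanceWidth - inkWidth(outline);
    }
}
```

A glyph whose emptyAdvance is large already "contains" the visual gap that a reader perceives as a space, so no separate space glyph or positioning gap shows up in the content stream.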

> Text extraction loses whitespace
> 
>
> Key: PDFBOX-3782
> URL: https://issues.apache.org/jira/browse/PDFBOX-3782
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.4, 2.0.5, 2.0.6
> Environment: Java/Tika
>Reporter: Tony Bray
>Priority: Minor
> Attachments: PDFBOX-3782-reduced.pdf, Test doc - Japanese writing 
> system - Kanji Hiragana Katakana.pdf, Test doc - Japanese writing system - 
> Kanji Hiragana Katakana.txt
>
>
> I have a PDF document that I am using Tika/PDFBox to extract the content.  In 
> several areas, the content extracted loses the whitespace, causing a 
> tokenization problem for indexing/searching.  
> I have attached the original document and the text output.  If you search 
> (Ctrl+f) the text document for "Another example".  Here you will see no space 
> after "is" and the Japanese text.  The same issue shows for 
> "whichmeans"eraser"" at the end of the sentence.  
> Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”
> I get the warning "WARNING: No Unicode mapping for CID+0 (0) in font 
> RGOFPX+IPAexMincho" during extraction but have been unable to find any 
> information on it.






[jira] [Commented] (PDFBOX-3782) Text extraction loses whitespace

2017-05-10 Thread Tony Bray (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004877#comment-16004877
 ] 

Tony Bray commented on PDFBOX-3782:
---

It seems to be around the quotes and other punctuation.  Here's an example 
sentence with spacing not honored:
To a lesser extent, modern written Japanese also uses acronyms from the Latin 
alphabet, for example in terms such as“BC/AD”,“a.m./p.m.”,“FBI”, and“CD” . 

> Text extraction loses whitespace
> 
>
> Key: PDFBOX-3782
> URL: https://issues.apache.org/jira/browse/PDFBOX-3782
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.4, 2.0.5, 2.0.6
> Environment: Java/Tika
>Reporter: Tony Bray
>Priority: Minor
> Attachments: PDFBOX-3782-reduced.pdf, Test doc - Japanese writing 
> system - Kanji Hiragana Katakana.pdf, Test doc - Japanese writing system - 
> Kanji Hiragana Katakana.txt
>
>
> I have a PDF document that I am using Tika/PDFBox to extract the content.  In 
> several areas, the content extracted loses the whitespace, causing a 
> tokenization problem for indexing/searching.  
> I have attached the original document and the text output.  If you search 
> (Ctrl+f) the text document for "Another example".  Here you will see no space 
> after "is" and the Japanese text.  The same issue shows for 
> "whichmeans"eraser"" at the end of the sentence.  
> Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”
> I get the warning "WARNING: No Unicode mapping for CID+0 (0) in font 
> RGOFPX+IPAexMincho" during extraction but have been unable to find any 
> information on it.






[jira] [Commented] (PDFBOX-3782) Text extraction loses whitespace

2017-05-10 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004860#comment-16004860
 ] 

Tilman Hausherr commented on PDFBOX-3782:
-

Can you tell what part "resisted" the extraction?

> Text extraction loses whitespace
> 
>
> Key: PDFBOX-3782
> URL: https://issues.apache.org/jira/browse/PDFBOX-3782
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.4, 2.0.5, 2.0.6
> Environment: Java/Tika
>Reporter: Tony Bray
>Priority: Minor
> Attachments: PDFBOX-3782-reduced.pdf, Test doc - Japanese writing 
> system - Kanji Hiragana Katakana.pdf, Test doc - Japanese writing system - 
> Kanji Hiragana Katakana.txt
>
>
> I have a PDF document that I am using Tika/PDFBox to extract the content.  In 
> several areas, the content extracted loses the whitespace, causing a 
> tokenization problem for indexing/searching.  
> I have attached the original document and the text output.  If you search 
> (Ctrl+f) the text document for "Another example".  Here you will see no space 
> after "is" and the Japanese text.  The same issue shows for 
> "whichmeans"eraser"" at the end of the sentence.  
> Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”
> I get the warning "WARNING: No Unicode mapping for CID+0 (0) in font 
> RGOFPX+IPAexMincho" during extraction but have been unable to find any 
> information on it.






[jira] [Comment Edited] (PDFBOX-3782) Text extraction loses whitespace

2017-05-10 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004860#comment-16004860
 ] 

Tilman Hausherr edited comment on PDFBOX-3782 at 5/10/17 3:31 PM:
--

Can you tell what part "resisted" the extraction with the modified parameter?


was (Author: tilman):
Can you tell what part "resisted" the extraction?

> Text extraction loses whitespace
> 
>
> Key: PDFBOX-3782
> URL: https://issues.apache.org/jira/browse/PDFBOX-3782
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.4, 2.0.5, 2.0.6
> Environment: Java/Tika
>Reporter: Tony Bray
>Priority: Minor
> Attachments: PDFBOX-3782-reduced.pdf, Test doc - Japanese writing 
> system - Kanji Hiragana Katakana.pdf, Test doc - Japanese writing system - 
> Kanji Hiragana Katakana.txt
>
>
> I have a PDF document that I am using Tika/PDFBox to extract the content.  In 
> several areas, the content extracted loses the whitespace, causing a 
> tokenization problem for indexing/searching.  
> I have attached the original document and the text output.  If you search 
> (Ctrl+f) the text document for "Another example".  Here you will see no space 
> after "is" and the Japanese text.  The same issue shows for 
> "whichmeans"eraser"" at the end of the sentence.  
> Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”
> I get the warning "WARNING: No Unicode mapping for CID+0 (0) in font 
> RGOFPX+IPAexMincho" during extraction but have been unable to find any 
> information on it.






[jira] [Commented] (PDFBOX-3782) Text extraction loses whitespace

2017-05-10 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004859#comment-16004859
 ] 

Tilman Hausherr commented on PDFBOX-3782:
-

Some background: in this font, the space width is not defined, so a default 
width is taken, which is 600 in this font. That is unusually large: the "0" has 
a width of 500, and in most fonts the space has a width of around 250. Because 
of that large space width, PDFBox assumes that the area between glyphs isn't a 
space.

Now you may argue: "I don't care, Adobe does it correctly so I want it here 
too".

Our algorithm does a lot of "magic" and changing it is risky, because it may 
degrade files that were good... I don't have any good idea right now, but will 
keep this open and add your file to my test set.
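
The effect of the space width can be sketched with a deliberately simplified rule: report a space when the gap between two glyphs exceeds some fraction of the font's space width. This is only an illustration under that assumption; the real text stripper combines several measurements, and the names below and the tolerance values in the example are hypothetical.

```java
public class SpaceHeuristicSketch {
    /**
     * Decides whether the horizontal gap between two glyphs counts as a space.
     * All widths are in the same units (for example 1/1000 of the font size).
     *
     * @param gap        distance between the end of one glyph and the start of the next
     * @param spaceWidth width the font reports (or defaults to) for the space character
     * @param tolerance  fraction of spaceWidth the gap must exceed
     */
    public static boolean isWordGap(double gap, double spaceWidth, double tolerance) {
        return gap > spaceWidth * tolerance;
    }
}
```

With a tolerance of 0.5, a gap of 200 is a space for a typical space width of 250 (threshold 125) but not for the default of 600 (threshold 300). Lowering the tolerance to 0.3 moves the threshold to 180 and the same gap is detected again, at the risk of reporting spaces inside loosely set words.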

> Text extraction loses whitespace
> 
>
> Key: PDFBOX-3782
> URL: https://issues.apache.org/jira/browse/PDFBOX-3782
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.4, 2.0.5, 2.0.6
> Environment: Java/Tika
>Reporter: Tony Bray
>Priority: Minor
> Attachments: PDFBOX-3782-reduced.pdf, Test doc - Japanese writing 
> system - Kanji Hiragana Katakana.pdf, Test doc - Japanese writing system - 
> Kanji Hiragana Katakana.txt
>
>
> I have a PDF document that I am using Tika/PDFBox to extract the content.  In 
> several areas, the content extracted loses the whitespace, causing a 
> tokenization problem for indexing/searching.  
> I have attached the original document and the text output.  If you search 
> (Ctrl+f) the text document for "Another example".  Here you will see no space 
> after "is" and the Japanese text.  The same issue shows for 
> "whichmeans"eraser"" at the end of the sentence.  
> Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”
> I get the warning "WARNING: No Unicode mapping for CID+0 (0) in font 
> RGOFPX+IPAexMincho" during extraction but have been unable to find any 
> information on it.






[jira] [Commented] (PDFBOX-3782) Text extraction loses whitespace

2017-05-10 Thread Tony Bray (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004855#comment-16004855
 ] 

Tony Bray commented on PDFBOX-3782:
---

Hi and thank you.  I tried extraction with the "setSpacingTolerance" and it 
helped but was not 100%.

> Text extraction loses whitespace
> 
>
> Key: PDFBOX-3782
> URL: https://issues.apache.org/jira/browse/PDFBOX-3782
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.4, 2.0.5, 2.0.6
> Environment: Java/Tika
>Reporter: Tony Bray
>Priority: Minor
> Attachments: PDFBOX-3782-reduced.pdf, Test doc - Japanese writing 
> system - Kanji Hiragana Katakana.pdf, Test doc - Japanese writing system - 
> Kanji Hiragana Katakana.txt
>
>
> I have a PDF document that I am using Tika/PDFBox to extract the content.  In 
> several areas, the content extracted loses the whitespace, causing a 
> tokenization problem for indexing/searching.  
> I have attached the original document and the text output.  If you search 
> (Ctrl+f) the text document for "Another example".  Here you will see no space 
> after "is" and the Japanese text.  The same issue shows for 
> "whichmeans"eraser"" at the end of the sentence.  
> Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”
> I get the warning "WARNING: No Unicode mapping for CID+0 (0) in font 
> RGOFPX+IPAexMincho" during extraction but have been unable to find any 
> information on it.






[jira] [Updated] (PDFBOX-3782) Text extraction loses whitespace

2017-05-10 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3782:

Affects Version/s: 2.0.6
   2.0.5

> Text extraction loses whitespace
> 
>
> Key: PDFBOX-3782
> URL: https://issues.apache.org/jira/browse/PDFBOX-3782
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.4, 2.0.5, 2.0.6
> Environment: Java/Tika
>Reporter: Tony Bray
>Priority: Minor
> Attachments: PDFBOX-3782-reduced.pdf, Test doc - Japanese writing 
> system - Kanji Hiragana Katakana.pdf, Test doc - Japanese writing system - 
> Kanji Hiragana Katakana.txt
>
>
> I have a PDF document that I am using Tika/PDFBox to extract the content.  In 
> several areas, the content extracted loses the whitespace, causing a 
> tokenization problem for indexing/searching.  
> I have attached the original document and the text output.  If you search 
> (Ctrl+f) the text document for "Another example".  Here you will see no space 
> after "is" and the Japanese text.  The same issue shows for 
> "whichmeans"eraser"" at the end of the sentence.  
> Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”
> I get the warning "WARNING: No Unicode mapping for CID+0 (0) in font 
> RGOFPX+IPAexMincho" during extraction but have been unable to find any 
> information on it.






[jira] [Updated] (PDFBOX-3782) Text extraction loses whitespace

2017-05-10 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3782:

Summary: Text extraction loses whitespace  (was: WARNING: No Unicode 
mapping for CID+0 (0) in font RGOFPX+IPAexMincho)

> Text extraction loses whitespace
> 
>
> Key: PDFBOX-3782
> URL: https://issues.apache.org/jira/browse/PDFBOX-3782
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.4, 2.0.5, 2.0.6
> Environment: Java/Tika
>Reporter: Tony Bray
>Priority: Minor
> Attachments: PDFBOX-3782-reduced.pdf, Test doc - Japanese writing 
> system - Kanji Hiragana Katakana.pdf, Test doc - Japanese writing system - 
> Kanji Hiragana Katakana.txt
>
>
> I have a PDF document that I am using Tika/PDFBox to extract the content.  In 
> several areas, the content extracted loses the whitespace, causing a 
> tokenization problem for indexing/searching.  
> I have attached the original document and the text output.  If you search 
> (Ctrl+f) the text document for "Another example".  Here you will see no space 
> after "is" and the Japanese text.  The same issue shows for 
> "whichmeans"eraser"" at the end of the sentence.  
> Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”
> I get the warning "WARNING: No Unicode mapping for CID+0 (0) in font 
> RGOFPX+IPAexMincho" during extraction but have been unable to find any 
> information on it.






[jira] [Updated] (PDFBOX-3782) WARNING: No Unicode mapping for CID+0 (0) in font RGOFPX+IPAexMincho

2017-05-10 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3782:

Component/s: (was: Parsing)
 Text extraction

> WARNING: No Unicode mapping for CID+0 (0) in font RGOFPX+IPAexMincho
> 
>
> Key: PDFBOX-3782
> URL: https://issues.apache.org/jira/browse/PDFBOX-3782
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.4
> Environment: Java/Tika
>Reporter: Tony Bray
>Priority: Minor
> Attachments: PDFBOX-3782-reduced.pdf, Test doc - Japanese writing 
> system - Kanji Hiragana Katakana.pdf, Test doc - Japanese writing system - 
> Kanji Hiragana Katakana.txt
>
>
> I have a PDF document that I am using Tika/PDFBox to extract the content.  In 
> several areas, the content extracted loses the whitespace, causing a 
> tokenization problem for indexing/searching.  
> I have attached the original document and the text output.  If you search 
> (Ctrl+f) the text document for "Another example".  Here you will see no space 
> after "is" and the Japanese text.  The same issue shows for 
> "whichmeans"eraser"" at the end of the sentence.  
> Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”
> I get the warning "WARNING: No Unicode mapping for CID+0 (0) in font 
> RGOFPX+IPAexMincho" during extraction but have been unable to find any 
> information on it.






Re: 2.0.6 release ?

2017-05-10 Thread Tilman Hausherr
Thanks for the test... the sum is still negative, but if we'd ignore the 
truncated files I bet we'd be positive.


I have downloaded a few of the regressions but won't create issues this 
time, as yesterday's turned out to be duplicates. I'll wait for Andreas' 
next commit and will create issues only if these aren't solved.

@Andreas - ping me if you didn't keep the "secret" URL.

Some misc thoughts...

039800.pdf: "refinery's" is a different token than refinery. Shouldn't 
"refinery's" be three tokens? I mention this because refinery is 
probably in a dictionary.


Some differences are because of a different treatment of the space in 
bad fonts. Some were improved, and some now look like this "C I T I E S 
W I T H O U T D R U G S". There is an open issue about these. It is 
tricky because if we treat these like 1 word, we'd also lose spaces 
where we don't want.


commoncrawl2/5N/5NSKV4CTVY4KT7R2FGY4XJDIK4PRLA4Z I can't find. I used 
http://XXX.XXX.XXX.XXX/docs/commoncrawl2/5N/5NSKV4CTVY4KT7R2FGY4XJDIK4PRLA4Z


Tilman

On 10.05.2017 at 11:42, Allison, Timothy B. wrote:
> Haven't had a chance to look. Reports are here:
> http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz




[jira] [Commented] (PDFBOX-3316) Add comment to PDF

2017-05-10 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004650#comment-16004650
 ] 

Tilman Hausherr commented on PDFBOX-3316:
-

You can't use the parser; you would have to read the content stream yourself 
with PDPage.getContents().
In the future please use the user mailing list, as your question is only 
loosely related to the (closed) issue. 
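
As a rough illustration of reading the content stream yourself: the sketch 
below scans the decoded bytes for % comments and skips % characters that sit 
inside literal strings. It is only a sketch under stated assumptions. The 
names ContentStreamComments and findComments are hypothetical and not part of 
PDFBox, and inline image data (BI ... ID ... EI), which may contain arbitrary 
bytes, is not handled. With PDFBox 2.x the InputStream would be the one 
returned by PDPage.getContents().

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class ContentStreamComments {

    /**
     * Collects the text of every % comment found in a decoded content stream.
     * Assumption: the caller passes the stream from PDPage.getContents().
     */
    static List<String> findComments(InputStream in) throws IOException {
        List<String> comments = new ArrayList<>();
        int depth = 0; // nesting depth of literal strings "( ... )"
        int c;
        while ((c = in.read()) != -1) {
            if (depth > 0) {
                // Inside a literal string a '%' is ordinary data.
                if (c == '\\') {
                    in.read(); // skip the escaped character
                } else if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    depth--;
                }
            } else if (c == '(') {
                depth = 1;
            } else if (c == '%') {
                // A comment runs up to the next end-of-line marker.
                StringBuilder sb = new StringBuilder();
                while ((c = in.read()) != -1 && c != '\n' && c != '\r') {
                    sb.append((char) c);
                }
                comments.add(sb.toString());
            }
        }
        return comments;
    }
}
```

A page whose list of comments contains the marker text is then a candidate 
for removal or replacement; how the stream is actually rewritten is left out 
here.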

> Add comment to PDF
> --
>
> Key: PDFBOX-3316
> URL: https://issues.apache.org/jira/browse/PDFBOX-3316
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Rendering
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 3.0.0
>Reporter: Jerrol Etheredge
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 2.0.2, 3.0.0
>
>
> For our application we use some comment texts (prepended by a %) to mark 
> content and perform text replacement.
> We currently use the appendRawCommands() method to add these, but this 
> method has been marked as deprecated since version 2.0.
> Would it be possible to add something like an addComment() method to 
> PDPageContentStream?
> The code would probably be something trivial like:
> public void addComment(String comment) {
> output.write("%" + comment + "\n");
> }






[jira] [Commented] (PDFBOX-3316) Add comment to PDF

2017-05-10 Thread Peter Pinnau (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004644#comment-16004644
 ] 

Peter Pinnau commented on PDFBOX-3316:
--

Is there a way to read such comments with PDFBox? I tried the PDFStreamParser 
but it seems to ignore % comments since they are not tokens.

I am looking for a way to identify content streams which contain a certain 
comment and remove those streams from the document.








RE: 2.0.6 release ?

2017-05-10 Thread Andreas Lehmkühler

> On 10 May 2017 at 11:42, "Allison, Timothy B." wrote:
> 
> 
> Haven't had a chance to look. Reports are here:
> http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz
Thanks for running the report again.

I had a quick look and there are 2 new exceptions. It seems to be a regression. 
I'm going to dig deeper later when I'm back home.

Here are 2 sample PDFs, one for each exception:
commoncrawl2/YV/YVFDWHF767TEYTT7IVFSLUIJTDF3YP57
commoncrawl2/5W/5WULWDW54DAQ4ORVJSACEE2KCXQ7PQLL

Andreas




[jira] [Updated] (PDFBOX-3512) PDFDebugger Mac App

2017-05-10 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-3512:
---
Fix Version/s: 2.0.7 (was: 2.0.6)

> PDFDebugger Mac App
> ---
>
> Key: PDFBOX-3512
> URL: https://issues.apache.org/jira/browse/PDFBOX-3512
> Project: PDFBox
>  Issue Type: New Feature
>  Components: Utilities
> Environment: Mac OS X
>Reporter: John Hewson
>Assignee: John Hewson
>Priority: Minor
> Fix For: 2.0.7, 3.0.0
>
>
> Using the PDFDebugger on the Mac isn't a great experience (see PDFBOX-3507). 
> We should package the jar into a native Mac .app bundle.






RE: 2.0.6 release ?

2017-05-10 Thread Allison, Timothy B.
Haven't had a chance to look. Reports are here:
http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz