[jira] [Commented] (PDFBOX-1792) Different metadata extracted with NonSequentialPDFParser vs classic parser on some documents

2013-12-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844041#comment-13844041
 ] 

Andreas Lehmkühler commented on PDFBOX-1792:


Hmmm, why did you disable the test? Everything works fine for me.

> Different metadata extracted with NonSequentialPDFParser vs classic parser on 
> some documents
> 
>
> Key: PDFBOX-1792
> URL: https://issues.apache.org/jira/browse/PDFBOX-1792
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 1.8.3
>Reporter: Tim Allison
>Priority: Minor
> Attachments: PDFBOX-1792.tar.gz, testPDF_acroForm2.pdf
>
>
> The traditional parser is able to extract metadata from a test document from 
> TIKA-738.  The NonSequentialPDFParser is not able to extract metadata from 
> that file.  Another file from the Tika test suite has metadata that can be 
> extracted by the NonSequentialPDFParser but not by classic. 



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (PDFBOX-1792) Different metadata extracted with NonSequentialPDFParser vs classic parser on some documents

2013-12-09 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-1792:


Attachment: testPDF_acroForm2.pdf

Classic parser can't extract metadata from testPDF_acroForm2.pdf, but 
NonSequentialPDFParser can extract metadata from it.

> Different metadata extracted with NonSequentialPDFParser vs classic parser on 
> some documents
> 
>
> Key: PDFBOX-1792
> URL: https://issues.apache.org/jira/browse/PDFBOX-1792
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 1.8.3
>Reporter: Tim Allison
>Priority: Minor
> Attachments: PDFBOX-1792.tar.gz, testPDF_acroForm2.pdf
>
>
> The traditional parser is able to extract metadata from a test document from 
> TIKA-738.  The NonSequentialPDFParser is not able to extract metadata from 
> that file.  Another file from the Tika test suite has metadata that can be 
> extracted by the NonSequentialPDFParser but not by classic. 





[jira] [Updated] (PDFBOX-1792) Different metadata extracted with NonSequentialPDFParser vs classic parser on some documents

2013-12-09 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-1792:


Description: The traditional parser is able to extract metadata from a test 
document from TIKA-738.  The NonSequentialPDFParser is not able to extract 
metadata from that file.  Another file from the Tika test suite has metadata 
that can be extracted by the NonSequentialPDFParser but not by classic.   (was: 
The traditional parser is able to extract metadata from the Annotation test 
document from TIKA-738.  The NonSequentialPDFParser is not able to extract 
metadata.)
Summary: Different metadata extracted with NonSequentialPDFParser vs 
classic parser on some documents  (was: Metadata not completely extracted with 
NonSequentialPDFParser on some documents)

> Different metadata extracted with NonSequentialPDFParser vs classic parser on 
> some documents
> 
>
> Key: PDFBOX-1792
> URL: https://issues.apache.org/jira/browse/PDFBOX-1792
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 1.8.3
>Reporter: Tim Allison
>Priority: Minor
> Attachments: PDFBOX-1792.tar.gz
>
>
> The traditional parser is able to extract metadata from a test document from 
> TIKA-738.  The NonSequentialPDFParser is not able to extract metadata from 
> that file.  Another file from the Tika test suite has metadata that can be 
> extracted by the NonSequentialPDFParser but not by classic. 





[jira] [Commented] (PDFBOX-1806) Metadata not completely extracted by traditional parser, but is extracted by NonSequentialParser

2013-12-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843914#comment-13843914
 ] 

Tim Allison commented on PDFBOX-1806:
-

Sorry about that.  It seemed like a different issue to me (in 1806 the fix 
should be in classic, whereas in 1792, the fix should be in NonSequential), but 
I see your point.  Will modify PDFBOX-1792 to describe a general "out of sync" 
issue and add the test file from this issue.  Thank you!

> Metadata not completely extracted by traditional parser, but is extracted by 
> NonSequentialParser
> 
>
> Key: PDFBOX-1806
> URL: https://issues.apache.org/jira/browse/PDFBOX-1806
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 1.8.3
>Reporter: Tim Allison
>Priority: Minor
> Attachments: testPDF_acroForm2.pdf
>
>






[jira] [Updated] (PDFBOX-1803) StringIndexOutOfBound on DateConverter.toCalendar

2013-12-09 Thread Fred Hansen (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fred Hansen updated PDFBOX-1803:


Attachment: PDFBOX-DateConverter-Trunk-fred.patch
PDFBOX-DateConverter-1.8-fred.patch

Proposed changes 
toCalendar(String) : return null for an empty string
toCalendar(String, String[]) : return dummy value for null (instead of 
returning null) ; added an example of supplying a format
toCalendar(COSString) : strengthened the deprecation
parseDate directly tests for null and empty string

Improved JavaDoc for these methods

TestDateUtil
changed testExtract to testToCalendar and incorporated into it test for null 
and empty strings.  Test the new example of toCalendar(String, String[])
added tests for null and empty strings



> StringIndexOutOfBound on DateConverter.toCalendar
> -
>
> Key: PDFBOX-1803
> URL: https://issues.apache.org/jira/browse/PDFBOX-1803
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel, Utilities
>Affects Versions: 1.8.3
>Reporter: Eric Leleu
>Priority: Minor
> Attachments: PDFBOX-DateConverter-1.8-fred.patch, 
> PDFBOX-DateConverter-Trunk-fred.patch, PDFBox-DateConverter-Br18.patch, 
> PDFBox-DateConverter-Trunk.patch
>
>
> Some PDFs have an empty string as CreationDate & ModDate in the Information 
> Dictionary.
> According to the PDF specification, these two elements are optional.
> My first fix was to test for null & the empty string in the 
> toCalendar(String, String[]) method and to return null if either 
> condition holds.
> But according to a test case (TestDateUtil), a NullPointerException is 
> expected on a null value of text. Can you explain why this behaviour has 
> been adopted?
> To fix this unexpected exception in my execution path, I have added a test 
> for the empty string in the deprecated method toCalendar(String). (Patch in 
> attachment)
> I'm waiting for your comments before committing this patch (or replacing it 
> with my first implementation)
> BR,
> Eric





Problem commiting txt resources to svn

2013-12-09 Thread Thomas Chojecki

Hello,
has anyone had similar problems committing text files? In my case I'm  
using a Linux box and the file is UTF-16 LE encoded. I configured my  
SVN client as described in the beginner's guide and added the content  
of http://www.apache.org/dev/svn-eol-style.txt to the config.


My svn client throws this (shortened) error:
svn: E29: Kann »svn:eol-style« nicht setzen: Datei  
».../testAnnotations.pdf-sorted.txt« hat die MIME-Typ Eigenschaft  
»binär«


In English it should be something like this:
svn: E29: Cannot set 'svn:eol-style': file  
'.../testAnnotations.pdf-sorted.txt' has binary mime type property


Do I need to add svn:mime-type=text/plain or something else to the  
config for *.txt files?


Best regards
Thomas



[jira] [Commented] (PDFBOX-1792) Metadata not completely extracted with NonSequentialPDFParser on some documents

2013-12-09 Thread Thomas Chojecki (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843625#comment-13843625
 ] 

Thomas Chojecki commented on PDFBOX-1792:
-

I've run the test and there are some parsing problems with the existing 
test files in both parsers. I will commit it (to the pdfbox 1.8.x branch) and 
rename the test so that it is not run automatically. Additionally, it would be 
great to use JUnit 4 instead of 3, so that such tests can be ignored using the 
@Ignore annotation. 


> Metadata not completely extracted with NonSequentialPDFParser on some 
> documents
> ---
>
> Key: PDFBOX-1792
> URL: https://issues.apache.org/jira/browse/PDFBOX-1792
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 1.8.3
>Reporter: Tim Allison
>Priority: Minor
> Attachments: PDFBOX-1792.tar.gz
>
>
> The traditional parser is able to extract metadata from the Annotation test 
> document from TIKA-738.  The NonSequentialPDFParser is not able to extract 
> metadata.





[jira] [Resolved] (PDFBOX-1806) Metadata not completely extracted by traditional parser, but is extracted by NonSequentialParser

2013-12-09 Thread Thomas Chojecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Chojecki resolved PDFBOX-1806.
-

Resolution: Duplicate

Please do not open new issues for known problems. Use the existing one for file 
uploads.

Both parsers work differently and we try our best to keep them in sync. So if 
you have new files or more information, just comment in PDFBOX-1792 or 
edit the description if necessary.

> Metadata not completely extracted by traditional parser, but is extracted by 
> NonSequentialParser
> 
>
> Key: PDFBOX-1806
> URL: https://issues.apache.org/jira/browse/PDFBOX-1806
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 1.8.3
>Reporter: Tim Allison
>Priority: Minor
> Attachments: testPDF_acroForm2.pdf
>
>






[jira] [Commented] (PDFBOX-1803) StringIndexOutOfBound on DateConverter.toCalendar

2013-12-09 Thread Fred Hansen (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843539#comment-13843539
 ] 

Fred Hansen commented on PDFBOX-1803:
-

Upon careful consideration, a different change seems right for the new 
toCalendar(String, String[]). 

For discussion, we can construct a matrix of possible error conditions and 
treatments: For each condition (null, empty string, illegal date) what should 
be the response (exception, return null, return a dummy value with an illegal 
year)?

h3. toCalendar(String) _deprecated method_
For compatibility, as this bug report suggests, {{null}} should be returned for 
an empty string argument:
||*Condition*||*current result*||*proposed result*||
|For null input|{{null}}| |
|For empty string|-exception-|+{{null}}+|
|For illegal date|exception| |


h3. toCalendar (String, String[]) _replacement method_
Rather than ever producing a non-Calendar value, I propose to revise this 
method:
||*Condition*||*current result*||*proposed result*||
|For null input|-{{null}}-|+dummy date+|
|For empty string|dummy date| |
|For illegal date|dummy date| |

Unless there are objections, I will revise toCalendar(String, String[]) so it 
returns a dummy Calendar in all cases.
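
To make the first table concrete, here is a minimal sketch of the proposed 
behaviour for the deprecated toCalendar(String). It is not the DateConverter 
code: the class name DateGuardSketch, the single assumed date layout 
("D:" prefix plus yyyyMMddHHmmss), the treatment of whitespace-only input as 
empty, and the choice of IllegalArgumentException for the "exception" row are 
all assumptions made for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;

// Hypothetical stand-in for discussion; this is NOT PDFBox's DateConverter.
final class DateGuardSketch {

    // Assumed input shape: an optional "D:" prefix followed by yyyyMMddHHmmss.
    static Calendar toCalendar(String text) {
        // null and empty input both yield null, as in the first table.
        // Treating whitespace-only input as empty is an extra assumption.
        if (text == null || text.trim().isEmpty()) {
            return null;
        }
        String digits = text.trim();
        if (digits.startsWith("D:")) {
            digits = digits.substring(2);
        }
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss");
        format.setLenient(false);
        try {
            Calendar calendar = new GregorianCalendar();
            calendar.setTime(format.parse(digits));
            return calendar;
        } catch (ParseException e) {
            // An illegal date still raises an exception (exception type assumed).
            throw new IllegalArgumentException("Error converting date: " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toCalendar(null));
        System.out.println(toCalendar(""));
        System.out.println(toCalendar("D:20131209120000").get(Calendar.YEAR));
    }
}
```

The second table would differ only in the guard: instead of returning null, 
the method would return whatever dummy Calendar the implementation defines.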





> StringIndexOutOfBound on DateConverter.toCalendar
> -
>
> Key: PDFBOX-1803
> URL: https://issues.apache.org/jira/browse/PDFBOX-1803
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel, Utilities
>Affects Versions: 1.8.3
>Reporter: Eric Leleu
>Priority: Minor
> Attachments: PDFBox-DateConverter-Br18.patch, 
> PDFBox-DateConverter-Trunk.patch
>
>
> Some PDFs have an empty string as CreationDate & ModDate in the Information 
> Dictionary.
> According to the PDF specification, these two elements are optional.
> My first fix was to test for null & the empty string in the 
> toCalendar(String, String[]) method and to return null if either 
> condition holds.
> But according to a test case (TestDateUtil), a NullPointerException is 
> expected on a null value of text. Can you explain why this behaviour has 
> been adopted?
> To fix this unexpected exception in my execution path, I have added a test 
> for the empty string in the deprecated method toCalendar(String). (Patch in 
> attachment)
> I'm waiting for your comments before committing this patch (or replacing it 
> with my first implementation)
> BR,
> Eric





[jira] [Updated] (PDFBOX-1806) Metadata not completely extracted by traditional parser, but is extracted by NonSequentialParser

2013-12-09 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-1806:


Attachment: testPDF_acroForm2.pdf

Example file attached.  Test case code in PDFBOX-1792 reveals this issue.


> Metadata not completely extracted by traditional parser, but is extracted by 
> NonSequentialParser
> 
>
> Key: PDFBOX-1806
> URL: https://issues.apache.org/jira/browse/PDFBOX-1806
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 1.8.3
>Reporter: Tim Allison
>Priority: Minor
> Attachments: testPDF_acroForm2.pdf
>
>






[jira] [Created] (PDFBOX-1806) Metadata not completely extracted by traditional parser, but is extracted by NonSequentialParser

2013-12-09 Thread Tim Allison (JIRA)
Tim Allison created PDFBOX-1806:
---

 Summary: Metadata not completely extracted by traditional parser, 
but is extracted by NonSequentialParser
 Key: PDFBOX-1806
 URL: https://issues.apache.org/jira/browse/PDFBOX-1806
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel
Affects Versions: 1.8.3
Reporter: Tim Allison
Priority: Minor








Re: Building/enhancing a test suite for PDFBox

2013-12-09 Thread Maruan Sahyoun
Yes,

that’s my observation too. In addition Bavarian deals with positive documents 
too whereas Isartor only has false documents (from a PDF/A perspective). So 
it’s more generic. 

Maruan Sahyoun


Am 09.12.2013 um 17:12 schrieb Guillaume Bailleul :

> Hi,
> 
> what is in place for PDF/A validation is too specific, as you said, we
> only expect an error code (as we only validate isartor files). Bavaria
> Test suite contains a format where conforming and non conforming are
> handled, it is IMO a better source of inspiration.
> 
> BR,
> 
> Guillaume
> 
> On Mon, Dec 9, 2013 at 4:32 PM, Maruan Sahyoun  wrote:
>> Hi,
>> 
>> I fully agree that the target should be to have automated tests. Without that, the 
>> benefit will be limited. As for error codes/messages we could 
>> reuse/generalize what’s in place for the PDF/A validator. Bavarian test 
>> suite from pdflib also has a good set of test/result descriptions.
>> 
>> BR
>> Maruan Sahyoun
>> 
>> Am 09.12.2013 um 16:00 schrieb Timo Boehme :
>> 
>>> Hi,
>>> 
>>> this would be a valuable resource, especially if the test can be automated 
>>> - thus we need to somehow specify the expected result (exception, warning, 
>>> result document/text) for automated processing. Maybe we should start using 
>>> error codes?
>>> 
>>> 
>>> Best,
>>> Timo
>>> 
>>> 
>>> 
>>> Am 08.12.2013 15:43, schrieb Maruan Sahyoun:
>>>> Hi,
>>>> 
>>>> as we are handling and closing issues using PDFs provided by users of the 
>>>> library what do you think about adding these files to a test suite if 
>>>> these can be used to check for a behavior of handling specific issues.
>>>> 
>>>> The benefit would be that we can write tests around these issues to ensure 
>>>> that forthcoming releases are still able to handle these files.
>>>> 
 An idea for a naming convention would be something like >>> number> e.g. 1769-invalid_xref.pdf
 
>>>> WDYT
>>>> 
>>>> Maruan Sahyoun
>>>> 
>>> 
>>> 
>>> --
>>> 
>>> Timo Boehme
>>> OntoChem GmbH
>>> H.-Damerow-Str. 4
>>> 06120 Halle/Saale
>>> T: +49 345 4780474
>>> F: +49 345 4780471
>>> timo.boe...@ontochem.com
>>> 
>>> _
>>> 
>>> OntoChem GmbH
>>> Geschäftsführer: Dr. Lutz Weber
>>> Sitz: Halle / Saale
>>> Registergericht: Stendal
>>> Registernummer: HRB 215461
>>> _
>>> 
>> 



Re: Building/enhancing a test suite for PDFBox

2013-12-09 Thread Guillaume Bailleul
Hi,

what is in place for PDF/A validation is too specific, as you said, we
only expect an error code (as we only validate isartor files). Bavaria
Test suite contains a format where conforming and non conforming are
handled, it is IMO a better source of inspiration.

BR,

Guillaume

On Mon, Dec 9, 2013 at 4:32 PM, Maruan Sahyoun  wrote:
> Hi,
>
> I fully agree that the target should be to have automated tests. Without that, the 
> benefit will be limited. As for error codes/messages we could 
> reuse/generalize what’s in place for the PDF/A validator. Bavarian test suite 
> from pdflib also has a good set of test/result descriptions.
>
> BR
> Maruan Sahyoun
>
> Am 09.12.2013 um 16:00 schrieb Timo Boehme :
>
>> Hi,
>>
>> this would be a valuable resource, especially if the test can be automated - 
>> thus we need to somehow specify the expected result (exception, warning, 
>> result document/text) for automated processing. Maybe we should start using 
>> error codes?
>>
>>
>> Best,
>> Timo
>>
>>
>>
>> Am 08.12.2013 15:43, schrieb Maruan Sahyoun:
>>> Hi,
>>>
>>> as we are handling and closing issues using PDFs provided by users of the 
>>> library what do you think about adding these files to a test suite if these 
>>> can be used to check for a behavior of handling specific issues.
>>>
>>> The benefit would be that we can write tests around these issues to ensure 
>>> that forthcoming releases are still able to handle these files.
>>>
>>> An idea for a naming convention would be something like >> number> e.g. 1769-invalid_xref.pdf
>>>
>>> WDYT
>>>
>>> Maruan Sahyoun
>>>
>>
>>
>> --
>>
>> Timo Boehme
>> OntoChem GmbH
>> H.-Damerow-Str. 4
>> 06120 Halle/Saale
>> T: +49 345 4780474
>> F: +49 345 4780471
>> timo.boe...@ontochem.com
>>
>> _
>>
>> OntoChem GmbH
>> Geschäftsführer: Dr. Lutz Weber
>> Sitz: Halle / Saale
>> Registergericht: Stendal
>> Registernummer: HRB 215461
>> _
>>
>


[jira] [Commented] (PDFBOX-1803) StringIndexOutOfBound on DateConverter.toCalendar

2013-12-09 Thread Fred Hansen (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843250#comment-13843250
 ] 

Fred Hansen commented on PDFBOX-1803:
-

Yes, I missed the empty string case in toCalendar(String).

The proposed patch is fine, as far as it goes. I will produce an extended patch 
that incorporates the proposal, amends the JavaDoc, and also does both for the 
new toCalendar(String, String[]).

In addition, I'll add words to the JavaDoc for toCalendar(COSString). This 
method needs to be completely removed if DateConverter is to be part of a 
utility package that does not depend on org.apache.pdfbox.



> StringIndexOutOfBound on DateConverter.toCalendar
> -
>
> Key: PDFBOX-1803
> URL: https://issues.apache.org/jira/browse/PDFBOX-1803
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel, Utilities
>Affects Versions: 1.8.3
>Reporter: Eric Leleu
>Priority: Minor
> Attachments: PDFBox-DateConverter-Br18.patch, 
> PDFBox-DateConverter-Trunk.patch
>
>
> Some PDFs have an empty string as CreationDate & ModDate in the Information 
> Dictionary.
> According to the PDF specification, these two elements are optional.
> My first fix was to test for null & the empty string in the 
> toCalendar(String, String[]) method and to return null if either 
> condition holds.
> But according to a test case (TestDateUtil), a NullPointerException is 
> expected on a null value of text. Can you explain why this behaviour has 
> been adopted?
> To fix this unexpected exception in my execution path, I have added a test 
> for the empty string in the deprecated method toCalendar(String). (Patch in 
> attachment)
> I'm waiting for your comments before committing this patch (or replacing it 
> with my first implementation)
> BR,
> Eric





Re: Building/enhancing a test suite for PDFBox

2013-12-09 Thread Maruan Sahyoun
Hi,

I fully agree that the target should be to have automated tests. Without that, the 
benefit will be limited. As for error codes/messages we could reuse/generalize 
what’s in place for the PDF/A validator. Bavarian test suite from pdflib also 
has a good set of test/result descriptions.

BR
Maruan Sahyoun

Am 09.12.2013 um 16:00 schrieb Timo Boehme :

> Hi,
> 
> this would be a valuable resource, especially if the test can be automated - 
> thus we need to somehow specify the expected result (exception, warning, 
> result document/text) for automated processing. Maybe we should start using 
> error codes?
> 
> 
> Best,
> Timo
> 
> 
> 
> Am 08.12.2013 15:43, schrieb Maruan Sahyoun:
>> Hi,
>> 
>> as we are handling and closing issues using PDFs provided by users of the 
>> library what do you think about adding these files to a test suite if these 
>> can be used to check for a behavior of handling specific issues.
>> 
>> The benefit would be that we can write tests around these issues to ensure 
>> that forthcoming releases are still able to handle these files.
>> 
>> An idea for a naming convention would be something like > description> e.g. 1769-invalid_xref.pdf
>> 
>> WDYT
>> 
>> Maruan Sahyoun
>> 
> 
> 
> -- 
> 
> Timo Boehme
> OntoChem GmbH
> H.-Damerow-Str. 4
> 06120 Halle/Saale
> T: +49 345 4780474
> F: +49 345 4780471
> timo.boe...@ontochem.com
> 
> _
> 
> OntoChem GmbH
> Geschäftsführer: Dr. Lutz Weber
> Sitz: Halle / Saale
> Registergericht: Stendal
> Registernummer: HRB 215461
> _
> 



[jira] [Commented] (PDFBOX-1803) StringIndexOutOfBound on DateConverter.toCalendar

2013-12-09 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843232#comment-13843232
 ] 

Tilman Hausherr commented on PDFBOX-1803:
-

You might want to ask Fred Hansen, see PDFBOX-1633.

> StringIndexOutOfBound on DateConverter.toCalendar
> -
>
> Key: PDFBOX-1803
> URL: https://issues.apache.org/jira/browse/PDFBOX-1803
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel, Utilities
>Affects Versions: 1.8.3
>Reporter: Eric Leleu
>Priority: Minor
> Attachments: PDFBox-DateConverter-Br18.patch, 
> PDFBox-DateConverter-Trunk.patch
>
>
> Some PDFs have an empty string as CreationDate & ModDate in the Information 
> Dictionary.
> According to the PDF specification, these two elements are optional.
> My first fix was to test for null & the empty string in the 
> toCalendar(String, String[]) method and to return null if either 
> condition holds.
> But according to a test case (TestDateUtil), a NullPointerException is 
> expected on a null value of text. Can you explain why this behaviour has 
> been adopted?
> To fix this unexpected exception in my execution path, I have added a test 
> for the empty string in the deprecated method toCalendar(String). (Patch in 
> attachment)
> I'm waiting for your comments before committing this patch (or replacing it 
> with my first implementation)
> BR,
> Eric





Re: Building/enhancing a test suite for PDFBox

2013-12-09 Thread Timo Boehme

Hi,

this would be a valuable resource, especially if the test can be 
automated - thus we need to somehow specify the expected result 
(exception, warning, result document/text) for automated processing. 
Maybe we should start using error codes?
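
As a rough sketch of how an expected result could be attached to a test file 
for automated processing: the class RegressionCase, the Expected enum and the 
assumed file name shape (digits, a dash, a description, ".pdf", as in the 
example 1769-invalid_xref.pdf from this thread) are hypothetical and only 
meant to start the discussion.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; every name here is illustrative only.
final class RegressionCase {

    // The kinds of expected result mentioned above.
    enum Expected { EXCEPTION, WARNING, RESULT }

    // Assumed file name shape: <digits>-<description>.pdf
    private static final Pattern NAME = Pattern.compile("(\\d+)-(.+)\\.pdf");

    final int issueNumber;
    final String description;
    final Expected expected;

    RegressionCase(String fileName, Expected expected) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected test file name: " + fileName);
        }
        this.issueNumber = Integer.parseInt(m.group(1));
        this.description = m.group(2);
        this.expected = expected;
    }

    public static void main(String[] args) {
        // The expected outcome is chosen arbitrarily for this illustration.
        RegressionCase c = new RegressionCase("1769-invalid_xref.pdf", Expected.WARNING);
        System.out.println("PDFBOX-" + c.issueNumber + " " + c.description + " " + c.expected);
    }
}
```

A test runner could then iterate over such cases and compare what actually 
happens for each file with the recorded expectation.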



Best,
Timo



Am 08.12.2013 15:43, schrieb Maruan Sahyoun:

Hi,

as we are handling and closing issues using PDFs provided by users of the 
library what do you think about adding these files to a test suite if these can 
be used to check for a behavior of handling specific issues.

The benefit would be that we can write tests around these issues to ensure that 
forthcoming releases are still able to handle these files.

An idea for a naming convention would be something like  e.g. 1769-invalid_xref.pdf

WDYT

Maruan Sahyoun




--

 Timo Boehme
 OntoChem GmbH
 H.-Damerow-Str. 4
 06120 Halle/Saale
 T: +49 345 4780474
 F: +49 345 4780471
 timo.boe...@ontochem.com

_

 OntoChem GmbH
 Geschäftsführer: Dr. Lutz Weber
 Sitz: Halle / Saale
 Registergericht: Stendal
 Registernummer: HRB 215461
_



[jira] [Updated] (PDFBOX-1805) PDFTextStripper, add word segment even if the last word is a space

2013-12-09 Thread Andy Phillips (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Phillips updated PDFBOX-1805:
--

Description: 
I found that, in some PDFs, not injecting a word separator when the gap in a 
line is greater than expected for a space (in the "line" normalization) causes 
text "fields" that should be separated (as they are not really part of the 
paragraph) to be improperly added to the line of text.  

In the attached PDF, looking at the first line of the first code violation, I 
found that the "Corrected By" date is incorrectly added to the same line as 
"Description of Violation". This happens because the first line of 
"Description of Violation" ends with a space, a result of word wrapping when 
the paragraph was generated. I believe that if the gap before the next letter 
in the line is greater than an expected space, then regardless of whether the 
preceding text ends in a space, it should be considered a second segment.

I suggest the following change in the PDFTextStripper file (I commented out 
the last two conditions of the if statement):

    // Test if our TextPosition starts after a new word would be expected to start.
    if (expectedStartOfNextWordX != EXPECTEDSTARTOFNEXTWORDX_RESET_VALUE
            && expectedStartOfNextWordX < positionX) /* &&
            // only bother adding a space if the last character was not a space
            lastPosition.getTextPosition().getCharacter() != null &&
            !lastPosition.getTextPosition().getCharacter().endsWith( " " ) ) */
    {
        line.add(WordSeparator.getSeparator());
    }



  was:
I found that, in some PDFs, not injecting a WordSpacing in a line that is 
greater than expected for a space in the "line" normalization, causes text 
"fields" that should be separated (as they are not really part of the 
paragraph) to be improperly added to the line of text.  

In the attached pdf, i have found that looking at the first line of the first 
violation of code, that the "Corrected By" date is incorrectly added to the 
same line of Description of Violation.   This is due to the fact that the first 
line of "Description of Violation" ends with a space.   This is due to word 
wrapping of the paragraph when it was generated and i believe that if the next 
letter in the line is greater than an expected space, regardless if the last 
line ends in a space, it should be considered a second segment.





> PDFTextStripper, add word segment even if the last word is a space
> --
>
> Key: PDFBOX-1805
> URL: https://issues.apache.org/jira/browse/PDFBOX-1805
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.3
>Reporter: Andy Phillips
> Attachments: 36048C3D-54B9-4862-91AA-C94B33C11027.pdf
>
>
> I found that, in some PDFs, not injecting a word separator when the gap in a 
> line is greater than expected for a space (in the "line" normalization) 
> causes text "fields" that should be separated (as they are not really part 
> of the paragraph) to be improperly added to the line of text.  
> In the attached PDF, looking at the first line of the first code violation, I 
> found that the "Corrected By" date is incorrectly added to the same line as 
> "Description of Violation". This happens because the first line of 
> "Description of Violation" ends with a space, a result of word wrapping when 
> the paragraph was generated. I believe that if the gap before the next letter 
> in the line is greater than an expected space, then regardless of whether the 
> preceding text ends in a space, it should be considered a second segment.
> I suggest the following change in the PDFTextStripper file (I commented out 
> the last two conditions of the if statement):
>     // Test if our TextPosition starts after a new word would be expected to start.
>     if (expectedStartOfNextWordX != EXPECTEDSTARTOFNEXTWORDX_RESET_VALUE
>             && expectedStartOfNextWordX < positionX) /* &&
>             // only bother adding a space if the last character was not a space
>             lastPosition.getTextPosition().getCharacter() != null &&
>             !lastPosition.getTextPosition().getCharacter().endsWith( " " ) ) */
>     {
>         line.add(WordSeparator.getSeparator());
>     }





[jira] [Updated] (PDFBOX-1805) PDFTextStripper, add word segment even if the last word is a space

2013-12-09 Thread Andy Phillips (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Phillips updated PDFBOX-1805:
--

Attachment: 36048C3D-54B9-4862-91AA-C94B33C11027.pdf

> PDFTextStripper, add word segment even if the last word is a space
> --
>
> Key: PDFBOX-1805
> URL: https://issues.apache.org/jira/browse/PDFBOX-1805
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.3
>Reporter: Andy Phillips
> Attachments: 36048C3D-54B9-4862-91AA-C94B33C11027.pdf
>
>
> I found that, in some PDFs, not injecting a word separator when the gap in a 
> line is greater than expected for a space (in the "line" normalization) 
> causes text "fields" that should be separated (as they are not really part 
> of the paragraph) to be improperly added to the line of text.  
> In the attached PDF, looking at the first line of the first code violation, I 
> found that the "Corrected By" date is incorrectly added to the same line as 
> "Description of Violation". This happens because the first line of 
> "Description of Violation" ends with a space, a result of word wrapping when 
> the paragraph was generated. I believe that if the gap before the next letter 
> in the line is greater than an expected space, then regardless of whether the 
> preceding text ends in a space, it should be considered a second segment.





[jira] [Created] (PDFBOX-1805) PDFTextStripper, add word segment even if the last word is a space

2013-12-09 Thread Andy Phillips (JIRA)
Andy Phillips created PDFBOX-1805:
-

 Summary: PDFTextStripper, add word segment even if the last word 
is a space
 Key: PDFBOX-1805
 URL: https://issues.apache.org/jira/browse/PDFBOX-1805
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.3
Reporter: Andy Phillips


In some PDFs, the "line" normalization does not inject a WordSpacing when a 
gap within a line is greater than expected for a space. This causes text 
"fields" that should be separated (as they are not really part of the 
paragraph) to be improperly added to the line of text.

In the attached PDF, in the first line of the first code violation, the 
"Corrected By" date is incorrectly added to the same line as "Description of 
Violation". This happens because the first line of "Description of Violation" 
ends with a space, a result of word wrapping of the paragraph when the PDF was 
generated. I believe that if the gap before the next letter in the line is 
greater than an expected space, it should be considered a second segment, 
regardless of whether the preceding text ends in a space.
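
The proposed rule can be sketched in isolation. The following is only an illustration under stated assumptions: GapSegmenter, Glyph and expectedSpaceWidth are hypothetical names, not PDFBox classes or fields, and the real PDFTextStripper logic is more involved. It shows a segment break being decided by the horizontal gap alone, regardless of whether the preceding text ends in a space.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; none of these types are PDFBox classes. */
final class GapSegmenter {

    /** A positioned piece of text: its characters, left edge and width. */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Splits one visual line into segments. A new segment starts whenever the
     * horizontal gap to the previous glyph exceeds expectedSpaceWidth, even if
     * the previous glyph is itself a space character.
     */
    static List<String> segments(List<Glyph> line, float expectedSpaceWidth) {
        List<String> result = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph glyph : line) {
            if (previous != null) {
                float gap = glyph.x - (previous.x + previous.width);
                if (gap > expectedSpaceWidth && current.length() > 0) {
                    result.add(current.toString());
                    current = new StringBuilder();
                }
            }
            current.append(glyph.text);
            previous = glyph;
        }
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }
}
```

With a paragraph fragment that ends in a space followed by a date placed far to the right, this sketch yields two segments instead of one merged line.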








[jira] [Created] (PDFBOX-1804) PDFTextStripper Issue related to word positions not correctly being parsed

2013-12-09 Thread Andy Phillips (JIRA)
Andy Phillips created PDFBOX-1804:
-

 Summary: PDFTextStripper Issue related to word positions not 
correctly being parsed
 Key: PDFBOX-1804
 URL: https://issues.apache.org/jira/browse/PDFBOX-1804
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.3
Reporter: Andy Phillips


While pulling text from a PDF with a custom PDFTextStripper subclass that 
overrides writeString(String text, List<TextPosition> textPositions), I was 
getting textPositions that did not line up with the text. The text positions 
of all "words" in a line always come through as the text positions of the 
last word in the line. I traced the issue to the PDFTextStripper class.

Here is the code at issue:

/**
 * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
 * @return The StringBuilder that must be used when calling this method.
 */
private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
        StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
{
    if (text instanceof WordSeparator)
    {
        normalized.add(createWord(lineBuilder.toString(), wordPositions));
        lineBuilder = new StringBuilder();
        wordPositions.clear();
    }
    else
    {
        lineBuilder.append(text.getCharacter());
        wordPositions.add(text);
    }
    return lineBuilder;
}


In the normalizeAdd method, a new word is created by passing wordPositions. 
A reference to wordPositions is stored in the new WordWithTextPositions in the 
normalized linked list, but two lines later the same list is cleared with 
clear(). Since wordPositions was passed by reference, the list held by the 
WordWithTextPositions that was just created is cleared as well.

So I would suggest the following:
/**
 * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
 * @return The StringBuilder that must be used when calling this method.
 */
private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
        StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
{
    if (text instanceof WordSeparator)
    {
        normalized.add(createWord(lineBuilder.toString(),
                new ArrayList<TextPosition>(wordPositions)));
        lineBuilder = new StringBuilder();
        wordPositions.clear();
    }
    else
    {
        lineBuilder.append(text.getCharacter());
        wordPositions.add(text);
    }
    return lineBuilder;
}
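
The aliasing problem described above can be reproduced without PDFBox. The sketch below is a simplified stand-in, not the PDFTextStripper code: AliasingDemo, Word and split are hypothetical names, and character indices play the role of TextPosition objects. It contrasts storing the shared list with storing a defensive copy.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the word/position bookkeeping; not PDFBox code. */
final class AliasingDemo {

    /** A word together with the list of positions it was built from. */
    static final class Word {
        final String text;
        final List<Integer> positions;

        Word(String text, List<Integer> positions) {
            this.text = text;
            this.positions = positions;
        }
    }

    /**
     * Splits a line on single spaces. When defensiveCopy is false, every Word
     * keeps a reference to the one working list, which is cleared afterwards.
     */
    static List<Word> split(String line, boolean defensiveCopy) {
        List<Word> words = new ArrayList<Word>();
        StringBuilder builder = new StringBuilder();
        List<Integer> positions = new ArrayList<Integer>();
        for (int i = 0; i <= line.length(); i++) {
            boolean separator = i == line.length() || line.charAt(i) == ' ';
            if (separator) {
                List<Integer> stored =
                        defensiveCopy ? new ArrayList<Integer>(positions) : positions;
                words.add(new Word(builder.toString(), stored));
                builder = new StringBuilder();
                positions.clear(); // also empties every Word that shares this list
            } else {
                builder.append(line.charAt(i));
                positions.add(i);
            }
        }
        return words;
    }
}
```

Without the copy, all words share one list whose contents reflect only the latest mutation; with the copy, each word keeps the positions it was created with.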






[jira] [Updated] (PDFBOX-1803) StringIndexOutOfBound on DateConverter.toCalendar

2013-12-09 Thread Eric Leleu (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Leleu updated PDFBOX-1803:
---

Description: 
Some PDFs have an empty string as CreationDate and ModDate in the Information 
Dictionary.

According to the PDF specification, these two entries are optional.

My first fix was to check for null and the empty string in the 
toCalendar(String, String[]) method and return null if either condition holds. 
But according to a test case (TestDateUtil), a NullPointerException is expected 
when text is null. Can you explain why this behaviour was adopted?

To fix this unexpected exception in my execution path, I have added a check for 
the empty string in the deprecated method toCalendar(String). (Patch attached.)

I'm waiting for your comments before committing this patch (or replacing it 
with my first implementation).

BR,
Eric

  was:
Some PDFs have an empty string as CreationDate and ModDate in the Information 
Dictionary.

According to the PDF specification, these two entries are optional.

My first fix was to check for null and the empty string in the 
toCalendar(String, String[]) method and return null if either condition holds. 
But according to a test case (TestDateUtil), a NullPointerException is expected 
when text is null. Can you explain why this behaviour was adopted?

To fix this unexpected exception in my execution path, I have added a check for 
the empty string in the deprecated method toCalendar(String). (Patch attached.)

I'm waiting for your comments before committing this patch (or replacing it 
with my first implementation).


> StringIndexOutOfBound on DateConverter.toCalendar
> -
>
> Key: PDFBOX-1803
> URL: https://issues.apache.org/jira/browse/PDFBOX-1803
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel, Utilities
>Affects Versions: 1.8.3
>Reporter: Eric Leleu
>Priority: Minor
> Attachments: PDFBox-DateConverter-Br18.patch, 
> PDFBox-DateConverter-Trunk.patch
>
>
> Some PDFs have an empty string as CreationDate and ModDate in the Information 
> Dictionary.
> According to the PDF specification, these two entries are optional.
> My first fix was to check for null and the empty string in the 
> toCalendar(String, String[]) method and return null if either condition 
> holds.
> But according to a test case (TestDateUtil), a NullPointerException is 
> expected when text is null. Can you explain why this behaviour was adopted?
> To fix this unexpected exception in my execution path, I have added a check 
> for the empty string in the deprecated method toCalendar(String). (Patch 
> attached.)
> I'm waiting for your comments before committing this patch (or replacing it 
> with my first implementation).
> BR,
> Eric





[jira] [Updated] (PDFBOX-1803) StringIndexOutOfBound on DateConverter.toCalendar

2013-12-09 Thread Eric Leleu (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Leleu updated PDFBOX-1803:
---

Attachment: PDFBox-DateConverter-Trunk.patch
PDFBox-DateConverter-Br18.patch

> StringIndexOutOfBound on DateConverter.toCalendar
> -
>
> Key: PDFBOX-1803
> URL: https://issues.apache.org/jira/browse/PDFBOX-1803
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel, Utilities
>Affects Versions: 1.8.3
>Reporter: Eric Leleu
>Priority: Minor
> Attachments: PDFBox-DateConverter-Br18.patch, 
> PDFBox-DateConverter-Trunk.patch
>
>
> Some PDFs have an empty string as CreationDate and ModDate in the Information 
> Dictionary.
> According to the PDF specification, these two entries are optional.
> My first fix was to check for null and the empty string in the 
> toCalendar(String, String[]) method and return null if either condition 
> holds.
> But according to a test case (TestDateUtil), a NullPointerException is 
> expected when text is null. Can you explain why this behaviour was adopted?
> To fix this unexpected exception in my execution path, I have added a check 
> for the empty string in the deprecated method toCalendar(String). (Patch 
> attached.)
> I'm waiting for your comments before committing this patch (or replacing it 
> with my first implementation).
> BR,
> Eric





[jira] [Created] (PDFBOX-1803) StringIndexOutOfBound on DateConverter.toCalendar

2013-12-09 Thread Eric Leleu (JIRA)
Eric Leleu created PDFBOX-1803:
--

 Summary: StringIndexOutOfBound on DateConverter.toCalendar
 Key: PDFBOX-1803
 URL: https://issues.apache.org/jira/browse/PDFBOX-1803
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel, Utilities
Affects Versions: 1.8.3
Reporter: Eric Leleu
Priority: Minor
 Attachments: PDFBox-DateConverter-Br18.patch, 
PDFBox-DateConverter-Trunk.patch

Some PDFs have an empty string as CreationDate and ModDate in the Information 
Dictionary.

According to the PDF specification, these two entries are optional.

My first fix was to check for null and the empty string in the 
toCalendar(String, String[]) method and return null if either condition holds. 
But according to a test case (TestDateUtil), a NullPointerException is expected 
when text is null. Can you explain why this behaviour was adopted?

To fix this unexpected exception in my execution path, I have added a check for 
the empty string in the deprecated method toCalendar(String). (Patch attached.)

I'm waiting for your comments before committing this patch (or replacing it 
with my first implementation).
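
The kind of guard discussed above might look roughly like the following. This is a sketch under stated assumptions, not the attached patch and not the DateConverter API: LenientDateSketch and toCalendarOrNull are hypothetical names, only the full "D:YYYYMMDDHHmmSS" form is handled, and UTC is assumed. A null argument still raises a NullPointerException, as the existing test case reportedly expects, while an empty string returns null instead of failing on an index.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.TimeZone;

/** Hypothetical helper; it does not reproduce PDFBox's DateConverter. */
final class LenientDateSketch {

    /**
     * Returns null for an empty or blank date string, throws
     * NullPointerException for null, and otherwise parses the
     * 14-digit form, with or without the leading "D:".
     */
    static Calendar toCalendarOrNull(String text) {
        if (text == null) {
            throw new NullPointerException("text");
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return null; // empty CreationDate/ModDate: treat as absent
        }
        if (trimmed.startsWith("D:")) {
            trimmed = trimmed.substring(2);
        }
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss");
        format.setLenient(false);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            Calendar calendar = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
            calendar.setTime(format.parse(trimmed));
            return calendar;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }
}
```

Whether null should also map to null is the open question raised in the report; the sketch keeps the two cases separate so either policy can be chosen later.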





[jira] [Updated] (PDFBOX-1802) COSDictionary in COSArray setDirect(true) but dic written indirect

2013-12-09 Thread Cedomir Suljagic (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cedomir Suljagic updated PDFBOX-1802:
-

Summary: COSDictionary in COSArray setDirect(true) but dic written indirect 
 (was: COSDictionary in COSArray both setDirect(true) but dic written indirect)

> COSDictionary in COSArray setDirect(true) but dic written indirect
> --
>
> Key: PDFBOX-1802
> URL: https://issues.apache.org/jira/browse/PDFBOX-1802
> Project: PDFBox
>  Issue Type: Bug
>  Components: Writing
>Affects Versions: 1.8.2
>Reporter: Cedomir Suljagic
>  Labels: cosarray, setdirect
>
>   COSDictionary dic = new COSDictionary();
>   dic.setDirect(true);
>   dic.setItem...
>   COSArray array = new COSArray();
>   array.setDirect(true);
>   array.add(dic);
> Dictionary in array is indirect.
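
The behaviour the reporter expects can be pictured with a toy writer. This is only a conceptual sketch: DirectWriteSketch and Obj are hypothetical, the "/Type /SigRef" label is illustrative, and nothing here reproduces how PDFBox's writer actually decides. An object flagged as direct is written inline where it is used, and any other object is replaced by an indirect reference of the form "n 0 R".

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of direct versus indirect serialization; not PDFBox code. */
final class DirectWriteSketch {

    /** A leaf (label only) or a container (children), with a direct flag. */
    static final class Obj {
        final String label;
        final boolean direct;
        final List<Obj> children = new ArrayList<Obj>();

        Obj(String label, boolean direct) {
            this.label = label;
            this.direct = direct;
        }

        Obj add(Obj child) {
            children.add(child);
            return this;
        }
    }

    /** Objects that were referenced indirectly, in order of first use. */
    private final List<Obj> indirectObjects = new ArrayList<Obj>();

    /** Writes obj inline if it is direct, otherwise as an "n 0 R" reference. */
    String write(Obj obj) {
        if (!obj.direct) {
            int index = indirectObjects.indexOf(obj);
            if (index < 0) {
                indirectObjects.add(obj);
                index = indirectObjects.size() - 1;
            }
            return (index + 1) + " 0 R";
        }
        if (obj.children.isEmpty()) {
            return obj.label;
        }
        StringBuilder out = new StringBuilder("[");
        for (Obj child : obj.children) {
            out.append(' ').append(write(child));
        }
        return out.append(" ]").toString();
    }
}
```

In this model a direct array holding a direct dictionary serializes as "[ << /Type /SigRef >> ]", whereas the reported output corresponds to the "[ 1 0 R ]" form.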





[jira] [Commented] (PDFBOX-1769) Fix crash on invalid xref

2013-12-09 Thread William Palmer (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843017#comment-13843017
 ] 

William Palmer commented on PDFBOX-1769:


Hi Andreas,

Thanks for taking the time to look at this file and add a fix.  Sorry the pdf 
file is corrupt :-S

Regards

Will

> Fix crash on invalid xref
> -
>
> Key: PDFBOX-1769
> URL: https://issues.apache.org/jira/browse/PDFBOX-1769
> Project: PDFBox
>  Issue Type: Wish
>  Components: Parsing
>Affects Versions: 1.8.2
>Reporter: William Palmer
>Assignee: Andreas Lehmkühler
> Fix For: 1.8.4, 2.0.0
>
>
> Need to search for a correct xref start address
> Example file:
> http://digitalcorpora.org/corp/nps/files/govdocs1/020/020747.pdf
> Exception in thread "main" java.io.IOException: Error: Expected an integer 
> type, actual='ref'
> at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1622)
> Using the code:
> PDFTextStripper ts = new PDFTextStripper();
> PrintWriter out = new PrintWriter(new FileWriter(new File(pFile + ".txt")));
> RandomAccess scratchFile = new RandomAccessFile(File.createTempFile("pdfbox-", ".tmp"), "rw");
> PDDocument doc = PDDocument.loadNonSeq(new File(pFile), scratchFile);
> ts.setForceParsing(true);
> ts.writeText(doc, out);
> Related: PDFBOX-1757





[jira] [Updated] (PDFBOX-1802) COSDictionary in COSArray both setDirect(true) but dic written indirect

2013-12-09 Thread Cedomir Suljagic (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cedomir Suljagic updated PDFBOX-1802:
-

Description: 
COSDictionary dic = new COSDictionary();
dic.setDirect(true);
dic.setItem...

COSArray array = new COSArray();
array.setDirect(true);
array.add(dic);

Dictionary in array is indirect.

  was:
COSDictionary dic = new COSDictionary();
dic.setDirect(true);
dic.setItem...

COSArray array = new COSArray();
array.setDirect(true);
array.add(dic);

Array is direct (in parent dictionary), but dictionary in array is indirect.


> COSDictionary in COSArray both setDirect(true) but dic written indirect
> ---
>
> Key: PDFBOX-1802
> URL: https://issues.apache.org/jira/browse/PDFBOX-1802
> Project: PDFBox
>  Issue Type: Bug
>  Components: Writing
>Affects Versions: 1.8.2
>Reporter: Cedomir Suljagic
>  Labels: cosarray, setdirect
>
>   COSDictionary dic = new COSDictionary();
>   dic.setDirect(true);
>   dic.setItem...
>   COSArray array = new COSArray();
>   array.setDirect(true);
>   array.add(dic);
> Dictionary in array is indirect.





[jira] [Updated] (PDFBOX-1802) COSDictionary in COSArray both setDirect(true) but dic written indirect

2013-12-09 Thread Cedomir Suljagic (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cedomir Suljagic updated PDFBOX-1802:
-

Description: 
COSDictionary dic = new COSDictionary();
dic.setDirect(true);
dic.setItem...

COSArray array = new COSArray();
array.setDirect(true);
array.add(dic);

Array is direct (in parent dictionary), but dictionary in array is indirect.

  was:
COSDictionary dic = new COSDictionary();
dic.setDirect(true);
dic.setItem...

COSArray array = new COSArray();
array.setDirect(true);
array.add(dic);

Array is direct, but dictionary in array is indirect.


> COSDictionary in COSArray both setDirect(true) but dic written indirect
> ---
>
> Key: PDFBOX-1802
> URL: https://issues.apache.org/jira/browse/PDFBOX-1802
> Project: PDFBox
>  Issue Type: Bug
>  Components: Writing
>Affects Versions: 1.8.2
>Reporter: Cedomir Suljagic
>  Labels: cosarray, setdirect
>
>   COSDictionary dic = new COSDictionary();
>   dic.setDirect(true);
>   dic.setItem...
>   COSArray array = new COSArray();
>   array.setDirect(true);
>   array.add(dic);
> Array is direct (in parent dictionary), but dictionary in array is indirect.





[jira] [Updated] (PDFBOX-1802) COSDictionary in COSArray both setDirect(true) but dic written indirect

2013-12-09 Thread Cedomir Suljagic (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cedomir Suljagic updated PDFBOX-1802:
-

Description: 
COSDictionary dic = new COSDictionary();
dic.setDirect(true);
sigRefDic.setItem...

COSArray array = new COSArray();
array.setDirect(true);
array.add(dic);

Array is direct, but dictionary in array is indirect.

  was:
COSDictionary dic = new COSDictionary();
dic.setDirect(true);
sigRefDic.setItem...

// Add SigRef to Signature dictionary
COSArray array = new COSArray();
array.setDirect(true);
array.add(dic);

Array is direct, but dictionary in array is indirect.


> COSDictionary in COSArray both setDirect(true) but dic written indirect
> ---
>
> Key: PDFBOX-1802
> URL: https://issues.apache.org/jira/browse/PDFBOX-1802
> Project: PDFBox
>  Issue Type: Bug
>  Components: Writing
>Affects Versions: 1.8.2
>Reporter: Cedomir Suljagic
>  Labels: cosarray, setdirect
>
>   COSDictionary dic = new COSDictionary();
>   dic.setDirect(true);
>   sigRefDic.setItem...
>   COSArray array = new COSArray();
>   array.setDirect(true);
>   array.add(dic);
> Array is direct, but dictionary in array is indirect.





[jira] [Updated] (PDFBOX-1802) COSDictionary in COSArray both setDirect(true) but dic written indirect

2013-12-09 Thread Cedomir Suljagic (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cedomir Suljagic updated PDFBOX-1802:
-

Description: 
COSDictionary dic = new COSDictionary();
dic.setDirect(true);
sigRefDic.setItem...

// Add SigRef to Signature dictionary
COSArray array = new COSArray();
array.setDirect(true);
array.add(dic);

Array is direct, but dictionary in array is indirect.

  was:
COSDictionary dic = new COSDictionary();
dic.setDirect(true);
sigRefDic.setItem...

// Add SigRef to Signature dictionary
COSArray array = new COSArray();
array.setDirect(true);
array.add(dic);


> COSDictionary in COSArray both setDirect(true) but dic written indirect
> ---
>
> Key: PDFBOX-1802
> URL: https://issues.apache.org/jira/browse/PDFBOX-1802
> Project: PDFBox
>  Issue Type: Bug
>  Components: Writing
>Affects Versions: 1.8.2
>Reporter: Cedomir Suljagic
>  Labels: cosarray, setdirect
>
>   COSDictionary dic = new COSDictionary();
>   dic.setDirect(true);
>   sigRefDic.setItem...
>   
>   // Add SigRef to Signature dictionary
>   COSArray array = new COSArray();
>   array.setDirect(true);
>   array.add(dic);
> Array is direct, but dictionary in array is indirect.





[jira] [Updated] (PDFBOX-1802) COSDictionary in COSArray both setDirect(true) but dic written indirect

2013-12-09 Thread Cedomir Suljagic (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cedomir Suljagic updated PDFBOX-1802:
-

Description: 
COSDictionary dic = new COSDictionary();
dic.setDirect(true);
dic.setItem...

COSArray array = new COSArray();
array.setDirect(true);
array.add(dic);

Array is direct, but dictionary in array is indirect.

  was:
COSDictionary dic = new COSDictionary();
dic.setDirect(true);
sigRefDic.setItem...

COSArray array = new COSArray();
array.setDirect(true);
array.add(dic);

Array is direct, but dictionary in array is indirect.


> COSDictionary in COSArray both setDirect(true) but dic written indirect
> ---
>
> Key: PDFBOX-1802
> URL: https://issues.apache.org/jira/browse/PDFBOX-1802
> Project: PDFBox
>  Issue Type: Bug
>  Components: Writing
>Affects Versions: 1.8.2
>Reporter: Cedomir Suljagic
>  Labels: cosarray, setdirect
>
>   COSDictionary dic = new COSDictionary();
>   dic.setDirect(true);
>   dic.setItem...
>   COSArray array = new COSArray();
>   array.setDirect(true);
>   array.add(dic);
> Array is direct, but dictionary in array is indirect.





[jira] [Updated] (PDFBOX-1802) COSDictionary in COSArray both setDirect(true) but dic written indirect

2013-12-09 Thread Cedomir Suljagic (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cedomir Suljagic updated PDFBOX-1802:
-

Summary: COSDictionary in COSArray both setDirect(true) but dic written 
indirect  (was: COSArray setDirect(true) but array written indirect)

> COSDictionary in COSArray both setDirect(true) but dic written indirect
> ---
>
> Key: PDFBOX-1802
> URL: https://issues.apache.org/jira/browse/PDFBOX-1802
> Project: PDFBox
>  Issue Type: Bug
>  Components: Writing
>Affects Versions: 1.8.2
>Reporter: Cedomir Suljagic
>  Labels: cosarray, setdirect
>
>   COSDictionary dic = new COSDictionary();
>   dic.setDirect(true);
>   sigRefDic.setItem...
>   
>   // Add SigRef to Signature dictionary
>   COSArray array = new COSArray();
>   array.setDirect(true);
>   array.add(dic);





[jira] [Updated] (PDFBOX-1802) COSArray setDirect(true) but array written indirect

2013-12-09 Thread Cedomir Suljagic (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cedomir Suljagic updated PDFBOX-1802:
-

Description: 
COSDictionary dic = new COSDictionary();
dic.setDirect(true);
sigRefDic.setItem...

// Add SigRef to Signature dictionary
COSArray array = new COSArray();
array.setDirect(true);
array.add(dic);

> COSArray setDirect(true) but array written indirect
> ---
>
> Key: PDFBOX-1802
> URL: https://issues.apache.org/jira/browse/PDFBOX-1802
> Project: PDFBox
>  Issue Type: Bug
>  Components: Writing
>Affects Versions: 1.8.2
>Reporter: Cedomir Suljagic
>  Labels: cosarray, setdirect
>
>   COSDictionary dic = new COSDictionary();
>   dic.setDirect(true);
>   sigRefDic.setItem...
>   
>   // Add SigRef to Signature dictionary
>   COSArray array = new COSArray();
>   array.setDirect(true);
>   array.add(dic);





Re: [DISCUSS] PDFParser

2013-12-09 Thread Timo Boehme

Hi,

On 07.12.2013 13:39, Maruan Sahyoun wrote:

> I (re-)started working on the new PDFParser. The PDFLexer as a foundation, 
> together with some tests, is ready so far. It might need some more 
> improvements moving forward.


Good news :-)


> I'm currently working on the first part of the parser implementation,
> which is a 'non-caching' parser. It generates PD and COS level
> objects but only keeps the necessary minimum, e.g. Xref, Trailer ...,
> and doesn't keep pages, resources ... in memory. On top of that comes a
> 'caching' parser which keeps what has been parsed. I don't know if
> that's doable, but the idea is that applications like merging or
> splitting PDFs could benefit from a 'non-caching' parser.


Caching could be done using SoftReference, so the extra level might not be 
necessary. Nevertheless, I can think of situations where the different 
behavior could be of benefit, so maybe the parser should be abstracted 
(interface etc.) to allow different implementations.
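
A minimal sketch of the SoftReference idea, assuming a hypothetical SoftObjectCache with a Loader callback that re-parses an object on demand (neither exists in PDFBox; only java.lang.ref.SoftReference is a real JDK class). Cached values may be reclaimed by the garbage collector under memory pressure and are then simply loaded again, which is why a separate non-caching level might be unnecessary.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical cache keyed by object number; values may be reclaimed by GC. */
final class SoftObjectCache<K, V> {

    /** Re-creates a value, for example by parsing it from the file again. */
    interface Loader<K, V> {
        V load(K key);
    }

    private final Map<K, SoftReference<V>> entries = new HashMap<K, SoftReference<V>>();
    private final Loader<K, V> loader;
    private int loads;

    SoftObjectCache(Loader<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, reloading it if it was never loaded or was collected. */
    V get(K key) {
        SoftReference<V> reference = entries.get(key);
        V value = reference == null ? null : reference.get();
        if (value == null) {
            value = loader.load(key);
            loads++;
            entries.put(key, new SoftReference<V>(value));
        }
        return value;
    }

    /** Number of times the loader has been invoked so far. */
    int loadCount() {
        return loads;
    }
}
```

A caller that keeps its own strong reference to a value is guaranteed to see the same instance on the next lookup; everything else is at the collector's discretion.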



> The pure COS level parsing is done (e.g. generating a COS Dictionary
> from tokens) but there are some additional things needed around
> higher level structures, e.g. linearized PDFs. Initially the parser
> reuses most of the existing classes where possible. Unfortunately,
> the COS level classes, for example, don't have a common set of methods
> for instantiating them.
>
> Question: Can we agree on how objects are instantiated, e.g. 
> Obj.getInstance(token) or new Obj(token) ...?


I don't have a specific preference but the factory mentioned by 
Guillaume is a good idea.
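
One way a factory could give a single instantiation entry point is sketched below. The names TokenObjectFactory, Name and fromToken are hypothetical and plain Java types stand in for COS classes; the actual factory proposed by Guillaume is not shown in this thread, so this is only one possible shape.

```java
/** Hypothetical factory; plain Java types stand in for COS objects. */
final class TokenObjectFactory {

    /** Stand-in for a PDF name object such as /Type. */
    static final class Name {
        final String value;

        Name(String value) {
            this.value = value;
        }
    }

    /**
     * Single entry point: callers never choose between constructors and
     * getInstance-style methods, the factory decides from the token text.
     */
    static Object fromToken(String token) {
        if (token.equals("true") || token.equals("false")) {
            return Boolean.valueOf(token);
        }
        if (token.startsWith("/")) {
            return new Name(token.substring(1));
        }
        if (token.matches("[+-]?\\d+")) {
            return Long.valueOf(token);
        }
        if (token.matches("[+-]?(\\d+\\.\\d*|\\.\\d+)")) {
            return Double.valueOf(token);
        }
        return token; // anything else is kept as a raw string in this sketch
    }
}
```

The benefit is that instance reuse or pooling could later be introduced inside the factory without touching the parser code that calls it.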



> This only makes sense if the objects themselves, like pages or
> resources, can be fully cloned so that if objects are cloned or
> imported they no longer have a dependency on the original object.
> This could benefit PDF merging as one could close a no longer needed
> PDF. This will affect the current PD Model, I think.
>
> Question: Can we already clone, and what needs to be done to achieve 
> that? Could we do an importPage() so the imported one is completely 
> independent (and stored in memory or in a file-based cache)?


I'm not sure but I think a deep clone is not supported today.
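
What a deep clone would have to guarantee can be shown on a toy object graph. The sketch assumes a hypothetical Node type instead of real COS or PD classes, and uses an identity map so that shared and cyclic references (for example a page pointing back to its parent) are copied exactly once.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Toy object graph; Node is a stand-in, not a PDFBox class. */
final class DeepCloneSketch {

    static final class Node {
        String value;
        final List<Node> children = new ArrayList<Node>();

        Node(String value) {
            this.value = value;
        }
    }

    /** Returns a copy that shares no Node instance with the original graph. */
    static Node deepCopy(Node root) {
        return copy(root, new IdentityHashMap<Node, Node>());
    }

    private static Node copy(Node source, Map<Node, Node> seen) {
        Node existing = seen.get(source);
        if (existing != null) {
            return existing; // already copied: preserves sharing and cycles
        }
        Node clone = new Node(source.value);
        seen.put(source, clone); // register before recursing to break cycles
        for (Node child : source.children) {
            clone.children.add(copy(child, seen));
        }
        return clone;
    }
}
```

After such a copy the source document could be closed without affecting the imported objects, which is the property the merge use case needs.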


> As the parser parses the PDF, I am thinking about firing events, e.g. to
> react to malformed PDFs. I consider this to be a better approach than
> overriding methods or putting workarounds into the core code.


I think the best way to see what works is to take some workaround examples 
we (should) have now (e.g. finding the real object start by looking 
back/forth, determining the length of a stream, or even using information 
from scanning the file sequentially for object start points) and see how 
they could be realized with events or another approach. At least to me it 
seems that these workarounds need to work quite close to the parser, so 
with events the handlers would need access to low-level functionality.
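
The point about handlers needing low-level access can be made concrete with a small sketch. All names here (XrefRecoverySketch, ParseContext, MalformedListener, onBadXrefOffset) are hypothetical and the recovery strategy is deliberately naive; the sketch only shows that an event handler must receive the raw data and the failing offset to be able to repair anything.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical event-based recovery; not part of any PDFBox parser. */
final class XrefRecoverySketch {

    /** Low-level view handed to listeners: raw data plus the failing offset. */
    static final class ParseContext {
        final String data;
        final int declaredOffset;

        ParseContext(String data, int declaredOffset) {
            this.data = data;
            this.declaredOffset = declaredOffset;
        }
    }

    /** Returns a corrected xref offset, or -1 if the listener cannot help. */
    interface MalformedListener {
        int onBadXrefOffset(ParseContext context);
    }

    private final List<MalformedListener> listeners = new ArrayList<MalformedListener>();

    void addListener(MalformedListener listener) {
        listeners.add(listener);
    }

    /** Uses the declared offset if valid, otherwise asks listeners to repair it. */
    int locateXref(String data, int declaredOffset) {
        if (data.startsWith("xref", declaredOffset)) {
            return declaredOffset;
        }
        ParseContext context = new ParseContext(data, declaredOffset);
        for (MalformedListener listener : listeners) {
            int corrected = listener.onBadXrefOffset(context);
            if (corrected >= 0) {
                return corrected;
            }
        }
        throw new IllegalStateException("No xref table at offset " + declaredOffset);
    }
}
```

A listener such as `context -> context.data.lastIndexOf("xref")` would be enough for the test data used here, although a real handler would also have to tell the table keyword apart from "startxref".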



> What about setting up a sandbox to share some initial code without 
> cluttering the current trunk?


A separate branch for developing the parser until it reaches a usable state 
would be good.



Best,
Timo

--

 Timo Boehme
 OntoChem GmbH
 H.-Damerow-Str. 4
 06120 Halle/Saale
 T: +49 345 4780474
 F: +49 345 4780471
 timo.boe...@ontochem.com




[jira] [Updated] (PDFBOX-1802) COSArray setDirect(true) but array written indirect

2013-12-09 Thread Cedomir Suljagic (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cedomir Suljagic updated PDFBOX-1802:
-

Summary: COSArray setDirect(true) but array written indirect  (was: 
COSArray setDirect(true) makes array indirect)

> COSArray setDirect(true) but array written indirect
> ---
>
> Key: PDFBOX-1802
> URL: https://issues.apache.org/jira/browse/PDFBOX-1802
> Project: PDFBox
>  Issue Type: Bug
>  Components: Writing
>Affects Versions: 1.8.2
>Reporter: Cedomir Suljagic
>  Labels: cosarray, setdirect
>






[jira] [Created] (PDFBOX-1802) COSArray setDirect(true) makes array indirect

2013-12-09 Thread Cedomir Suljagic (JIRA)
Cedomir Suljagic created PDFBOX-1802:


 Summary: COSArray setDirect(true) makes array indirect
 Key: PDFBOX-1802
 URL: https://issues.apache.org/jira/browse/PDFBOX-1802
 Project: PDFBox
  Issue Type: Bug
  Components: Writing
Affects Versions: 1.8.2
Reporter: Cedomir Suljagic





