[jira] [Commented] (PDFBOX-1792) Different metadata extracted with NonSequentialPDFParser vs classic parser on some documents
[ https://issues.apache.org/jira/browse/PDFBOX-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844041#comment-13844041 ] Andreas Lehmkühler commented on PDFBOX-1792: Hmmm, why did you disable the test? Everything works fine for me. > Different metadata extracted with NonSequentialPDFParser vs classic parser on > some documents > > > Key: PDFBOX-1792 > URL: https://issues.apache.org/jira/browse/PDFBOX-1792 > Project: PDFBox > Issue Type: Bug > Components: PDModel >Affects Versions: 1.8.3 >Reporter: Tim Allison >Priority: Minor > Attachments: PDFBOX-1792.tar.gz, testPDF_acroForm2.pdf > > > The traditional parser is able to extract metadata from a test document from > TIKA-738. The NonSequentialPDFParser is not able to extract metadata from > that file. Another file from the Tika test suite has metadata that can be > extracted by the NonSequentialPDFParser but not by classic. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (PDFBOX-1792) Different metadata extracted with NonSequentialPDFParser vs classic parser on some documents
[ https://issues.apache.org/jira/browse/PDFBOX-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-1792: Attachment: testPDF_acroForm2.pdf Classic parser can't extract metadata from testPDF_acroForm2.pdf, but NonSequentialPDFParser can extract metadata from it.
[jira] [Updated] (PDFBOX-1792) Different metadata extracted with NonSequentialPDFParser vs classic parser on some documents
[ https://issues.apache.org/jira/browse/PDFBOX-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-1792: Description: The traditional parser is able to extract metadata from a test document from TIKA-738. The NonSequentialPDFParser is not able to extract metadata from that file. Another file from the Tika test suite has metadata that can be extracted by the NonSequentialPDFParser but not by classic. (was: The traditional parser is able to extract metadata from the Annotation test document from TIKA-738. The NonSequentialPDFParser is not able to extract metadata.) Summary: Different metadata extracted with NonSequentialPDFParser vs classic parser on some documents (was: Metadata not completely extracted with NonSequentialPDFParser on some documents)
[jira] [Commented] (PDFBOX-1806) Metadata not completely extracted by traditional parser, but is extracted by NonSequentialParser
[ https://issues.apache.org/jira/browse/PDFBOX-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843914#comment-13843914 ] Tim Allison commented on PDFBOX-1806: - Sorry about that. It seemed like a different issue to me (in 1806 the fix should be in classic, whereas in 1792, the fix should be in NonSequential), but I see your point. Will modify PDFBOX-1792 to describe a general "out of sync" issue and add the test file from this issue. Thank you! > Metadata not completely extracted by traditional parser, but is extracted by > NonSequentialParser > > > Key: PDFBOX-1806 > URL: https://issues.apache.org/jira/browse/PDFBOX-1806 > Project: PDFBox > Issue Type: Bug > Components: PDModel >Affects Versions: 1.8.3 >Reporter: Tim Allison >Priority: Minor > Attachments: testPDF_acroForm2.pdf > > -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (PDFBOX-1803) StringIndexOutOfBound on DateConverter.toCalendar
[ https://issues.apache.org/jira/browse/PDFBOX-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fred Hansen updated PDFBOX-1803:
Attachment: PDFBOX-DateConverter-Trunk-fred.patch, PDFBOX-DateConverter-1.8-fred.patch

Proposed changes:
- toCalendar(String): return null for an empty string
- toCalendar(String, String[]): return a dummy value for null (instead of returning null); added an example of supplying a format
- toCalendar(COSString): strengthened the deprecation
- parseDate: tests directly for null and the empty string
- Improved the JavaDoc for these methods

TestDateUtil:
- changed testExtract to testToCalendar and incorporated tests for null and empty strings into it
- tests the new example of toCalendar(String, String[])
- added tests for null and empty strings

> StringIndexOutOfBound on DateConverter.toCalendar
>
> Key: PDFBOX-1803
> URL: https://issues.apache.org/jira/browse/PDFBOX-1803
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel, Utilities
> Affects Versions: 1.8.3
> Reporter: Eric Leleu
> Priority: Minor
> Attachments: PDFBOX-DateConverter-1.8-fred.patch, PDFBOX-DateConverter-Trunk-fred.patch, PDFBox-DateConverter-Br18.patch, PDFBox-DateConverter-Trunk.patch
>
> Some PDFs have an empty string as CreationDate & ModDate in the Information Dictionary.
> According to the PDF specification, these two elements are optional.
> My first fix was to test for null and the empty string in the toCalendar(String, String[]) method and to return null if either condition holds.
> But according to a test case (TestDateUtil), a NullPointerException is expected for a null value of text. Can you explain why this behaviour was adopted?
> To fix this unexpected exception in my execution path, I have added a test for the empty string in the deprecated method toCalendar(String). (Patch attached.)
> I'm waiting for your comments before committing this patch (or replacing it with my first implementation).
> BR,
> Eric
Problem committing txt resources to svn
Hello, has anyone had similar problems committing text files? I'm on a Linux box and the file is UTF-16 LE encoded. I configured my SVN client as described in the beginner's guide and added the content of http://www.apache.org/dev/svn-eol-style.txt to the config. My svn client throws this (shortened) error: svn: E29: Kann »svn:eol-style« nicht setzen: Datei ».../testAnnotations.pdf-sorted.txt« hat die MIME-Typ Eigenschaft »binär« In English it should be something like this: svn: E29: Cannot set 'svn:eol-style': file '.../testAnnotations.pdf-sorted.txt' has binary mime type property. Do I need to add svn:mime-type=text/plain or something else to the config for *.txt files? Best regards, Thomas
[jira] [Commented] (PDFBOX-1792) Metadata not completely extracted with NonSequentialPDFParser on some documents
[ https://issues.apache.org/jira/browse/PDFBOX-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843625#comment-13843625 ] Thomas Chojecki commented on PDFBOX-1792: I've run the test, and both parsers have some parsing problems with the existing test files. I will commit (pdfbox 1.8.x branch) and rename the test so that it is not run automatically. It would also be great to use JUnit 4 instead of JUnit 3, so that such tests can be skipped with the @Ignore annotation.
[jira] [Resolved] (PDFBOX-1806) Metadata not completely extracted by traditional parser, but is extracted by NonSequentialParser
[ https://issues.apache.org/jira/browse/PDFBOX-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Chojecki resolved PDFBOX-1806. Resolution: Duplicate Please do not open new issues for known problems. Use the existing one for file uploads. The two parsers work differently and we try our best to keep them in sync. So if you have new files or more information, just comment on PDFBOX-1792 or edit its description if necessary.
[jira] [Commented] (PDFBOX-1803) StringIndexOutOfBound on DateConverter.toCalendar
[ https://issues.apache.org/jira/browse/PDFBOX-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843539#comment-13843539 ] Fred Hansen commented on PDFBOX-1803:

Upon careful consideration, a different change seems right for the new toCalendar(String, String[]). For discussion, we can construct a matrix of possible error conditions and treatments: for each condition (null, empty string, illegal date), what should the response be (exception, return null, or return a dummy value with an illegal year)?

h3. toCalendar(String) _deprecated method_
For compatibility, as this bug report suggests, {{null}} should be returned for an empty string argument:
||Condition||Current result||Proposed result||
|null input|{{null}}| |
|empty string|-exception-|+{{null}}+|
|illegal date|exception| |

h3. toCalendar(String, String[]) _replacement method_
Rather than ever producing a non-Calendar value, I propose to revise this method:
||Condition||Current result||Proposed result||
|null input|-{{null}}-|+dummy date+|
|empty string|dummy date| |
|illegal date|dummy date| |

Unless there are objections, I will revise toCalendar(String, String[]) so it returns a dummy Calendar in all cases.
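The two behaviours in the matrix can be illustrated with a small self-contained sketch. This is not the PDFBox DateConverter: the class LenientDates, both method names, the sentinel year 999 and the simplified "D:yyyyMMddHHmmss" input format are assumptions made only for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical helper, not part of PDFBox.
final class LenientDates {
    // Assumed sentinel year marking "no valid date".
    static final int INVALID_YEAR = 999;

    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // "Deprecated method" row: null and empty input give null,
    // an illegal date still raises an exception.
    static Calendar parseOrNull(String text) {
        if (text == null || text.trim().isEmpty()) {
            return null;
        }
        String digits = text.startsWith("D:") ? text.substring(2) : text;
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss");
        format.setLenient(false);
        format.setTimeZone(UTC);
        try {
            Calendar calendar = new GregorianCalendar(UTC);
            calendar.setTime(format.parse(digits));
            return calendar;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Illegal date: " + text, e);
        }
    }

    // "Replacement method" row: always returns a Calendar,
    // using a dummy value for null, empty and illegal input.
    static Calendar parseOrDummy(String text) {
        try {
            Calendar parsed = parseOrNull(text);
            return parsed != null ? parsed : dummy();
        } catch (IllegalArgumentException e) {
            return dummy();
        }
    }

    private static Calendar dummy() {
        Calendar calendar = new GregorianCalendar(UTC);
        calendar.clear();
        calendar.set(INVALID_YEAR, Calendar.JANUARY, 1);
        return calendar;
    }
}
```

With the second style a caller never has to check for null, but has to compare the year against the sentinel to detect a missing date.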
[jira] [Updated] (PDFBOX-1806) Metadata not completely extracted by traditional parser, but is extracted by NonSequentialParser
[ https://issues.apache.org/jira/browse/PDFBOX-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-1806: Attachment: testPDF_acroForm2.pdf Example file attached. Test case code in PDFBOX-1792 reveals this issue.
[jira] [Created] (PDFBOX-1806) Metadata not completely extracted by traditional parser, but is extracted by NonSequentialParser
Tim Allison created PDFBOX-1806: --- Summary: Metadata not completely extracted by traditional parser, but is extracted by NonSequentialParser Key: PDFBOX-1806 URL: https://issues.apache.org/jira/browse/PDFBOX-1806 Project: PDFBox Issue Type: Bug Components: PDModel Affects Versions: 1.8.3 Reporter: Tim Allison Priority: Minor -- This message was sent by Atlassian JIRA (v6.1.4#6159)
Re: Building/enhancing a test suite for PDFBox
Yes, that’s my observation too. In addition, the Bavaria test suite also covers conforming documents, whereas Isartor only contains non-conforming documents (from a PDF/A perspective). So it’s more generic. Maruan Sahyoun
Re: Building/enhancing a test suite for PDFBox
Hi, what is in place for PDF/A validation is too specific: as you said, we only expect an error code (as we only validate Isartor files). The Bavaria test suite has a format that handles both conforming and non-conforming files; it is IMO a better source of inspiration. BR, Guillaume
[jira] [Commented] (PDFBOX-1803) StringIndexOutOfBound on DateConverter.toCalendar
[ https://issues.apache.org/jira/browse/PDFBOX-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843250#comment-13843250 ] Fred Hansen commented on PDFBOX-1803: Yes, I missed the empty string case in toCalendar(String). The proposed patch is fine, as far as it goes. I will produce an extended patch that incorporates the proposal, amends the JavaDoc, and also does both for the new toCalendar(String, String[]). In addition, I'll add words to the JavaDoc for toCalendar(COSString). This method needs to be completely removed if DateConverter is to be part of a utility package that does not depend on org.apache.pdfbox.
Re: Building/enhancing a test suite for PDFBox
Hi, I fully agree that the target should be to have automated tests; without that, the benefit will be limited. As for error codes/messages, we could reuse/generalize what’s in place for the PDF/A validator. The Bavaria test suite from PDFlib also has a good set of test/result descriptions. BR Maruan Sahyoun
[jira] [Commented] (PDFBOX-1803) StringIndexOutOfBound on DateConverter.toCalendar
[ https://issues.apache.org/jira/browse/PDFBOX-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843232#comment-13843232 ] Tilman Hausherr commented on PDFBOX-1803: You might want to ask Fred Hansen, see PDFBOX-1633.
Re: Building/enhancing a test suite for PDFBox
Hi,

this would be a valuable resource, especially if the tests can be automated. For that we need to somehow specify the expected result (exception, warning, result document/text) for automated processing. Maybe we should start using error codes?

Best,
Timo

On 08.12.2013 at 15:43, Maruan Sahyoun wrote:
> Hi,
>
> as we are handling and closing issues using PDFs provided by users of the library, what do you think about adding these files to a test suite where they can be used to check the handling of specific issues?
>
> The benefit would be that we can write tests around these issues to ensure that forthcoming releases are still able to handle these files.
>
> An idea for a naming convention would be something like <issue number>-<description>, e.g. 1769-invalid_xref.pdf
>
> WDYT
>
> Maruan Sahyoun

--
Timo Boehme
OntoChem GmbH
H.-Damerow-Str. 4
06120 Halle/Saale
T: +49 345 4780474
F: +49 345 4780471
timo.boe...@ontochem.com
_
OntoChem GmbH
Managing Director: Dr. Lutz Weber
Registered office: Halle / Saale
Register court: Stendal
Registration number: HRB 215461
_
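As a rough sketch of how the proposed naming convention and an expected result could be tied together for automated runs: everything below is hypothetical (the class IssueTestFile, the Expected enum and the exact file-name pattern are assumptions, not existing PDFBox test code), and the outcome passed in the usage line is arbitrary.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical descriptor for one file in an issue-based test corpus.
final class IssueTestFile {
    // The result categories mentioned in the thread, plus plain success.
    enum Expected { SUCCESS, WARNING, EXCEPTION }

    // Assumed pattern: issue number, a dash, a description, ".pdf".
    private static final Pattern NAME = Pattern.compile("(\\d+)-([A-Za-z0-9_]+)\\.pdf");

    final int issueNumber;
    final String description;
    final Expected expected;

    private IssueTestFile(int issueNumber, String description, Expected expected) {
        this.issueNumber = issueNumber;
        this.description = description;
        this.expected = expected;
    }

    // Splits a file name such as "1769-invalid_xref.pdf" into its parts.
    static IssueTestFile parse(String fileName, Expected expected) {
        Matcher matcher = NAME.matcher(fileName);
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Not an issue test file: " + fileName);
        }
        return new IssueTestFile(Integer.parseInt(matcher.group(1)), matcher.group(2), expected);
    }

    public static void main(String[] args) {
        // The expected outcome here is chosen arbitrarily for the demo.
        IssueTestFile file = parse("1769-invalid_xref.pdf", Expected.WARNING);
        System.out.println("PDFBOX-" + file.issueNumber + " " + file.description + " " + file.expected);
    }
}
```

A test runner could iterate over such descriptors and compare the actual outcome of parsing each file against the declared one; where the expectations are stored (file name, sidecar file, error code table) is left open here, as it is in the thread.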
[jira] [Updated] (PDFBOX-1805) PDFTextStripper, add word segment even if the last word is a space
[ https://issues.apache.org/jira/browse/PDFBOX-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Phillips updated PDFBOX-1805:

Description:
I found that, in some PDFs, when the "line" normalization does not inject a word separator at a gap that is wider than the space expected between words, text "fields" that should be separate (they are not really part of the paragraph) are improperly added to the line of text.

In the attached PDF, in the first line of the first code violation, the "Corrected By" date is incorrectly added to the same line as the "Description of Violation" text. This happens because the first line of "Description of Violation" ends with a space, which comes from the word wrapping of the paragraph when it was generated. I believe that if the next letter in the line starts further away than an expected space, it should be considered a second segment, regardless of whether the preceding text ends in a space.

I suggest the following change in the PDFTextStripper file (I commented out the last two conditions of the if statement):

    // Test if our TextPosition starts after a new word would be expected to start.
    if (expectedStartOfNextWordX != EXPECTEDSTARTOFNEXTWORDX_RESET_VALUE
            && expectedStartOfNextWordX < positionX) /* &&
            // only bother adding a space if the last character was not a space
            lastPosition.getTextPosition().getCharacter() != null &&
            !lastPosition.getTextPosition().getCharacter().endsWith( " " ) ) */
    {
        line.add(WordSeparator.getSeparator());
    }

(was: the same description without the suggested code change)

> PDFTextStripper, add word segment even if the last word is a space
>
> Key: PDFBOX-1805
> URL: https://issues.apache.org/jira/browse/PDFBOX-1805
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.3
> Reporter: Andy Phillips
> Attachments: 36048C3D-54B9-4862-91AA-C94B33C11027.pdf
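To make the effect of the suggested change concrete, here is a minimal stand-alone sketch of the two conditions. It does not use PDFBox classes: WordGapSketch, both method names and the RESET_VALUE placeholder are assumptions, and the coordinates in main are made up.

```java
// Hypothetical stand-alone model of the separator decision, not PDFBox code.
final class WordGapSketch {
    // Placeholder for the "no expectation yet" marker; the value is an assumption.
    static final float RESET_VALUE = -Float.MAX_VALUE;

    // Condition as quoted in the issue: no separator if the last character ends with a space.
    static boolean currentRule(float expectedStartOfNextWordX, float positionX, String lastCharacter) {
        return expectedStartOfNextWordX != RESET_VALUE
                && expectedStartOfNextWordX < positionX
                && lastCharacter != null
                && !lastCharacter.endsWith(" ");
    }

    // Condition as proposed: only the horizontal gap decides.
    static boolean proposedRule(float expectedStartOfNextWordX, float positionX) {
        return expectedStartOfNextWordX != RESET_VALUE
                && expectedStartOfNextWordX < positionX;
    }

    public static void main(String[] args) {
        // Text ending in a space, followed by a field that starts well past the expected position.
        System.out.println(currentRule(100f, 150f, " "));  // false: the field is merged into the line
        System.out.println(proposedRule(100f, 150f));      // true: a separator is inserted
    }
}
```

The trade-off is that the proposed condition may also add a separator after text that already ends in a space, so extracted lines can then contain an extra word boundary at such positions.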
[jira] [Updated] (PDFBOX-1805) PDFTextStripper, add word segment even if the last word is a space
[ https://issues.apache.org/jira/browse/PDFBOX-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Phillips updated PDFBOX-1805: Attachment: 36048C3D-54B9-4862-91AA-C94B33C11027.pdf
[jira] [Created] (PDFBOX-1805) PDFTextStripper, add word segment even if the last word is a space
Andy Phillips created PDFBOX-1805: - Summary: PDFTextStripper, add word segment even if the last word is a space Key: PDFBOX-1805 URL: https://issues.apache.org/jira/browse/PDFBOX-1805 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 1.8.3 Reporter: Andy Phillips I found that, in some PDFs, not injecting a WordSpacing in a line that is greater than expected for a space in the "line" normalization, causes text "fields" that should be separated (as they are not really part of the paragraph) to be improperly added to the line of text. In the attached pdf, i have found that looking at the first line of the first violation of code, that the "Corrected By" date is incorrectly added to the same line of Description of Violation. This is due to the fact that the first line of "Description of Violation" ends with a space. This is due to word wrapping of the paragraph when it was generated and i believe that if the next letter in the line is greater than an expected space, regardless if the last line ends in a space, it should be considered a second segment. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Created] (PDFBOX-1804) PDFTextStripper Issue related to word positions not correctly being parsed
Andy Phillips created PDFBOX-1804:

Summary: PDFTextStripper Issue related to word positions not correctly being parsed
Key: PDFBOX-1804
URL: https://issues.apache.org/jira/browse/PDFBOX-1804
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.3
Reporter: Andy Phillips

While pulling text from a PDF with a custom PDFTextStripper subclass that overrides writeString(String text, List textPositions), I found that I was getting the wrong textPositions: they did not line up with the text. The text positions of all “words” in a line always come over as the text positions of the last word in the line.

I traced the issue to the PDFTextStripper class. Here is the code in question:

    /**
     * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
     * @return The StringBuilder that must be used when calling this method.
     */
    private StringBuilder normalizeAdd(LinkedList normalized, StringBuilder lineBuilder, List wordPositions, TextPosition text)
    {
        if (text instanceof WordSeparator)
        {
            normalized.add(createWord(lineBuilder.toString(), wordPositions));
            lineBuilder = new StringBuilder();
            wordPositions.clear();
        }
        else
        {
            lineBuilder.append(text.getCharacter());
            wordPositions.add(text);
        }
        return lineBuilder;
    }

In normalizeAdd, a new word is created by passing wordPositions. A reference to wordPositions is stored in the new WordWithTextPositions in the normalized linked list, but on the next line the list is cleared with clear(). Since wordPositions was passed as a reference, the positions inside the WordWithTextPositions that was just created are cleared as well.

So I would suggest passing a copy of the list instead:

    /**
     * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
     * @return The StringBuilder that must be used when calling this method.
     */
    private StringBuilder normalizeAdd(LinkedList normalized, StringBuilder lineBuilder, List wordPositions, TextPosition text)
    {
        if (text instanceof WordSeparator)
        {
            // Copy the positions so the later clear() does not empty the stored word.
            normalized.add(createWord(lineBuilder.toString(), new ArrayList(wordPositions)));
            lineBuilder = new StringBuilder();
            wordPositions.clear();
        }
        else
        {
            lineBuilder.append(text.getCharacter());
            wordPositions.add(text);
        }
        return lineBuilder;
    }

-- This message was sent by Atlassian JIRA (v6.1.4#6159)
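The aliasing problem reported in PDFBOX-1804 can be reproduced without PDFBox. The following is a minimal standalone sketch, not PDFBox code: the names `AliasingDemo`, `Word` and `split` are hypothetical, and character indices stand in for TextPosition objects. It shows that storing the shared buffer list in each word and then calling clear() empties every stored word, while storing a copy keeps the positions intact.

```java
import java.util.ArrayList;
import java.util.List;

class AliasingDemo {
    /** Hypothetical stand-in for a word plus its positions. */
    static class Word {
        final String text;
        final List<Integer> positions;

        Word(String text, List<Integer> positions) {
            this.text = text;
            this.positions = positions;
        }
    }

    /**
     * Splits a line on spaces and records the character indices of each word.
     * If copy is false, every Word keeps a reference to the same buffer list,
     * which is cleared after each word (the reported bug).
     */
    static List<Word> split(String line, boolean copy) {
        List<Word> words = new ArrayList<Word>();
        StringBuilder current = new StringBuilder();
        List<Integer> positions = new ArrayList<Integer>();
        for (int i = 0; i <= line.length(); i++) {
            boolean separator = i == line.length() || line.charAt(i) == ' ';
            if (separator) {
                if (current.length() > 0) {
                    List<Integer> stored = copy ? new ArrayList<Integer>(positions) : positions;
                    words.add(new Word(current.toString(), stored));
                    current = new StringBuilder();
                    positions.clear(); // also empties 'stored' when it is not a copy
                }
            } else {
                current.append(line.charAt(i));
                positions.add(i);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(split("ab cd", false).get(0).positions); // shared list
        System.out.println(split("ab cd", true).get(0).positions);  // defensive copy
    }
}
```

The copy costs one small list allocation per word, which seems a reasonable price for making each stored word independent of the reusable buffer.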
[jira] [Updated] (PDFBOX-1803) StringIndexOutOfBound on DateConverter.toCalendar
[ https://issues.apache.org/jira/browse/PDFBOX-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Leleu updated PDFBOX-1803:

Description:
Some PDFs have an empty string as CreationDate and ModDate in the Information Dictionary. According to the PDF specification, these two elements are optional.
My first fix was to test for null and the empty string in the toCalendar(String, String[]) method and return null if either condition holds. But according to a test case (TestDateUtil), a NullPointerException is expected when text is null. Can you explain why this behaviour was adopted?
To fix this unexpected exception in my execution path, I have added a check for the empty string in the deprecated method toCalendar(String). (Patch attached.)
I'm waiting for your comments before committing this patch (or replacing it with my first implementation).
BR,
Eric

was:
Some PDFs have an empty string as CreationDate and ModDate in the Information Dictionary. According to the PDF specification, these two elements are optional.
My first fix was to test for null and the empty string in the toCalendar(String, String[]) method and return null if either condition holds. But according to a test case (TestDateUtil), a NullPointerException is expected when text is null. Can you explain why this behaviour was adopted?
To fix this unexpected exception in my execution path, I have added a check for the empty string in the deprecated method toCalendar(String). (Patch attached.)
I'm waiting for your comments before committing this patch (or replacing it with my first implementation).

> StringIndexOutOfBound on DateConverter.toCalendar
>
> Key: PDFBOX-1803
> URL: https://issues.apache.org/jira/browse/PDFBOX-1803
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel, Utilities
> Affects Versions: 1.8.3
> Reporter: Eric Leleu
> Priority: Minor
> Attachments: PDFBox-DateConverter-Br18.patch, PDFBox-DateConverter-Trunk.patch
>
> Some PDFs have an empty string as CreationDate and ModDate in the Information Dictionary.
> According to the PDF specification, these two elements are optional.
> My first fix was to test for null and the empty string in the toCalendar(String, String[]) method and return null if either condition holds.
> But according to a test case (TestDateUtil), a NullPointerException is expected when text is null. Can you explain why this behaviour was adopted?
> To fix this unexpected exception in my execution path, I have added a check for the empty string in the deprecated method toCalendar(String). (Patch attached.)
> I'm waiting for your comments before committing this patch (or replacing it with my first implementation).
> BR,
> Eric

-- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (PDFBOX-1803) StringIndexOutOfBound on DateConverter.toCalendar
[ https://issues.apache.org/jira/browse/PDFBOX-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Leleu updated PDFBOX-1803:

Attachment: PDFBox-DateConverter-Trunk.patch
            PDFBox-DateConverter-Br18.patch

> StringIndexOutOfBound on DateConverter.toCalendar
>
> Key: PDFBOX-1803
> URL: https://issues.apache.org/jira/browse/PDFBOX-1803
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel, Utilities
> Affects Versions: 1.8.3
> Reporter: Eric Leleu
> Priority: Minor
> Attachments: PDFBox-DateConverter-Br18.patch, PDFBox-DateConverter-Trunk.patch
>
> Some PDFs have an empty string as CreationDate and ModDate in the Information Dictionary.
> According to the PDF specification, these two elements are optional.
> My first fix was to test for null and the empty string in the toCalendar(String, String[]) method and return null if either condition holds.
> But according to a test case (TestDateUtil), a NullPointerException is expected when text is null. Can you explain why this behaviour was adopted?
> To fix this unexpected exception in my execution path, I have added a check for the empty string in the deprecated method toCalendar(String). (Patch attached.)
> I'm waiting for your comments before committing this patch (or replacing it with my first implementation).
> BR,
> Eric

-- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Created] (PDFBOX-1803) StringIndexOutOfBound on DateConverter.toCalendar
Eric Leleu created PDFBOX-1803:

Summary: StringIndexOutOfBound on DateConverter.toCalendar
Key: PDFBOX-1803
URL: https://issues.apache.org/jira/browse/PDFBOX-1803
Project: PDFBox
Issue Type: Bug
Components: PDModel, Utilities
Affects Versions: 1.8.3
Reporter: Eric Leleu
Priority: Minor
Attachments: PDFBox-DateConverter-Br18.patch, PDFBox-DateConverter-Trunk.patch

Some PDFs have an empty string as CreationDate and ModDate in the Information Dictionary. According to the PDF specification, these two elements are optional.

My first fix was to test for null and the empty string in the toCalendar(String, String[]) method and return null if either condition holds. But according to a test case (TestDateUtil), a NullPointerException is expected when text is null. Can you explain why this behaviour was adopted?

To fix this unexpected exception in my execution path, I have added a check for the empty string in the deprecated method toCalendar(String). (Patch attached.)

I'm waiting for your comments before committing this patch (or replacing it with my first implementation).

-- This message was sent by Atlassian JIRA (v6.1.4#6159)
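The kind of guard discussed in PDFBOX-1803 could look roughly like the sketch below. It is not the attached patch and not PDFBox's DateConverter: the class `SafeDateParsing` and the method `toCalendarOrNull` are hypothetical names, and the parser only handles the leading "D:YYYYMMDD" part of a PDF date string. The point is simply that null and empty input are checked before any substring call, so no StringIndexOutOfBoundsException can occur.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

class SafeDateParsing {
    /**
     * Hypothetical helper: returns null for null, empty or unparseable input
     * instead of throwing. Only the year, month and day of a PDF date string
     * such as "D:20131210" are read; time and time zone are ignored here.
     */
    static Calendar toCalendarOrNull(String text) {
        if (text == null || text.trim().isEmpty()) {
            return null; // empty CreationDate/ModDate: treat as "no date"
        }
        String s = text.trim();
        if (s.startsWith("D:")) {
            s = s.substring(2);
        }
        if (s.length() < 4) {
            return null; // not even a full year
        }
        try {
            int year = Integer.parseInt(s.substring(0, 4));
            int month = s.length() >= 6 ? Integer.parseInt(s.substring(4, 6)) : 1;
            int day = s.length() >= 8 ? Integer.parseInt(s.substring(6, 8)) : 1;
            // GregorianCalendar months are zero-based.
            return new GregorianCalendar(year, month - 1, day);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toCalendarOrNull("") == null);
        System.out.println(toCalendarOrNull("D:20131210").get(Calendar.YEAR));
    }
}
```

Whether null input should return null or throw, as TestDateUtil currently expects, is the open question in the issue; this sketch takes the "return null" side only for illustration.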
[jira] [Updated] (PDFBOX-1802) COSDictionary in COSArray setDirect(true) but dic written indirect
[ https://issues.apache.org/jira/browse/PDFBOX-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cedomir Suljagic updated PDFBOX-1802: - Summary: COSDictionary in COSArray setDirect(true) but dic written indirect (was: COSDictionary in COSArray both setDirect(true) but dic written indirect) > COSDictionary in COSArray setDirect(true) but dic written indirect > -- > > Key: PDFBOX-1802 > URL: https://issues.apache.org/jira/browse/PDFBOX-1802 > Project: PDFBox > Issue Type: Bug > Components: Writing >Affects Versions: 1.8.2 >Reporter: Cedomir Suljagic > Labels: cosarray, setdirect > > COSDictionary dic = new COSDictionary(); > dic.setDirect(true); > dic.setItem... > COSArray array = new COSArray(); > array.setDirect(true); > array.add(dic); > Dictionary in array is indirect. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (PDFBOX-1769) Fix crash on invalid xref
[ https://issues.apache.org/jira/browse/PDFBOX-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843017#comment-13843017 ]

William Palmer commented on PDFBOX-1769:

Hi Andreas,
Thanks for taking the time to look at this file and add a fix. Sorry the pdf file is corrupt :-S
Regards
Will

> Fix crash on invalid xref
>
> Key: PDFBOX-1769
> URL: https://issues.apache.org/jira/browse/PDFBOX-1769
> Project: PDFBox
> Issue Type: Wish
> Components: Parsing
> Affects Versions: 1.8.2
> Reporter: William Palmer
> Assignee: Andreas Lehmkühler
> Fix For: 1.8.4, 2.0.0
>
> Need to search for a correct xref start address
> Example file:
> http://digitalcorpora.org/corp/nps/files/govdocs1/020/020747.pdf
> Exception in thread "main" java.io.IOException: Error: Expected an integer type, actual='ref'
> at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1622)
> Using the code:
> PDFTextStripper ts = new PDFTextStripper();
> PrintWriter out = new PrintWriter(new FileWriter(new File (pFile+".txt")));
> RandomAccess scratchFile = new RandomAccessFile(File.createTempFile("pdfbox-", ".tmp"), "rw");
> PDDocument doc = PDDocument.loadNonSeq(new File(pFile), scratchFile);
> ts.setForceParsing(true);
> ts.writeText(doc, out);
> Related: PDFBOX-1757

-- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (PDFBOX-1802) COSDictionary in COSArray both setDirect(true) but dic written indirect
[ https://issues.apache.org/jira/browse/PDFBOX-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cedomir Suljagic updated PDFBOX-1802: - Description: COSDictionary dic = new COSDictionary(); dic.setDirect(true); dic.setItem... COSArray array = new COSArray(); array.setDirect(true); array.add(dic); Dictionary in array is indirect. was: COSDictionary dic = new COSDictionary(); dic.setDirect(true); dic.setItem... COSArray array = new COSArray(); array.setDirect(true); array.add(dic); Array is direct (in parent dictionary), but dictionary in array is indirect. > COSDictionary in COSArray both setDirect(true) but dic written indirect > --- > > Key: PDFBOX-1802 > URL: https://issues.apache.org/jira/browse/PDFBOX-1802 > Project: PDFBox > Issue Type: Bug > Components: Writing >Affects Versions: 1.8.2 >Reporter: Cedomir Suljagic > Labels: cosarray, setdirect > > COSDictionary dic = new COSDictionary(); > dic.setDirect(true); > dic.setItem... > COSArray array = new COSArray(); > array.setDirect(true); > array.add(dic); > Dictionary in array is indirect. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (PDFBOX-1802) COSDictionary in COSArray both setDirect(true) but dic written indirect
[ https://issues.apache.org/jira/browse/PDFBOX-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cedomir Suljagic updated PDFBOX-1802: - Description: COSDictionary dic = new COSDictionary(); dic.setDirect(true); dic.setItem... COSArray array = new COSArray(); array.setDirect(true); array.add(dic); Array is direct (in parent dictionary), but dictionary in array is indirect. was: COSDictionary dic = new COSDictionary(); dic.setDirect(true); dic.setItem... COSArray array = new COSArray(); array.setDirect(true); array.add(dic); Array is direct, but dictionary in array is indirect. > COSDictionary in COSArray both setDirect(true) but dic written indirect > --- > > Key: PDFBOX-1802 > URL: https://issues.apache.org/jira/browse/PDFBOX-1802 > Project: PDFBox > Issue Type: Bug > Components: Writing >Affects Versions: 1.8.2 >Reporter: Cedomir Suljagic > Labels: cosarray, setdirect > > COSDictionary dic = new COSDictionary(); > dic.setDirect(true); > dic.setItem... > COSArray array = new COSArray(); > array.setDirect(true); > array.add(dic); > Array is direct (in parent dictionary), but dictionary in array is indirect. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (PDFBOX-1802) COSDictionary in COSArray both setDirect(true) but dic written indirect
[ https://issues.apache.org/jira/browse/PDFBOX-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cedomir Suljagic updated PDFBOX-1802: - Description: COSDictionary dic = new COSDictionary(); dic.setDirect(true); sigRefDic.setItem... COSArray array = new COSArray(); array.setDirect(true); array.add(dic); Array is direct, but dictionary in array is indirect. was: COSDictionary dic = new COSDictionary(); dic.setDirect(true); sigRefDic.setItem... // Add SigRef to Signature dictionary COSArray array = new COSArray(); array.setDirect(true); array.add(dic); Array is direct, but dictionary in array is indirect. > COSDictionary in COSArray both setDirect(true) but dic written indirect > --- > > Key: PDFBOX-1802 > URL: https://issues.apache.org/jira/browse/PDFBOX-1802 > Project: PDFBox > Issue Type: Bug > Components: Writing >Affects Versions: 1.8.2 >Reporter: Cedomir Suljagic > Labels: cosarray, setdirect > > COSDictionary dic = new COSDictionary(); > dic.setDirect(true); > sigRefDic.setItem... > COSArray array = new COSArray(); > array.setDirect(true); > array.add(dic); > Array is direct, but dictionary in array is indirect. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (PDFBOX-1802) COSDictionary in COSArray both setDirect(true) but dic written indirect
[ https://issues.apache.org/jira/browse/PDFBOX-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cedomir Suljagic updated PDFBOX-1802: - Description: COSDictionary dic = new COSDictionary(); dic.setDirect(true); sigRefDic.setItem... // Add SigRef to Signature dictionary COSArray array = new COSArray(); array.setDirect(true); array.add(dic); Array is direct, but dictionary in array is indirect. was: COSDictionary dic = new COSDictionary(); dic.setDirect(true); sigRefDic.setItem... // Add SigRef to Signature dictionary COSArray array = new COSArray(); array.setDirect(true); array.add(dic); > COSDictionary in COSArray both setDirect(true) but dic written indirect > --- > > Key: PDFBOX-1802 > URL: https://issues.apache.org/jira/browse/PDFBOX-1802 > Project: PDFBox > Issue Type: Bug > Components: Writing >Affects Versions: 1.8.2 >Reporter: Cedomir Suljagic > Labels: cosarray, setdirect > > COSDictionary dic = new COSDictionary(); > dic.setDirect(true); > sigRefDic.setItem... > > // Add SigRef to Signature dictionary > COSArray array = new COSArray(); > array.setDirect(true); > array.add(dic); > Array is direct, but dictionary in array is indirect. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (PDFBOX-1802) COSDictionary in COSArray both setDirect(true) but dic written indirect
[ https://issues.apache.org/jira/browse/PDFBOX-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cedomir Suljagic updated PDFBOX-1802: - Description: COSDictionary dic = new COSDictionary(); dic.setDirect(true); dic.setItem... COSArray array = new COSArray(); array.setDirect(true); array.add(dic); Array is direct, but dictionary in array is indirect. was: COSDictionary dic = new COSDictionary(); dic.setDirect(true); sigRefDic.setItem... COSArray array = new COSArray(); array.setDirect(true); array.add(dic); Array is direct, but dictionary in array is indirect. > COSDictionary in COSArray both setDirect(true) but dic written indirect > --- > > Key: PDFBOX-1802 > URL: https://issues.apache.org/jira/browse/PDFBOX-1802 > Project: PDFBox > Issue Type: Bug > Components: Writing >Affects Versions: 1.8.2 >Reporter: Cedomir Suljagic > Labels: cosarray, setdirect > > COSDictionary dic = new COSDictionary(); > dic.setDirect(true); > dic.setItem... > COSArray array = new COSArray(); > array.setDirect(true); > array.add(dic); > Array is direct, but dictionary in array is indirect. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (PDFBOX-1802) COSDictionary in COSArray both setDirect(true) but dic written indirect
[ https://issues.apache.org/jira/browse/PDFBOX-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cedomir Suljagic updated PDFBOX-1802: - Summary: COSDictionary in COSArray both setDirect(true) but dic written indirect (was: COSArray setDirect(true) but array written indirect) > COSDictionary in COSArray both setDirect(true) but dic written indirect > --- > > Key: PDFBOX-1802 > URL: https://issues.apache.org/jira/browse/PDFBOX-1802 > Project: PDFBox > Issue Type: Bug > Components: Writing >Affects Versions: 1.8.2 >Reporter: Cedomir Suljagic > Labels: cosarray, setdirect > > COSDictionary dic = new COSDictionary(); > dic.setDirect(true); > sigRefDic.setItem... > > // Add SigRef to Signature dictionary > COSArray array = new COSArray(); > array.setDirect(true); > array.add(dic); -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (PDFBOX-1802) COSArray setDirect(true) but array written indirect
[ https://issues.apache.org/jira/browse/PDFBOX-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cedomir Suljagic updated PDFBOX-1802: - Description: COSDictionary dic = new COSDictionary(); dic.setDirect(true); sigRefDic.setItem... // Add SigRef to Signature dictionary COSArray array = new COSArray(); array.setDirect(true); array.add(dic); > COSArray setDirect(true) but array written indirect > --- > > Key: PDFBOX-1802 > URL: https://issues.apache.org/jira/browse/PDFBOX-1802 > Project: PDFBox > Issue Type: Bug > Components: Writing >Affects Versions: 1.8.2 >Reporter: Cedomir Suljagic > Labels: cosarray, setdirect > > COSDictionary dic = new COSDictionary(); > dic.setDirect(true); > sigRefDic.setItem... > > // Add SigRef to Signature dictionary > COSArray array = new COSArray(); > array.setDirect(true); > array.add(dic); -- This message was sent by Atlassian JIRA (v6.1.4#6159)
Re: [DISCUSS] PDFParser
Hi,

On 07.12.2013 13:39, Maruan Sahyoun wrote:
> I (re-)started working on the new PDFParser. The PDFLexer as a foundation, together with some tests, is ready so far. It might need some more improvements moving forward.

Good news :-)

> I'm currently working on the first part of the parser implementation, which is a 'non caching' parser. It generates PD and COS level objects but only keeps the necessary minimum, e.g. Xref, Trailer .. but doesn't keep pages, resources … in memory. On top of that comes a "caching" parser which keeps what has been parsed. I don't know if that's doable, but the idea is that applications like merging or splitting PDFs could benefit from a 'non caching' parser.

Caching could be done using SoftReference, so it might not be necessary to have the extra level. Nevertheless, I can think of situations where the different behavior could be of benefit, so maybe the parser should be abstracted (interface etc.), allowing different implementations.

> The pure COS level parsing is done (e.g. generating a COS Dictionary from tokens), but some additional things are needed around higher level structures, e.g. linearized PDFs. Initially the parser reuses most of the existing classes where possible. Unfortunately, e.g. the COS level classes don't have a common set of methods for instantiating them. Question: Can we agree on how objects are instantiated, e.g. Obj.getInstance(token) or new Obj(token) ...

I don't have a specific preference, but the factory mentioned by Guillaume is a good idea.

> This only makes sense if the objects themselves, like pages or resources, can be fully cloned, so that cloned or imported objects no longer have a dependency on the original object. This could benefit PDF merging, as one could close a PDF that is no longer needed. This will affect the current PD Model, I think. Question: Can we already clone, and what needs to be done to achieve that? Could we do an importPage() so that the imported page is completely independent (and stored in memory or in a file based cache)?

I'm not sure, but I think a deep clone is not supported today.

> As the parser parses the PDF, I am thinking about firing events, e.g. to react to malformed PDFs. I consider this to be a better approach than overriding methods or putting workarounds into the core code.

I think the way to see what works best would be to take some workaround examples we (should) have now (e.g. finding the real object start by looking back/forth, determining the length of a stream, or even using information from scanning the file sequentially for object start points) and see how they could be realized with the event approach or another one. At least to me it seems that these workarounds need to work quite close to the parser, so in the case of events the handlers need access to low level functionality.

> What about setting up a sandbox to share some initial code without cluttering the current trunk?

A separate branch for developing the parser until it reaches a usable state would be good.

Best,
Timo

--
Timo Boehme
OntoChem GmbH
H.-Damerow-Str. 4
06120 Halle/Saale
T: +49 345 4780474
F: +49 345 4780471
timo.boe...@ontochem.com
_
OntoChem GmbH
Managing Director: Dr. Lutz Weber
Registered office: Halle / Saale
Register court: Stendal
Register number: HRB 215461
_
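The suggestion in the reply above, caching parsed objects through SoftReference instead of adding a separate caching parser level, might be sketched as follows. This is only an illustration under assumptions: `SoftObjectCache`, its `loads` counter and the loader function are hypothetical names and not part of PDFBox; java.lang.ref.SoftReference and java.util.function.Function are standard JDK classes (the latter requires Java 8).

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class SoftObjectCache<K, V> {
    private final Map<K, SoftReference<V>> entries = new HashMap<K, SoftReference<V>>();
    private final Function<K, V> loader; // hypothetical: re-parses the object on demand
    int loads = 0;                       // how often the loader had to run

    SoftObjectCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached object, reloading it if it was never loaded or has been collected. */
    V get(K key) {
        SoftReference<V> ref = entries.get(key);
        V value = ref == null ? null : ref.get();
        if (value == null) {
            value = loader.apply(key);
            loads++;
            entries.put(key, new SoftReference<V>(value));
        }
        return value;
    }

    public static void main(String[] args) {
        SoftObjectCache<Integer, String> cache =
                new SoftObjectCache<Integer, String>(key -> "object " + key);
        String first = cache.get(7);
        System.out.println(first);
        System.out.println(cache.get(7) == first); // still cached while 'first' is strongly reachable
    }
}
```

With this shape, the garbage collector may drop entries under memory pressure and the parser would re-read them from the file, which is roughly the trade-off between the "caching" and "non caching" variants discussed in the thread.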
[jira] [Updated] (PDFBOX-1802) COSArray setDirect(true) but array written indirect
[ https://issues.apache.org/jira/browse/PDFBOX-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cedomir Suljagic updated PDFBOX-1802: - Summary: COSArray setDirect(true) but array written indirect (was: COSArray setDirect(true) makes array indirect) > COSArray setDirect(true) but array written indirect > --- > > Key: PDFBOX-1802 > URL: https://issues.apache.org/jira/browse/PDFBOX-1802 > Project: PDFBox > Issue Type: Bug > Components: Writing >Affects Versions: 1.8.2 >Reporter: Cedomir Suljagic > Labels: cosarray, setdirect > -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Created] (PDFBOX-1802) COSArray setDirect(true) makes array indirect
Cedomir Suljagic created PDFBOX-1802: Summary: COSArray setDirect(true) makes array indirect Key: PDFBOX-1802 URL: https://issues.apache.org/jira/browse/PDFBOX-1802 Project: PDFBox Issue Type: Bug Components: Writing Affects Versions: 1.8.2 Reporter: Cedomir Suljagic -- This message was sent by Atlassian JIRA (v6.1.4#6159)