[jira] [Comment Edited] (PDFBOX-2998) Enhance the text extraction capabilities
[ https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946324#comment-14946324 ]

Andreas Meier edited comment on PDFBOX-2998 at 10/7/15 6:03 AM:

I just wanted to fuel the discussion with my snippet. My intention is not to provide code that breaks an already great extraction engine ;)

{quote}
I'd even start a step before that
{quote}

Depends on what is possible at the lower levels...
I don't know if I am the right person to take part in that discussion any further, but I will try to provide the "simple view" on a higher level, to address the problem:
- Might it be useful to hold some information like "(Hello World)" in a (meta-)information store, so that pdfbox can later take the single characters and form the word again? (No font type or size needed, just simple character matching based on position and rotation...)
- Would it make sense to check for font type and size and just handle cases like chemical names?

[~tboehme] are there any other reasons for different font/size in a word you know?

was (Author: andreasmeier):
I just wanted to fuel the discussion with my snippet. My intention is not to provide code that breaks an already great extraction engine ;)

{quote}
I'd even start a step before that
{quote}

Depends on what is possible at the lower levels...
I don't know if I am the right person to take part in that discussion any further, but I will try to provide the "simple view" on a higher level, to address the problem:
- Might it be useful to hold some information like "(Hello World)" in a (meta-)information store, so that pdfbox can later take the single characters and form the word again? (No font type or size needed, just simple character matching based on position and rotation...)
- Would it make sense to check for font type and size and just handle cases like checmical names ([~tboehme] are there any other reasons for different font/size in a word you know?)
> Enhance the text extraction capabilities
>
> Key: PDFBOX-2998
> URL: https://issues.apache.org/jira/browse/PDFBOX-2998
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Andreas Meier
> Attachments: TextBehindText.pdf
>
> PDFBox will need some -document layout analysis tools- enhancement to the current text extraction to extract text correctly.
> At the moment the text of a document is extracted using the position of single characters.
> This may lead to wrong results, due to the format of the file.
> There are good tools such as https://code.google.com/p/lapdftext which we could use to compare our current output.
> Possible enhancements are
> - enhance matching of text to a certain line i.e. don't mix up text from different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
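As a rough illustration of the "don't mix up text from different lines" item in PDFBOX-2998, the sketch below groups positioned characters into lines by their baseline before ordering them horizontally. It is not PDFBox code: the Glyph class, the toLines method and the tolerance parameter are hypothetical names invented for this example, and it assumes unrotated text with y growing from the top of the page downward.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class LineGrouping {
    /** Hypothetical positioned character: text plus the x/y of its baseline origin. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Groups glyphs whose baselines lie within the tolerance, then orders each line by x. */
    static List<String> toLines(List<Glyph> glyphs, float tolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : sorted) {
            List<Glyph> current = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            // Start a new line when the baseline is too far from the first glyph of the current line.
            if (current == null || Math.abs(g.y - current.get(0).y) > tolerance) {
                current = new ArrayList<>();
                lines.add(current);
            }
            current.add(g);
        }

        List<String> result = new ArrayList<>();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder sb = new StringBuilder();
            for (Glyph g : line) {
                sb.append(g.text);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("b", 5f, 10.3f),
                new Glyph("c", 0f, 30f),
                new Glyph("a", 0f, 10f),
                new Glyph("d", 6f, 29.8f));
        System.out.println(toLines(glyphs, 1f));
    }
}
```

Grouping by baseline first and sorting by x second keeps a character from a neighbouring line from slipping into the current one. Rotated and vertical text, which the ticket also lists, would need a different notion of "baseline" and are not covered here.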
[jira] [Commented] (PDFBOX-2998) Enhance the text extraction capabilities
[ https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946324#comment-14946324 ]

Andreas Meier commented on PDFBOX-2998:

I just wanted to fuel the discussion with my snippet. My intention is not to provide code that breaks an already great extraction engine ;)

{quote}
I'd even start a step before that
{quote}

Depends on what is possible at the lower levels...
I don't know if I am the right person to take part in that discussion any further, but I will try to provide the "simple view" on a higher level, to address the problem:
- Might it be useful to hold some information like "(Hello World)" in a (meta-)information store, so that pdfbox can later take the single characters and form the word again? (No font type or size needed, just simple character matching based on position and rotation...)
- Would it make sense to check for font type and size and just handle cases like chemical names ([~tboehme] are there any other reasons for different font/size in a word you know?)
[jira] [Commented] (PDFBOX-2340) Overhaul PDFBox Documentation
[ https://issues.apache.org/jira/browse/PDFBOX-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945842#comment-14945842 ]

ASF subversion and git services commented on PDFBOX-2340:

Commit 1707146 from [~msahyoun] in branch 'cmssite/trunk'
[ https://svn.apache.org/r1707146 ]

PDFBOX-2340: add info how to encrypt password

> Overhaul PDFBox Documentation
>
> Key: PDFBOX-2340
> URL: https://issues.apache.org/jira/browse/PDFBOX-2340
> Project: PDFBox
> Issue Type: Improvement
> Components: Documentation
> Affects Versions: 2.0.0
> Reporter: Maruan Sahyoun
> Assignee: Maruan Sahyoun
> Priority: Critical
> Attachments: Mockup-20140912.png, Mockup_Documentation.png
>
> In order to make it easier for users of PDFBox to work with the library there shall be an enhanced documentation consisting of an introduction, API references and more well documented examples and code snippets (Cookbook).
> In order to make it easier to contribute, the Cookbook shall be built automatically from the examples/snippet 'repository'.
[jira] [Commented] (PDFBOX-2340) Overhaul PDFBox Documentation
[ https://issues.apache.org/jira/browse/PDFBOX-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945815#comment-14945815 ]

ASF subversion and git services commented on PDFBOX-2340:

Commit 1707143 from [~msahyoun] in branch 'cmssite/trunk'
[ https://svn.apache.org/r1707143 ]

PDFBOX-2340: update layout
[jira] [Commented] (PDFBOX-2340) Overhaul PDFBox Documentation
[ https://issues.apache.org/jira/browse/PDFBOX-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945814#comment-14945814 ]

ASF subversion and git services commented on PDFBOX-2340:

Commit 1707142 from [~msahyoun] in branch 'cmssite/trunk'
[ https://svn.apache.org/r1707142 ]

PDFBOX-2340: update layout
[jira] [Updated] (PDFBOX-3008) Memory leak in preflight
[ https://issues.apache.org/jira/browse/PDFBOX-3008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-3008:

Description:
PreflightParser has this:
{code}
public PreflightParser(DataSource dataSource) throws IOException
{
    // TODO move file handling outside of the parser
    super(new RandomAccessBufferedFileInputStream(dataSource.getInputStream()));
    this.setLenient(false);
    this.originalDocument = dataSource;
}
{code}
The TODO message looks like a design issue, but it is much worse: the RandomAccessBufferedFileInputStream is never closed, which results in the temp file not being deleted. The file parameter constructor has the same problem, i.e. that the RandomAccessBufferedFileInputStream object is not closed (no temp file there).

was:
PreflightParser has this:
{code}
public PreflightParser(DataSource dataSource) throws IOException
{
    // TODO move file handling outside of the parser
    super(new RandomAccessBufferedFileInputStream(dataSource.getInputStream()));
    this.setLenient(false);
    this.originalDocument = dataSource;
}
{code}
The TODO message looks like a design issue, but it is much worse: the RandomAccessBufferedFileInputStream is never closed. The file parameter constructor has the same problem.

> Memory leak in preflight
>
> Key: PDFBOX-3008
> URL: https://issues.apache.org/jira/browse/PDFBOX-3008
> Project: PDFBox
> Issue Type: Bug
> Components: Preflight
> Affects Versions: 2.0.0
> Reporter: Tilman Hausherr
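The leak described in PDFBOX-3008 follows a common pattern: an object creates a closeable resource in its constructor and nobody closes it afterwards. The sketch below reproduces that pattern with JDK classes only. TempBackedSource, OwningParser and LeakSketch are hypothetical stand-ins written for this example, not the PDFBox classes, and the fix shown (the owner implements Closeable and forwards close()) is one possible approach rather than the change that was actually committed.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for a buffer that spools its data to a temp file. */
final class TempBackedSource implements Closeable {
    private final Path tempFile;

    TempBackedSource(byte[] data) throws IOException {
        tempFile = Files.createTempFile("sketch-", ".tmp");
        Files.write(tempFile, data);
    }

    Path tempFile() {
        return tempFile;
    }

    @Override
    public void close() throws IOException {
        // This cleanup never runs if close() is skipped, which is the leak.
        Files.deleteIfExists(tempFile);
    }
}

/** Hypothetical parser that owns the source handed to its constructor. */
final class OwningParser implements Closeable {
    private final TempBackedSource source;

    OwningParser(TempBackedSource source) {
        this.source = source;
    }

    Path backingFile() {
        return source.tempFile();
    }

    @Override
    public void close() throws IOException {
        source.close(); // the owner releases what was created for it
    }
}

final class LeakSketch {
    /** Returns true if the temp file still exists once the caller is done with the parser. */
    static boolean tempFileSurvives(byte[] data, boolean closeParser) {
        try {
            OwningParser parser = new OwningParser(new TempBackedSource(data));
            Path file = parser.backingFile();
            if (closeParser) {
                parser.close();
            }
            boolean survives = Files.exists(file);
            Files.deleteIfExists(file); // tidy up so the sketch itself leaves nothing behind
            return survives;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3};
        System.out.println("closed:     temp file survives = " + tempFileSurvives(data, true));
        System.out.println("not closed: temp file survives = " + tempFileSurvives(data, false));
    }
}
```

In application code a try-with-resources block around the owning object would be the idiomatic way to guarantee the close() call; the explicit flag is used here only to show both outcomes side by side.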
[jira] [Commented] (PDFBOX-2340) Overhaul PDFBox Documentation
[ https://issues.apache.org/jira/browse/PDFBOX-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945784#comment-14945784 ]

ASF subversion and git services commented on PDFBOX-2340:

Commit 1707141 from [~msahyoun] in branch 'cmssite/trunk'
[ https://svn.apache.org/r1707141 ]

PDFBOX-2340: document how to update the Javadoc
[jira] [Commented] (PDFBOX-2340) Overhaul PDFBox Documentation
[ https://issues.apache.org/jira/browse/PDFBOX-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945699#comment-14945699 ]

ASF subversion and git services commented on PDFBOX-2340:

Commit 1707134 from [~msahyoun] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1707134 ]

PDFBOX-2340: semi automate javadoc generation
[jira] [Commented] (PDFBOX-2852) Improve code quality (2)
[ https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945520#comment-14945520 ]

ASF subversion and git services commented on PDFBOX-2852:

Commit 1707114 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1707114 ]

PDFBOX-2852: improve javadoc

> Improve code quality (2)
>
> Key: PDFBOX-2852
> URL: https://issues.apache.org/jira/browse/PDFBOX-2852
> Project: PDFBox
> Issue Type: Task
> Affects Versions: 2.0.0
> Reporter: Tilman Hausherr
> Attachments: winansiencoding.patch, winansiencoding2.patch
>
> This is a long-term issue for the task to improve code quality, by using the [SonarQube report|https://analysis.apache.org/dashboard/index/org.apache.pdfbox:pdfbox-reactor], hints in different IDEs, the FindBugs tool and other code quality tools.
> This is a follow-up of PDFBOX-2576, which was getting too long.
[jira] [Updated] (PDFBOX-3007) Preflight cookbook example is inefficient
[ https://issues.apache.org/jira/browse/PDFBOX-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-3007:

Description:
The example shown in
http://pdfbox.apache.org/1.8/cookbook/pdfavalidation.html
passes a DataSource object. This results in the creation of a temporary file. The constructor with the DataSource only makes sense when working with URLs. (And that only if http is cached, because preflight does an openStream() for each PDF stream!)

It would be better to replace
{code}
FileDataSource fd = new FileDataSource(args[0]);
PreflightParser parser = new PreflightParser(fd);
{code}
with
{code}
PreflightParser parser = new PreflightParser(args[0]);
{code}

Edit: removed 2.0, as the example may have to change after solving PDFBOX-3007.

was:
The example shown in
http://pdfbox.apache.org/1.8/cookbook/pdfavalidation.html
passes a DataSource object. This results in the creation of a temporary file. The constructor with the DataSource only makes sense when working with URLs. (And that only if http is cached, because preflight does an openStream() for each PDF stream!)

It would be better to replace
{code}
FileDataSource fd = new FileDataSource(args[0]);
PreflightParser parser = new PreflightParser(fd);
{code}
with
{code}
PreflightParser parser = new PreflightParser(args[0]);
{code}

When working on that one, the example could also be copied to the 2.0 cookbook directory.

> Preflight cookbook example is inefficient
>
> Key: PDFBOX-3007
> URL: https://issues.apache.org/jira/browse/PDFBOX-3007
> Project: PDFBox
> Issue Type: Improvement
> Components: Documentation
> Affects Versions: 1.8.10, 1.8.11
> Reporter: Tilman Hausherr
> Priority: Minor
> Fix For: 1.8.11
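To illustrate why a stream-based input tends to cost a temporary file while a file path does not, here is a generic sketch that uses only JDK classes. It does not show PDFBox or preflight internals; RandomAccessSketch and its methods are hypothetical names for this example. The assumption is simply that a parser needs random access, which a plain InputStream cannot offer without first being spooled somewhere.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class RandomAccessSketch {
    /** Stream-only input: the data is copied to a temp file before it can be read at an offset. */
    static int byteAtViaStream(InputStream in, long offset) {
        try {
            Path tmp = Files.createTempFile("spool-", ".tmp");
            try {
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                return byteAt(tmp, offset);
            } finally {
                Files.deleteIfExists(tmp); // the extra copy also has to be cleaned up
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** File input: the existing file is opened directly, no copy is made. */
    static int byteAtViaFile(Path file, long offset) {
        try {
            return byteAt(file, offset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static int byteAt(Path file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            return raf.read();
        }
    }

    public static void main(String[] args) {
        byte[] data = {10, 20, 30};
        System.out.println(byteAtViaStream(new ByteArrayInputStream(data), 2));
    }
}
```

Both paths return the same byte; the difference is that the stream variant pays for a full copy and must remember to delete it, which is the overhead the ticket wants the cookbook example to avoid.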
[jira] [Updated] (PDFBOX-3007) Preflight cookbook example is inefficient
[ https://issues.apache.org/jira/browse/PDFBOX-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-3007:

Affects Version/s: (was: 2.0.0)
[jira] [Updated] (PDFBOX-3007) Preflight cookbook example is inefficient
[ https://issues.apache.org/jira/browse/PDFBOX-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-3007:

Fix Version/s: (was: 2.0.0)
[jira] [Commented] (PDFBOX-2252) PDFTextStripper has problem with documents with mixed language directions
[ https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945490#comment-14945490 ]

Maruan Sahyoun commented on PDFBOX-2252:

The effort you put into that is of great help. No need to be sorry that it takes a little longer.

> PDFTextStripper has problem with documents with mixed language directions
>
> Key: PDFBOX-2252
> URL: https://issues.apache.org/jira/browse/PDFBOX-2252
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.6, 2.0.0
> Reporter: Amir
> Assignee: Maruan Sahyoun
> Priority: Critical
> Fix For: 2.1.0
> Attachments: BidiMirroring.txt, IsMirroredDeviations.txt, PDFTextStripper-201709271718.patch, PDFTextStripper-201709272018.patch, PDFTextStripper.java.patch, PDFTextStripper.java.patch, atest.pdf, bugzilla867751.pdf, overlap.jpg, pdfs_directionality.xlsx, pdfs_directionality3.xlsx, test.pdf, wikipedia_dl_lyric_test.pdf
>
> When the input document of PDFTextStripper is a combination of right-to-left and left-to-right languages, the output characters of one language are reversed.
> A sample bilingual pdf document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, which is defined as follows: boolean isRtlDominant = rtlCount > ltrCount;
> This class clearly counts the number of RTL characters and decides whether the whole content should be reversed or not. That is not correct: it must operate on each word, not the whole document.
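The per-word decision that the PDFBOX-2252 description asks for could look roughly like the sketch below. It is not the PDFTextStripper code or any of the attached patches: PerWordDirection, isRtl and reverseRtlWords are hypothetical names, and the logic is a deliberate simplification that applies the same "rtlCount > ltrCount" test from the ticket to each space-separated word instead of to the whole page. A complete solution would follow the Unicode Bidirectional Algorithm (the JDK exposes an implementation as java.text.Bidi).

```java
final class PerWordDirection {
    /** True if the word contains more right-to-left than left-to-right characters. */
    static boolean isRtl(String word) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            byte dir = Character.getDirectionality(cp);
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtlCount++;
            } else if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
            i += Character.charCount(cp);
        }
        return rtlCount > ltrCount;
    }

    /** Reverses only the words judged to be right-to-left; all other words are left untouched. */
    static String reverseRtlWords(String line) {
        String[] words = line.split(" ", -1);
        for (int i = 0; i < words.length; i++) {
            if (isRtl(words[i])) {
                words[i] = new StringBuilder(words[i]).reverse().toString();
            }
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // Two Hebrew letters (bet, alef) between two Latin words.
        String mixed = "abc \u05D1\u05D0 def";
        System.out.println(reverseRtlWords(mixed));
    }
}
```

With a page-level flag, either the Latin words or the Hebrew word in the sample would come out reversed depending on which script dominates; deciding per word leaves "abc" and "def" alone and touches only the Hebrew token. Numbers, punctuation and mirrored characters are not handled by this sketch.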
[jira] [Commented] (PDFBOX-2252) PDFTextStripper has problem with documents with mixed language directions
[ https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945494#comment-14945494 ]

Maruan Sahyoun commented on PDFBOX-2252:

Thanks a lot - very valuable resource to look for test candidates
[jira] [Created] (PDFBOX-3008) Memory leak in preflight
Tilman Hausherr created PDFBOX-3008:

Summary: Memory leak in preflight
Key: PDFBOX-3008
URL: https://issues.apache.org/jira/browse/PDFBOX-3008
Project: PDFBox
Issue Type: Bug
Components: Preflight
Affects Versions: 2.0.0
Reporter: Tilman Hausherr

PreflightParser has this:
{code}
public PreflightParser(DataSource dataSource) throws IOException
{
    // TODO move file handling outside of the parser
    super(new RandomAccessBufferedFileInputStream(dataSource.getInputStream()));
    this.setLenient(false);
    this.originalDocument = dataSource;
}
{code}
The TODO message looks like a design issue, but it is much worse: the RandomAccessBufferedFileInputStream is never closed. The file parameter constructor has the same problem.
[jira] [Commented] (PDFBOX-2852) Improve code quality (2)
[ https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945474#comment-14945474 ]

ASF subversion and git services commented on PDFBOX-2852:

Commit 1707110 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1707110 ]

PDFBOX-2852: improve javadoc; rename parameter; simplify code
[jira] [Commented] (PDFBOX-2852) Improve code quality (2)
[ https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945458#comment-14945458 ]

ASF subversion and git services commented on PDFBOX-2852:

Commit 1707105 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1707105 ]

PDFBOX-2852: improve javadoc
[jira] [Commented] (PDFBOX-2852) Improve code quality (2)
[ https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945444#comment-14945444 ]

ASF subversion and git services commented on PDFBOX-2852:

Commit 1707098 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1707098 ]

PDFBOX-2852: remove closeQuietly that is done in finally; slight reformat
[jira] [Commented] (PDFBOX-2852) Improve code quality (2)
[ https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945440#comment-14945440 ]

ASF subversion and git services commented on PDFBOX-2852:

Commit 1707096 from [~msahyoun] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1707096 ]

PDFBOX-2852: correct javadoc
[jira] [Updated] (PDFBOX-2252) PDFTextStripper has problem with documents with mixed language directions
[ https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated PDFBOX-2252:

Attachment: pdfs_directionality3.xlsx

Slightly updated full run (without OOM). I selected records with > 30 LTR and > 30 RTL tokens.
[jira] [Commented] (PDFBOX-2252) PDFTextStripper has problem with documents with mixed language directions
[ https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945375#comment-14945375 ]

Tim Allison commented on PDFBOX-2252:

Sounds good to me. Given current workload on other stuff, I doubt I'll have a chance to finish major regression testing before Friday, and it may have to go into next week. :(
[jira] [Updated] (PDFBOX-3007) Preflight cookbook example is inefficient
[ https://issues.apache.org/jira/browse/PDFBOX-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-3007:
    Priority: Minor (was: Major)

> Preflight cookbook example is inefficient
>
> Key: PDFBOX-3007
> URL: https://issues.apache.org/jira/browse/PDFBOX-3007
> Project: PDFBox
> Issue Type: Improvement
> Components: Documentation
> Affects Versions: 1.8.10, 1.8.11, 2.0.0
> Reporter: Tilman Hausherr
> Priority: Minor
> Fix For: 1.8.11, 2.0.0
>
> The example shown in
> http://pdfbox.apache.org/1.8/cookbook/pdfavalidation.html
> passes a DataSource object. This results in the creation of a temporary file.
> The constructor with the DataSource only makes sense when working with URLs. (And that only if HTTP is cached, because preflight does an openStream() for each PDF stream!)
> It would be better to replace
> {code}
> FileDataSource fd = new FileDataSource(args[0]);
> PreflightParser parser = new PreflightParser(fd);
> {code}
> with
> {code}
> PreflightParser parser = new PreflightParser(args[0]);
> {code}
> When working on that one, the example could also be copied to the 2.0 cookbook directory.
[jira] [Updated] (PDFBOX-3007) Preflight cookbook example is inefficient
[ https://issues.apache.org/jira/browse/PDFBOX-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-3007:
    Issue Type: Improvement (was: Bug)
[jira] [Created] (PDFBOX-3007) Preflight cookbook example is inefficient
Tilman Hausherr created PDFBOX-3007:

    Summary: Preflight cookbook example is inefficient
    Key: PDFBOX-3007
    URL: https://issues.apache.org/jira/browse/PDFBOX-3007
    Project: PDFBox
    Issue Type: Bug
    Components: Documentation
    Affects Versions: 1.8.10, 1.8.11, 2.0.0
    Reporter: Tilman Hausherr
    Fix For: 1.8.11, 2.0.0

The example shown in
http://pdfbox.apache.org/1.8/cookbook/pdfavalidation.html
passes a DataSource object. This results in the creation of a temporary file.
The constructor with the DataSource only makes sense when working with URLs. (And that only if HTTP is cached, because preflight does an openStream() for each PDF stream!)
It would be better to replace
{code}
FileDataSource fd = new FileDataSource(args[0]);
PreflightParser parser = new PreflightParser(fd);
{code}
with
{code}
PreflightParser parser = new PreflightParser(args[0]);
{code}
When working on that one, the example could also be copied to the 2.0 cookbook directory.
[jira] [Commented] (PDFBOX-2755) Support filling hybrid PDF forms
[ https://issues.apache.org/jira/browse/PDFBOX-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945289#comment-14945289 ]

Maruan Sahyoun commented on PDFBOX-2755:

I've changed the ticket title to better describe the issue.

There are 3 types of interactive PDF forms:
- classic AcroForms, i.e. the form contains a /Fields array
- dynamic XFA forms, i.e. the form contains no or an empty /Fields array and there is an XFA entry
- hybrid forms, i.e. the form contains a /Fields array and an XFA entry

For hybrid forms we currently update only the /Fields but not the data contained in the XFA. When an XFA-aware reader such as Adobe Reader opens the form, it looks for the current data in the XFA and NOT in the /Fields. An XFA-unaware reader looks at the /Fields content.

So, in order to properly support hybrid forms for XFA-aware readers, we need to update the XFA data in addition to the /Fields values. Currently, XFA handling is only implemented to allow one to extract (externally update) and (re-)set the XFA content.

[~lehmi] Many of the XFA PDFs have additional usage rights applied, which means that we need the extended incremental update functionality to support them properly. Otherwise we can update the document, but the user gets an error that the document has been modified and the usage rights are removed.

> Support filling hybrid PDF forms
>
> Key: PDFBOX-2755
> URL: https://issues.apache.org/jira/browse/PDFBOX-2755
> Project: PDFBox
> Issue Type: Improvement
> Components: AcroForm
> Affects Versions: 1.8.9, 1.8.10, 2.0.0
> Reporter: hui xu
> Assignee: Maruan Sahyoun
> Fix For: 2.1.0
>
> Attachments: formtestFailed.pdf, formtestOK.pdf
>
> Ran SetField.java in package org.apache.pdfbox.examples.fdf; the field value is changed but cannot be saved in the PDF file.
> I added 2 more lines of code:
> List newList = new ArrayList();
> .
> acroForm.setFields(newList);
> The PDF was saved with the change. But when reloading the saved PDF file and trying to reset some values, it throws a NullPointerException; the file can't getFields().
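The three form types listed in the comment above can be written down as a small decision table. The following is only a sketch under stated assumptions: `FormKindSketch`, `FormKind`, `classify` and `mustUpdateXfa` are hypothetical names, not PDFBox API, and how the number of /Fields entries and the presence of an XFA entry are read from a document is deliberately left out.

```java
// Hypothetical helper, not part of PDFBox.
final class FormKindSketch {

    enum FormKind { CLASSIC_ACROFORM, DYNAMIC_XFA, HYBRID, NO_FORM_DATA }

    /**
     * Classifies a form from two facts about it.
     *
     * @param fieldCount number of entries in the /Fields array (0 if missing or empty)
     * @param hasXfa     whether the form has an XFA entry
     */
    static FormKind classify(int fieldCount, boolean hasXfa) {
        boolean hasFields = fieldCount > 0;
        if (hasFields && hasXfa) {
            return FormKind.HYBRID;
        }
        if (hasXfa) {
            return FormKind.DYNAMIC_XFA;
        }
        if (hasFields) {
            return FormKind.CLASSIC_ACROFORM;
        }
        return FormKind.NO_FORM_DATA;
    }

    /**
     * For hybrid forms, updating /Fields alone is not enough for XFA-aware readers,
     * so the XFA data would have to be updated as well.
     */
    static boolean mustUpdateXfa(FormKind kind) {
        return kind == FormKind.HYBRID;
    }
}
```

The `NO_FORM_DATA` case is an addition for completeness; the comment itself only names the other three.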
[jira] [Commented] (PDFBOX-2755) Support filling hybrid PDF forms
[ https://issues.apache.org/jira/browse/PDFBOX-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945300#comment-14945300 ]

Maruan Sahyoun commented on PDFBOX-2755:

As a side note, PDF 2.0 (in the current draft) deprecates XFA. NeedAppearances has also changed, in that it demands that a form-filling application shall update the field's appearance and not depend on the (Adobe) reader to construct it by setting the flag. The same applies to annotations.
[jira] [Updated] (PDFBOX-2755) Support filling hybrid PDF forms
[ https://issues.apache.org/jira/browse/PDFBOX-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maruan Sahyoun updated PDFBOX-2755:
    Summary: Support filling hybrid PDF forms (was: Can't save the change to pdf file)
[jira] [Updated] (PDFBOX-2755) Support filling hybrid PDF forms
[ https://issues.apache.org/jira/browse/PDFBOX-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maruan Sahyoun updated PDFBOX-2755:
    Issue Type: Improvement (was: Bug)
[jira] [Updated] (PDFBOX-2755) Can't save the change to pdf file
[ https://issues.apache.org/jira/browse/PDFBOX-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maruan Sahyoun updated PDFBOX-2755:
    Affects Version/s: 2.0.0, 1.8.10
[jira] [Assigned] (PDFBOX-2755) Can't save the change to pdf file
[ https://issues.apache.org/jira/browse/PDFBOX-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maruan Sahyoun reassigned PDFBOX-2755:
    Assignee: Maruan Sahyoun
[jira] [Commented] (PDFBOX-2998) Enhance the text extraction capabilities
[ https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945260#comment-14945260 ]

Maruan Sahyoun commented on PDFBOX-2998:

Thanks for the clarification, and agreed that this is a simple example. OTOH, as some characteristics are constant for a single text-showing operator, keeping the information that this was a single operation might be beneficial for improving the text extraction.

Whatever changes we look into, we need a good set of samples to verify the ideas. The text extraction we have is already at a good level.

> Enhance the text extraction capabilities
>
> Key: PDFBOX-2998
> URL: https://issues.apache.org/jira/browse/PDFBOX-2998
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Andreas Meier
> Attachments: TextBehindText.pdf
>
> PDFBox will need some -document layout analysis tools- enhancement to the current text extraction to extract text correctly.
> At the moment the text of a document is extracted using the position of single characters.
> This may lead to wrong results, due to the format of the file.
> There are good tools such as https://code.google.com/p/lapdftext which we could use to compare our current output.
> Possible enhancements are
> - enhance matching of text to a certain line i.e. don't mix up text from different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets
[jira] [Updated] (PDFBOX-2755) Support filling hybrid PDF forms
[ https://issues.apache.org/jira/browse/PDFBOX-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maruan Sahyoun updated PDFBOX-2755:
    Priority: Major (was: Critical)
[jira] [Updated] (PDFBOX-2755) Can't save the change to pdf file
[ https://issues.apache.org/jira/browse/PDFBOX-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maruan Sahyoun updated PDFBOX-2755:
    Fix Version/s: 2.1.0
[jira] [Commented] (PDFBOX-2998) Enhance the text extraction capabilities
[ https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945139#comment-14945139 ]

Timo Boehme commented on PDFBOX-2998:

The problem is that, with the character spacing set to fancy values, even a {{(oW)}} won't be one word; instead each character belongs to a separate word (other example: 'not a word' with (taw) in one chunk). Thus, while in most cases applications might group words correctly, there is unfortunately a not too small number of 'misuses', which are not easy to detect.

So the most general solution is to first separate everything into single characters with correct positions and group them using the same algorithm, independent of what the provided text chunk was. Text direction should be respected in every case, while font (name/size) might work in most cases, but there are cases with mixed fonts/sizes within the same 'word' (chemical names using Greek characters etc.).
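The "split into single characters, then group by position" approach from the comment above could look roughly like this. It is a minimal sketch under several assumptions: `GlyphGroupingSketch`, `Glyph` and `groupIntoWords` are hypothetical names (not PDFBox classes), all glyphs are assumed to lie on one horizontal left-to-right line, and the gap threshold is an arbitrary factor of the previous glyph's width.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class GlyphGroupingSketch {

    /** One character with its horizontal start position and advance width. */
    static final class Glyph {
        final char ch;
        final float x;
        final float width;

        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Groups glyphs into words purely by position, ignoring which text chunk
     * they came from. A new word starts when the gap to the previous glyph
     * exceeds maxGapFactor times the previous glyph's width.
     */
    static List<String> groupIntoWords(List<Glyph> glyphs, float maxGapFactor) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));

        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : sorted) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > maxGapFactor * prev.width) {
                    words.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(g.ch);
            prev = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

With this grouping, two characters that were shown by one operator but spaced far apart end up in different words, while adjacent characters from different operators are joined. Font name and size are deliberately not part of the decision here.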
[jira] [Commented] (PDFBOX-2998) Enhance the text extraction capabilities
[ https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945066#comment-14945066 ]

Maruan Sahyoun commented on PDFBOX-2998:

I'd even start a step before that - and people like [~jahewson] or [~lehmi] have more insight to answer that - and look at why a simple PDF with the following text content
{code}
BT
/F1 12 Tf
100 700 Td
(Hello World) Tj
ET
{code}
produces 11 TextPositions. What are the benefits of that? What are the drawbacks?

As a quick note: as we are trying to get PDFBox 2.0.0 out and I don't want to create new regressions, it's unlikely that I will make further changes to the text extraction at this point in time. The initial Bidi support we have now because of your efforts is under test and, if that passes, will be part of PDFBox 2.0.0. That doesn't mean that we shouldn't continue the discussion. As text extraction is used a lot, it's quite sensitive to changes.
[jira] [Comment Edited] (PDFBOX-2998) Enhance the text extraction capabilities
[ https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944999#comment-14944999 ]

Andreas Meier edited comment on PDFBOX-2998 at 10/6/15 1:14 PM:

The question is: when does a group of TextPositions form a word? My first thought is the location of the TextPosition, but it also depends on the font and the size of the font.
In my opinion we can achieve a lot if we just enhance the current code by adding some checks for font type and font size. There may be many other boundary conditions; if you know of any more, let me know.
So one possible enhancement would be to check that in the sorting algorithm:
{code:title=TextPositionComparator.java|borderStyle=solid}
...
public class TextPositionComparator implements Comparator<TextPosition>
{
    @Override
    public int compare(TextPosition pos1, TextPosition pos2)
    {
        // only compare text that is in the same direction
        if (pos1.getDir() < pos2.getDir())
        {
            return -1;
        }
        else if (pos1.getDir() > pos2.getDir())
        {
            return 1;
        }

        // FONT TYPE AND FONT SIZE CHECK START
        if (pos1.getFontSize() != pos2.getFontSize()
                || !pos1.getFont().getName().equals(pos2.getFont().getName()))
        {
            return -1;
        }
        // FONT TYPE AND FONT SIZE CHECK END

        // get the text direction adjusted coordinates
        float x1 = pos1.getXDirAdj();
        float x2 = pos2.getXDirAdj();
        float pos1YBottom = pos1.getYDirAdj();
        float pos2YBottom = pos2.getYDirAdj();
...
{code}
(BTW, if you wonder why the code snippet checks the font name and not the font itself: some blanks will not be represented in the ToUnicode tables of the PDF. Therefore the PDFBox fallback solution is used, which uses other fonts for the missing characters.)
Please correct me if I am wrong; this is just a simple-minded idea that might work for some cases but break others.
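One caveat about a font check that returns -1 whenever font data differs: `java.util.Comparator` expects `sgn(compare(a, b)) == -sgn(compare(b, a))`, and an unconditional -1 answers "a before b" in both directions, which sorting code may reject or handle unpredictably. The sketch below illustrates this with a stand-in class. `FontAwareOrderSketch` and `Pos` are hypothetical, reduced to four fields and a single axis, and are not the PDFBox `TextPosition` API. The symmetric variant is just one possible ordering, not a recommendation for the extraction engine.

```java
import java.util.Comparator;

// Hypothetical stand-in, not part of PDFBox.
final class FontAwareOrderSketch {

    /** Reduced stand-in for a text position: direction, font data, x coordinate. */
    static final class Pos {
        final float dir;
        final float fontSize;
        final String fontName;
        final float x;

        Pos(float dir, float fontSize, String fontName, float x) {
            this.dir = dir;
            this.fontSize = fontSize;
            this.fontName = fontName;
            this.x = x;
        }
    }

    /** Mirrors the proposed check: -1 whenever the font data differs. */
    static final Comparator<Pos> PROPOSED = (a, b) -> {
        if (a.dir < b.dir) {
            return -1;
        }
        if (a.dir > b.dir) {
            return 1;
        }
        if (a.fontSize != b.fontSize || !a.fontName.equals(b.fontName)) {
            return -1; // same answer for (a, b) and (b, a)
        }
        return Float.compare(a.x, b.x);
    };

    /** Orders font differences symmetrically, then falls back to position. */
    static final Comparator<Pos> CONSISTENT =
            Comparator.<Pos>comparingDouble(p -> p.dir)
                    .thenComparingDouble(p -> p.fontSize)
                    .thenComparing(p -> p.fontName)
                    .thenComparingDouble(p -> p.x);
}
```

For two positions that differ only in font size, `PROPOSED` returns a negative value in both argument orders, while `CONSISTENT` returns values of opposite sign.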
[jira] [Commented] (PDFBOX-2998) Enhance the text extraction capabilities
[ https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944999#comment-14944999 ]

Andreas Meier commented on PDFBOX-2998:
---

The question is when a group of text positions forms a word. My first thought is the location of the text position, but it also depends on the font and the font size.
In my opinion we can achieve a lot if we just enhance the current code by adding some checks for font type and font size.
So one possible enhancement would be to check that in the sorting algorithm:

{code:title=TextPositionComparator.java|borderStyle=solid}
...
public class TextPositionComparator implements Comparator<TextPosition>
{
    @Override
    public int compare(TextPosition pos1, TextPosition pos2)
    {
        // only compare text that is in the same direction
        if (pos1.getDir() < pos2.getDir())
        {
            return -1;
        }
        else if (pos1.getDir() > pos2.getDir())
        {
            return 1;
        }

        // FONT TYPE AND FONT SIZE CHECK START
        if (pos1.getFontSize() != pos2.getFontSize() || !pos1.getFont().getName().equals(pos2.getFont().getName()))
        {
            return -1;
        }
        // FONT TYPE AND FONT SIZE CHECK END

        // get the text direction adjusted coordinates
        float x1 = pos1.getXDirAdj();
        float x2 = pos2.getXDirAdj();

        float pos1YBottom = pos1.getYDirAdj();
        float pos2YBottom = pos2.getYDirAdj();
...
{code}

Please correct me if I am wrong; this is just a simple-minded idea that might work for some cases, but break others.

> Enhance the text extraction capabilities
>
>                 Key: PDFBOX-2998
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2998
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Meier
>         Attachments: TextBehindText.pdf
>
>
> PDFBox will need some -document layout analysis tools- enhancement to the current text extraction to extract text correctly.
> At the moment the text of a document is extracted using the position of single characters.
> This may lead to wrong results, depending on the format of the file.
> There are good tools such as https://code.google.com/p/lapdftext which we could use to compare our current output.
> Possible enhancements are
> - enhance matching of text to a certain line, i.e. don't mix up text from different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
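One caveat on the comparator idea above: when the fonts differ, the added check returns -1 for both compare(a, b) and compare(b, a), which java.util.Comparator does not permit, so sorting with it may give unstable results or fail. A possible alternative, shown here only as a minimal sketch, is to leave the position-based sort untouched and apply the font check in a separate word-grouping pass. Everything in the sketch is an assumption for illustration: the Glyph class is a hypothetical stand-in for the few TextPosition properties used, and groupIntoWords and maxGap are made-up names, not PDFBox API.

```java
import java.util.ArrayList;
import java.util.List;

public class FontAwareWordGrouper
{
    /** Hypothetical stand-in for the TextPosition properties used in this sketch. */
    public static final class Glyph
    {
        final String text;
        final String fontName;
        final float fontSize;
        final float x;      // left edge, assumed to be direction-adjusted
        final float width;

        public Glyph(String text, String fontName, float fontSize, float x, float width)
        {
            this.text = text;
            this.fontName = fontName;
            this.fontSize = fontSize;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Splits the glyphs of one line, already sorted by x, into words.
     * A new word starts when the gap to the previous glyph exceeds maxGap
     * or when the font name or font size changes.
     */
    public static List<String> groupIntoWords(List<Glyph> sortedLine, float maxGap)
    {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph glyph : sortedLine)
        {
            if (previous != null
                    && (gapTooLarge(previous, glyph, maxGap) || fontChanged(previous, glyph)))
            {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(glyph.text);
            previous = glyph;
        }
        if (current.length() > 0)
        {
            words.add(current.toString());
        }
        return words;
    }

    /** True if the horizontal distance between the two glyphs is larger than maxGap. */
    static boolean gapTooLarge(Glyph a, Glyph b, float maxGap)
    {
        return b.x - (a.x + a.width) > maxGap;
    }

    /** True if the two glyphs differ in font name or font size. */
    static boolean fontChanged(Glyph a, Glyph b)
    {
        return Float.compare(a.fontSize, b.fontSize) != 0 || !a.fontName.equals(b.fontName);
    }
}
```

A font-based split like this would also break up words that legitimately mix fonts or sizes, so it would probably have to be optional or combined with further heuristics.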
[jira] [Commented] (PDFBOX-2755) Can't save the change to pdf file
[ https://issues.apache.org/jira/browse/PDFBOX-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944850#comment-14944850 ]

Andreas Lehmkühler commented on PDFBOX-2755:

Is there any TODO left for us here?

> Can't save the change to pdf file
>
>                 Key: PDFBOX-2755
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2755
>             Project: PDFBox
>          Issue Type: Bug
>          Components: AcroForm
>    Affects Versions: 1.8.9
>            Reporter: hui xu
>            Priority: Critical
>         Attachments: formtestFailed.pdf, formtestOK.pdf
>
>
> Ran SetField.java in package org.apache.pdfbox.examples.fdf; the field value is changed but cannot be saved in the PDF file.
> I added 2 more lines of code:
>     List newList = new ArrayList();
>     .
>     acroForm.setFields(newList);
> The PDF was saved with the change. But after reloading the saved PDF file and trying to reset some values, it throws a NullPointerException; the file can't getFields().
Re: Apache PDFBox October 2015 board report due
> Maruan Sahyoun wrote on 6 October 2015 at 08:18:
>
> Hi,
>
> > On 06.10.2015 at 08:07, Andreas Lehmkuehler wrote:
> >
> > On 05.10.2015 at 23:02, Maruan Sahyoun wrote:
> >> Hi,
> >>
> >> +1
> >>
> >> One thing we might want to address is the large number of emails to dev
> >> because of the commit etc. stuff.
> > Hmmm, I'm not sure that I've got your point. Do you want to explain the high
> > number of mails on dev@ compared to users@?
>
> yes - if you remove the commit messages from dev, the traffic on users is
> higher (which is good)

There are many other projects using the dev list the same way, so we don't
have to explain that explicitly.

BR
Andreas

> Maruan
>
> >> Maruan
> >>
> >>> On 05.10.2015 at 19:47, Andreas Lehmkuehler wrote:
> >>>
> >>> Hi,
> >>>
> >>> find attached a quick draft of the board report we're expected to submit
> >>> this month. It's based upon the report template which can be found at [1]
> >>>
> >>> Any further comments, objections or additions?
> >>>
> >>>
> >>> Report from the Apache PDFBox committee [Andreas Lehmkühler]
> >>>
> >>> ## Description:
> >>> The Apache PDFBox library is an open source Java tool for working with
> >>> PDF documents.
> >>>
> >>> ## Activity:
> >>> - after a long time of hard work we decided to cut a release candidate for
> >>>   2.0.0 this October. As we are down to 6 open tickets I'm quite optimistic
> >>>   that it'll really come true
> >>> - we joined forces with Tim Allison from Apache Tika to run some bulk tests
> >>>   from time to time to avoid regressions
> >>>
> >>> ## Health report:
> >>> - there is a steady stream of contributions, bug reports and questions on
> >>>   the mailing lists
> >>> - the core team consists of 4 - 5 active developers
> >>> - we expect to attract more people once our new major release is out of
> >>>   the door
> >>>
> >>> ## Issues:
> >>> - there are no issues requiring board attention at this time
> >>>
> >>> ## PMC changes:
> >>>
> >>> - Currently 16 PMC members.
> >>> - No new PMC members added in the last 3 months
> >>> - Last PMC addition was John Hewson on Thu Feb 06 2014
> >>>
> >>> ## LDAP changes:
> >>>
> >>> - Currently 16 committers and 16 committee group members.
> >>> - No new committee group members added in the last 3 months
> >>> - No new committers added in the last 3 months
> >>> - Last committer addition was John Hewson on Fri Feb 07 2014
> >>>
> >>> ## Releases:
> >>>
> >>> - 1.8.10 was released on Wed Jul 22 2015
> >>>
> >>> ## Mailing list activity:
> >>>
> >>> - us...@pdfbox.apache.org:
> >>>    - 497 subscribers (up 6 in the last 3 months)
> >>>    - 519 emails sent to list (578 in previous quarter)
> >>>
> >>> - dev@pdfbox.apache.org:
> >>>    - 145 subscribers (down 4 in the last 3 months)
> >>>    - 2932 emails sent to list (2594 in previous quarter)
> >>>
> >>>
> >>> ## JIRA activity:
> >>>
> >>> - 151 JIRA tickets created in the last 3 months
> >>> - 143 JIRA tickets closed/resolved in the last 3 months
> >>>
> >>>
> >>> BR
> >>> Andreas Lehmkühler
> >>>
> >>> [1] https://reporter.apache.org/?pdfbox
[jira] [Commented] (PDFBOX-3005) Incorrect property names for lists
[ https://issues.apache.org/jira/browse/PDFBOX-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944721#comment-14944721 ]

Maruan Sahyoun commented on PDFBOX-3005:

Thanks for your feedback. The check for regressions will take a little longer as the list of files is huge.

> Incorrect property names for lists
> --
>
>                 Key: PDFBOX-3005
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3005
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Writing, XmpBox
>    Affects Versions: 2.0.0
>            Reporter: Evgeniy Muravitskiy
>            Assignee: Maruan Sahyoun
>             Fix For: 2.0.0
>
>         Attachments: isis09_04.pdf, restest.xml
>
>
> When I write code as follows:
> {code}
> PDDocument document = PDDocument.load(new File(FILE_PATH));
> DomXmpParser parser = new DomXmpParser();
> XMPMetadata metadata = parser.parse(document.getDocumentCatalog().getMetadata().getStream().getUnfilteredStream());
> metadata.removeSchema(metadata.getPDFIdentificationSchema());
> OutputStream res = new FileOutputStream(RESULT_XML);
> new XmpSerializer().serialize(metadata, res, true);
> {code}
> I get XML which contains the following tag:
> {code}
>
>
> Tomioka, Satoshi
>
>
> {code}
> but instead of {{rdf:creator}} it must be {{rdf:li}}. This problem is also reproducible for other DublinCoreSchema properties which contain lists.
[jira] [Commented] (PDFBOX-3005) Incorrect property names for lists
[ https://issues.apache.org/jira/browse/PDFBOX-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944671#comment-14944671 ]

Evgeniy Muravitskiy commented on PDFBOX-3005:

Yes, your changes fix the problem. Thanks.
Re: Apache PDFBox October 2015 board report due
Hi,

+1

Best,
Timo

--
Timo Boehme
OntoChem IT Solutions GmbH
Blücherstraße 24
06120 Halle (Saale)
Germany
phone: +49 345 478 047 4 | fax: +49 345 478 047 1
email: ulf.la...@ontochem.com | web: www.ontochem.com
HRB 21962 Amtsgericht Stendal | USt-IdNr.: DE815563824
managing director : Lutz Weber