[jira] [Comment Edited] (PDFBOX-2998) Enhance the text extraction capabilities

2015-10-06 Thread Andreas Meier (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946324#comment-14946324
 ] 

Andreas Meier edited comment on PDFBOX-2998 at 10/7/15 6:03 AM:


I just wanted to fuel the discussion with my snippet.
My intention is not to provide code that breaks an already great extraction 
engine ;)

{quote}
I'd even start a step before that
{quote}
Depends on what is possible at the lower levels...

I don't know if I am the right person to take part in that discussion any 
further, but I will try to provide the "simple view" on a higher level, to 
address the problem:
 
- Might it be useful to hold some information like "(Hello World)" in a 
(meta-)information store, so that PDFBox can later take the single characters 
and form the word again? (No font type or size needed, just simple character 
matching based on position and rotation...)
- Would it make sense to check the font type and size and just handle cases 
like chemical names? [~tboehme], do you know of any other reasons for a 
different font or size within a word?
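
To make the first idea concrete, here is a minimal sketch of position-based word forming. It is not PDFBox code: the Glyph and WordGrouper classes, their fields and the gapTolerance parameter are all hypothetical names chosen for illustration. Font type and size are ignored on purpose, so a subscript in a chemical name does not split the word as long as it stays within the baseline tolerance.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical glyph record: one extracted character with its position. */
final class Glyph {
    final char ch;
    final float x;       // left edge of the glyph
    final float y;       // baseline position
    final float width;   // advance width
    final int rotation;  // text rotation in degrees (0, 90, 180, 270)

    Glyph(char ch, float x, float y, float width, int rotation) {
        this.ch = ch;
        this.x = x;
        this.y = y;
        this.width = width;
        this.rotation = rotation;
    }
}

final class WordGrouper {
    // Assumed tolerance for "same baseline"; a real value would depend on font size.
    private static final float BASELINE_TOLERANCE = 0.5f;

    /**
     * Joins glyphs (assumed to be in reading order) into words. A new word starts
     * when the rotation or baseline changes, or when the horizontal gap to the
     * previous glyph exceeds gapTolerance. The gap is measured along x only, so
     * this sketch is meaningful for unrotated text; rotation merely splits runs.
     */
    static List<String> group(List<Glyph> glyphs, float gapTolerance) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            boolean sameRun = prev != null
                    && g.rotation == prev.rotation
                    && Math.abs(g.y - prev.y) < BASELINE_TOLERANCE
                    && g.x - (prev.x + prev.width) <= gapTolerance;
            if (!sameRun && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(g.ch);
            prev = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

Calling WordGrouper.group(glyphs, 1f) on the glyphs of "Hello World" would be expected to give back the two words, provided the space shows up as a gap wider than the tolerance.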




was (Author: andreasmeier):
I just wanted to fuel the discussion with my snippet.
My intention is not to provide code that breaks an already great extraction 
engine ;)

{quote}
I'd even start a step before that
{quote}
Depends on what is possible at the lower Levels...

I don't know if I am the right person to take part in that discussion any 
further, but I will try to provide the "simple view" on a higher level, to 
address the problem:
 
- Might it be useful to hold some Information like "(Hello World)" in a 
(meta-)information store, so that pdfbox can later take the single characters 
and form the word again? (No fonttype or -size needed, just simple character 
matching based on position and Rotation...)
- Would it make sense to check for fonttype and -size and just handle cases 
like checmical names ([~tboehme] are there any other reasons for different 
font/size in a word you know?)



> Enhance the text extraction capabilities
> 
>
> Key: PDFBOX-2998
> URL: https://issues.apache.org/jira/browse/PDFBOX-2998
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Text extraction
>Affects Versions: 2.0.0
>Reporter: Andreas Meier
> Attachments: TextBehindText.pdf
>
>
> PDFBox will need some -document layout analysis tools- enhancement to the 
> current text extraction to extract text correctly.
> At the moment the text of a document is extracted using the position of 
> single characters.
> This may lead to wrong results, due to the format of the file.
> There are good tools such as  https://code.google.com/p/lapdftext which we 
> could use to compare our current output.
> Possible enhancements are
> - enhance matching of text to a certain line i.e. don't mix up text from 
> different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-2998) Enhance the text extraction capabilities

2015-10-06 Thread Andreas Meier (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946324#comment-14946324
 ] 

Andreas Meier commented on PDFBOX-2998:
---

I just wanted to fuel the discussion with my snippet.
My intention is not to provide code that breaks an already great extraction 
engine ;)

{quote}
I'd even start a step before that
{quote}
Depends on what is possible at the lower levels...

I don't know if I am the right person to take part in that discussion any 
further, but I will try to provide the "simple view" on a higher level, to 
address the problem:
 
- Might it be useful to hold some information like "(Hello World)" in a 
(meta-)information store, so that PDFBox can later take the single characters 
and form the word again? (No font type or size needed, just simple character 
matching based on position and rotation...)
- Would it make sense to check the font type and size and just handle cases 
like chemical names? ([~tboehme], do you know of any other reasons for a 
different font or size within a word?)



> Enhance the text extraction capabilities
> 
>
> Key: PDFBOX-2998
> URL: https://issues.apache.org/jira/browse/PDFBOX-2998
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Text extraction
>Affects Versions: 2.0.0
>Reporter: Andreas Meier
> Attachments: TextBehindText.pdf
>
>
> PDFBox will need some -document layout analysis tools- enhancement to the 
> current text extraction to extract text correctly.
> At the moment the text of a document is extracted using the position of 
> single characters.
> This may lead to wrong results, due to the format of the file.
> There are good tools such as  https://code.google.com/p/lapdftext which we 
> could use to compare our current output.
> Possible enhancements are
> - enhance matching of text to a certain line i.e. don't mix up text from 
> different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets






[jira] [Commented] (PDFBOX-2340) Overhaul PDFBox Documentation

2015-10-06 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945842#comment-14945842
 ] 

ASF subversion and git services commented on PDFBOX-2340:
-

Commit 1707146 from [~msahyoun] in branch 'cmssite/trunk'
[ https://svn.apache.org/r1707146 ]

PDFBOX-2340: add info how to encrypt password

> Overhaul PDFBox Documentation
> -
>
> Key: PDFBOX-2340
> URL: https://issues.apache.org/jira/browse/PDFBOX-2340
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Maruan Sahyoun
>Assignee: Maruan Sahyoun
>Priority: Critical
> Attachments: Mockup-20140912.png, Mockup_Documentation.png
>
>
> In order to make it easier for users of PDFBox to work with the library there 
> shall be enhanced documentation consisting of an introduction, API 
> references and more well documented examples and code snippets (Cookbook).
> In order to make it easier to contribute the Cookbook shall be built 
> automatically from the examples/snippet 'repository'.






[jira] [Commented] (PDFBOX-2340) Overhaul PDFBox Documentation

2015-10-06 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945815#comment-14945815
 ] 

ASF subversion and git services commented on PDFBOX-2340:
-

Commit 1707143 from [~msahyoun] in branch 'cmssite/trunk'
[ https://svn.apache.org/r1707143 ]

PDFBOX-2340: update layout

> Overhaul PDFBox Documentation
> -
>
> Key: PDFBOX-2340
> URL: https://issues.apache.org/jira/browse/PDFBOX-2340
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Maruan Sahyoun
>Assignee: Maruan Sahyoun
>Priority: Critical
> Attachments: Mockup-20140912.png, Mockup_Documentation.png
>
>
> In order to make it easier for users of PDFBox to work with the library there 
> shall be enhanced documentation consisting of an introduction, API 
> references and more well documented examples and code snippets (Cookbook).
> In order to make it easier to contribute the Cookbook shall be built 
> automatically from the examples/snippet 'repository'.






[jira] [Commented] (PDFBOX-2340) Overhaul PDFBox Documentation

2015-10-06 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945814#comment-14945814
 ] 

ASF subversion and git services commented on PDFBOX-2340:
-

Commit 1707142 from [~msahyoun] in branch 'cmssite/trunk'
[ https://svn.apache.org/r1707142 ]

PDFBOX-2340: update layout

> Overhaul PDFBox Documentation
> -
>
> Key: PDFBOX-2340
> URL: https://issues.apache.org/jira/browse/PDFBOX-2340
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Maruan Sahyoun
>Assignee: Maruan Sahyoun
>Priority: Critical
> Attachments: Mockup-20140912.png, Mockup_Documentation.png
>
>
> In order to make it easier for users of PDFBox to work with the library there 
> shall be enhanced documentation consisting of an introduction, API 
> references and more well documented examples and code snippets (Cookbook).
> In order to make it easier to contribute the Cookbook shall be built 
> automatically from the examples/snippet 'repository'.






[jira] [Updated] (PDFBOX-3008) Memory leak in preflight

2015-10-06 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3008:

Description: 
PreflightParser has this:
{code}
public PreflightParser(DataSource dataSource) throws IOException
{
    // TODO move file handling outside of the parser
    super(new RandomAccessBufferedFileInputStream(dataSource.getInputStream()));
    this.setLenient(false);
    this.originalDocument = dataSource;
}
{code}
The TODO message looks like a design issue, but it is much worse: the 
RandomAccessBufferedFileInputStream is never closed, which results in the temp 
file not being deleted. The file parameter constructor has the same problem, 
i.e. that the RandomAccessBufferedFileInputStream object is not closed (no temp 
file there).
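
As an illustration of the pattern that avoids this kind of leak, here is a minimal sketch in plain Java. It is not the PreflightParser fix and uses no PDFBox classes: SpooledSource and SpoolDemo are hypothetical stand-ins for any object that spools a stream into a temp file. The owner of the temp file implements Closeable, deletes the file in close(), and the caller uses try-with-resources so that close() always runs.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical stand-in for a parser input that spools a stream to a temp file. */
final class SpooledSource implements Closeable {
    private final Path tempFile;

    SpooledSource(InputStream in) throws IOException {
        tempFile = Files.createTempFile("spooled-", ".tmp");
        try {
            Files.copy(in, tempFile, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            Files.deleteIfExists(tempFile); // do not leak the file if the copy fails
            throw e;
        }
    }

    Path path() {
        return tempFile;
    }

    @Override
    public void close() throws IOException {
        // Closing the source is what removes the temp file.
        Files.deleteIfExists(tempFile);
    }
}

final class SpoolDemo {
    /** Returns true if the temp file existed during use and is gone afterwards. */
    static boolean tempFileRemovedAfterUse(byte[] data) {
        try {
            Path used;
            boolean existedDuringUse;
            try (SpooledSource source = new SpooledSource(new ByteArrayInputStream(data))) {
                used = source.path();
                existedDuringUse = Files.exists(used); // parsing would happen here
            }
            return existedDuringUse && !Files.exists(used);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Whether the parser itself or its caller should own the close is exactly the design question the TODO hints at; the sketch only shows that somebody has to.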

  was:
PreflightParser has this:
{code}
public PreflightParser(DataSource dataSource) throws IOException
{
    // TODO move file handling outside of the parser
    super(new RandomAccessBufferedFileInputStream(dataSource.getInputStream()));
    this.setLenient(false);
    this.originalDocument = dataSource;
}
{code}
The TODO message looks like a design issue, but it is much worse: the 
RandomAccessBufferedFileInputStream is never closed. The file parameter 
constructor has the same problem.


> Memory leak in preflight
> 
>
> Key: PDFBOX-3008
> URL: https://issues.apache.org/jira/browse/PDFBOX-3008
> Project: PDFBox
>  Issue Type: Bug
>  Components: Preflight
>Affects Versions: 2.0.0
>Reporter: Tilman Hausherr
>
> PreflightParser has this:
> {code}
> public PreflightParser(DataSource dataSource) throws IOException
> {
>     // TODO move file handling outside of the parser
>     super(new RandomAccessBufferedFileInputStream(dataSource.getInputStream()));
>     this.setLenient(false);
>     this.originalDocument = dataSource;
> }
> {code}
> The TODO message looks like a design issue, but it is much worse: the 
> RandomAccessBufferedFileInputStream is never closed, which results in the 
> temp file not being deleted. The file parameter constructor has the same 
> problem, i.e. that the RandomAccessBufferedFileInputStream object is not 
> closed (no temp file there).






[jira] [Commented] (PDFBOX-2340) Overhaul PDFBox Documentation

2015-10-06 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945784#comment-14945784
 ] 

ASF subversion and git services commented on PDFBOX-2340:
-

Commit 1707141 from [~msahyoun] in branch 'cmssite/trunk'
[ https://svn.apache.org/r1707141 ]

PDFBOX-2340: document how to update the Javadoc

> Overhaul PDFBox Documentation
> -
>
> Key: PDFBOX-2340
> URL: https://issues.apache.org/jira/browse/PDFBOX-2340
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Maruan Sahyoun
>Assignee: Maruan Sahyoun
>Priority: Critical
> Attachments: Mockup-20140912.png, Mockup_Documentation.png
>
>
> In order to make it easier for users of PDFBox to work with the library there 
> shall be enhanced documentation consisting of an introduction, API 
> references and more well documented examples and code snippets (Cookbook).
> In order to make it easier to contribute the Cookbook shall be built 
> automatically from the examples/snippet 'repository'.






[jira] [Commented] (PDFBOX-2340) Overhaul PDFBox Documentation

2015-10-06 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945699#comment-14945699
 ] 

ASF subversion and git services commented on PDFBOX-2340:
-

Commit 1707134 from [~msahyoun] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1707134 ]

PDFBOX-2340: semi automate javadoc generation

> Overhaul PDFBox Documentation
> -
>
> Key: PDFBOX-2340
> URL: https://issues.apache.org/jira/browse/PDFBOX-2340
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Maruan Sahyoun
>Assignee: Maruan Sahyoun
>Priority: Critical
> Attachments: Mockup-20140912.png, Mockup_Documentation.png
>
>
> In order to make it easier for users of PDFBox to work with the library there 
> shall be enhanced documentation consisting of an introduction, API 
> references and more well documented examples and code snippets (Cookbook).
> In order to make it easier to contribute the Cookbook shall be built 
> automatically from the examples/snippet 'repository'.






[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-10-06 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945520#comment-14945520
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1707114 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1707114 ]

PDFBOX-2852: improve javadoc

> Improve code quality (2)
> 
>
> Key: PDFBOX-2852
> URL: https://issues.apache.org/jira/browse/PDFBOX-2852
> Project: PDFBox
>  Issue Type: Task
>Affects Versions: 2.0.0
>Reporter: Tilman Hausherr
> Attachments: winansiencoding.patch, winansiencoding2.patch
>
>
> This is a long-term issue for the task to improve code quality, by using the 
> [SonarQube 
> report|https://analysis.apache.org/dashboard/index/org.apache.pdfbox:pdfbox-reactor],
>  hints in different IDEs, the FindBugs tool and other code quality tools.
> This is a follow-up of PDFBOX-2576, which was getting too long.






[jira] [Updated] (PDFBOX-3007) Preflight cookbook example is inefficient

2015-10-06 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3007:

Description: 
The example shown in
http://pdfbox.apache.org/1.8/cookbook/pdfavalidation.html
passes a DataSource object. This results in the creation of a temporary file. 
The constructor with the DataSource only makes sense when working with URLs. 
(And that only if HTTP is cached, because preflight does an openStream() for 
each PDF stream!)

It would be better to replace
{code}
FileDataSource fd = new FileDataSource(args[0]);
PreflightParser parser = new PreflightParser(fd);
{code}
with
{code}
PreflightParser parser = new PreflightParser(args[0]);
{code}

Edit: removed 2.0, as the example may have to change after solving PDFBOX-3007.

  was:
The example shown in
http://pdfbox.apache.org/1.8/cookbook/pdfavalidation.html
passes a DataSource object. This results in the creation of a temporary file. 
The constructor with the DataSource makes only sense when working with URLs. 
(And that only if http is cached, because preflight does an openStream() for 
each PDF stream!)

It would be better to replace
{code}
FileDataSource fd = new FileDataSource(args[0]);
PreflightParser parser = new PreflightParser(fd);
{code}
with
{code}
PreflightParser parser = new PreflightParser(args[0]);
{code}

When working on that one, the example could also be copied to the 2.0 cookbook 
directory.


> Preflight cookbook example is inefficient
> -
>
> Key: PDFBOX-3007
> URL: https://issues.apache.org/jira/browse/PDFBOX-3007
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.8.10, 1.8.11
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 1.8.11
>
>
> The example shown in
> http://pdfbox.apache.org/1.8/cookbook/pdfavalidation.html
> passes a DataSource object. This results in the creation of a temporary file. 
> The constructor with the DataSource only makes sense when working with URLs. 
> (And that only if HTTP is cached, because preflight does an openStream() for 
> each PDF stream!)
> It would be better to replace
> {code}
> FileDataSource fd = new FileDataSource(args[0]);
> PreflightParser parser = new PreflightParser(fd);
> {code}
> with
> {code}
> PreflightParser parser = new PreflightParser(args[0]);
> {code}
> Edit: removed 2.0, as the example may have to change after solving 
> PDFBOX-3007.






[jira] [Updated] (PDFBOX-3007) Preflight cookbook example is inefficient

2015-10-06 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3007:

Affects Version/s: (was: 2.0.0)

> Preflight cookbook example is inefficient
> -
>
> Key: PDFBOX-3007
> URL: https://issues.apache.org/jira/browse/PDFBOX-3007
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.8.10, 1.8.11
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 1.8.11
>
>
> The example shown in
> http://pdfbox.apache.org/1.8/cookbook/pdfavalidation.html
> passes a DataSource object. This results in the creation of a temporary file. 
> The constructor with the DataSource only makes sense when working with URLs. 
> (And that only if HTTP is cached, because preflight does an openStream() for 
> each PDF stream!)
> It would be better to replace
> {code}
> FileDataSource fd = new FileDataSource(args[0]);
> PreflightParser parser = new PreflightParser(fd);
> {code}
> with
> {code}
> PreflightParser parser = new PreflightParser(args[0]);
> {code}
> When working on that one, the example could also be copied to the 2.0 
> cookbook directory.






[jira] [Updated] (PDFBOX-3007) Preflight cookbook example is inefficient

2015-10-06 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3007:

Fix Version/s: (was: 2.0.0)

> Preflight cookbook example is inefficient
> -
>
> Key: PDFBOX-3007
> URL: https://issues.apache.org/jira/browse/PDFBOX-3007
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.8.10, 1.8.11
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 1.8.11
>
>
> The example shown in
> http://pdfbox.apache.org/1.8/cookbook/pdfavalidation.html
> passes a DataSource object. This results in the creation of a temporary file. 
> The constructor with the DataSource only makes sense when working with URLs. 
> (And that only if HTTP is cached, because preflight does an openStream() for 
> each PDF stream!)
> It would be better to replace
> {code}
> FileDataSource fd = new FileDataSource(args[0]);
> PreflightParser parser = new PreflightParser(fd);
> {code}
> with
> {code}
> PreflightParser parser = new PreflightParser(args[0]);
> {code}
> When working on that one, the example could also be copied to the 2.0 
> cookbook directory.






[jira] [Commented] (PDFBOX-2252) PDFTextStripper has problem with documents with mixed language directions

2015-10-06 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945490#comment-14945490
 ] 

Maruan Sahyoun commented on PDFBOX-2252:


The effort you put into that is of great help. No need to be sorry that it 
takes a little longer.

> PDFTextStripper has problem with documents with mixed language directions
> -
>
> Key: PDFBOX-2252
> URL: https://issues.apache.org/jira/browse/PDFBOX-2252
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.6, 2.0.0
>Reporter: Amir
>Assignee: Maruan Sahyoun
>Priority: Critical
> Fix For: 2.1.0
>
> Attachments: BidiMirroring.txt, IsMirroredDeviations.txt, 
> PDFTextStripper-201709271718.patch, PDFTextStripper-201709272018.patch, 
> PDFTextStripper.java.patch, PDFTextStripper.java.patch, atest.pdf, 
> bugzilla867751.pdf, overlap.jpg, pdfs_directionality.xlsx, 
> pdfs_directionality3.xlsx, test.pdf, wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper is a combination of right-to-left 
> and left-to-right languages, the output characters of one language are 
> reversed. 
> A sample bilingual PDF document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, 
> which is defined as follows: boolean isRtlDominant = rtlCount > ltrCount;
> This class clearly counts the number of RTL characters and decides whether the 
> whole content should be reversed or not. That is incorrect: it must operate on 
> each word, not the whole document.
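
For illustration, the per-word decision the reporter asks for could look roughly like the sketch below. It is not PDFTextStripper code and not a proposed patch: BidiSketch, isRtlDominant(String) and normalizeLine are hypothetical names, and the approach is a deliberate simplification that only reverses characters inside RTL-dominant words. It does not reorder the words of an RTL run and does not implement the Unicode Bidirectional Algorithm. The only library call it relies on is java.lang.Character.getDirectionality.

```java
final class BidiSketch {

    /** True if the word contains more right-to-left than left-to-right characters. */
    static boolean isRtlDominant(String word) {
        int rtl = 0;
        int ltr = 0;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            byte direction = Character.getDirectionality(cp);
            if (direction == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || direction == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtl++;
            } else if (direction == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltr++;
            }
            i += Character.charCount(cp);
        }
        return rtl > ltr;
    }

    /**
     * Takes a line whose characters are in visual (left-to-right on the page)
     * order and reverses only the RTL-dominant words. Words are assumed to be
     * separated by single spaces.
     */
    static String normalizeLine(String visualOrderLine) {
        String[] words = visualOrderLine.split(" ");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            String word = words[i];
            out.append(isRtlDominant(word)
                    ? new StringBuilder(word).reverse().toString()
                    : word);
        }
        return out.toString();
    }
}
```

The difference to the whole-page isRtlDominant flag is that a single Latin word inside a Hebrew or Arabic page, or the other way round, is left in its own order instead of being reversed with everything else.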






[jira] [Commented] (PDFBOX-2252) PDFTextStripper has problem with documents with mixed language directions

2015-10-06 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945494#comment-14945494
 ] 

Maruan Sahyoun commented on PDFBOX-2252:


Thanks a lot - very valuable resource to look for test candidates

> PDFTextStripper has problem with documents with mixed language directions
> -
>
> Key: PDFBOX-2252
> URL: https://issues.apache.org/jira/browse/PDFBOX-2252
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.6, 2.0.0
>Reporter: Amir
>Assignee: Maruan Sahyoun
>Priority: Critical
> Fix For: 2.1.0
>
> Attachments: BidiMirroring.txt, IsMirroredDeviations.txt, 
> PDFTextStripper-201709271718.patch, PDFTextStripper-201709272018.patch, 
> PDFTextStripper.java.patch, PDFTextStripper.java.patch, atest.pdf, 
> bugzilla867751.pdf, overlap.jpg, pdfs_directionality.xlsx, 
> pdfs_directionality3.xlsx, test.pdf, wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper is a combination of right-to-left 
> and left-to-right languages, the output characters of one language are 
> reversed. 
> A sample bilingual PDF document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, 
> which is defined as follows: boolean isRtlDominant = rtlCount > ltrCount;
> This class clearly counts the number of RTL characters and decides whether the 
> whole content should be reversed or not. That is incorrect: it must operate on 
> each word, not the whole document.






[jira] [Created] (PDFBOX-3008) Memory leak in preflight

2015-10-06 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created PDFBOX-3008:
---

 Summary: Memory leak in preflight
 Key: PDFBOX-3008
 URL: https://issues.apache.org/jira/browse/PDFBOX-3008
 Project: PDFBox
  Issue Type: Bug
  Components: Preflight
Affects Versions: 2.0.0
Reporter: Tilman Hausherr


PreflightParser has this:
{code}
public PreflightParser(DataSource dataSource) throws IOException
{
    // TODO move file handling outside of the parser
    super(new RandomAccessBufferedFileInputStream(dataSource.getInputStream()));
    this.setLenient(false);
    this.originalDocument = dataSource;
}
{code}
The TODO message looks like a design issue, but it is much worse: the 
RandomAccessBufferedFileInputStream is never closed. The file parameter 
constructor has the same problem.






[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-10-06 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945474#comment-14945474
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1707110 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1707110 ]

PDFBOX-2852: improve javadoc; rename parameter; simplify code

> Improve code quality (2)
> 
>
> Key: PDFBOX-2852
> URL: https://issues.apache.org/jira/browse/PDFBOX-2852
> Project: PDFBox
>  Issue Type: Task
>Affects Versions: 2.0.0
>Reporter: Tilman Hausherr
> Attachments: winansiencoding.patch, winansiencoding2.patch
>
>
> This is a long-term issue for the task to improve code quality, by using the 
> [SonarQube 
> report|https://analysis.apache.org/dashboard/index/org.apache.pdfbox:pdfbox-reactor],
>  hints in different IDEs, the FindBugs tool and other code quality tools.
> This is a follow-up of PDFBOX-2576, which was getting too long.






[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-10-06 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945458#comment-14945458
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1707105 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1707105 ]

PDFBOX-2852: improve javadoc

> Improve code quality (2)
> 
>
> Key: PDFBOX-2852
> URL: https://issues.apache.org/jira/browse/PDFBOX-2852
> Project: PDFBox
>  Issue Type: Task
>Affects Versions: 2.0.0
>Reporter: Tilman Hausherr
> Attachments: winansiencoding.patch, winansiencoding2.patch
>
>
> This is a long-term issue for the task to improve code quality, by using the 
> [SonarQube 
> report|https://analysis.apache.org/dashboard/index/org.apache.pdfbox:pdfbox-reactor],
>  hints in different IDEs, the FindBugs tool and other code quality tools.
> This is a follow-up of PDFBOX-2576, which was getting too long.






[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-10-06 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945444#comment-14945444
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1707098 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1707098 ]

PDFBOX-2852: remove closeQuietly that is done in finally; slight reformat

> Improve code quality (2)
> 
>
> Key: PDFBOX-2852
> URL: https://issues.apache.org/jira/browse/PDFBOX-2852
> Project: PDFBox
>  Issue Type: Task
>Affects Versions: 2.0.0
>Reporter: Tilman Hausherr
> Attachments: winansiencoding.patch, winansiencoding2.patch
>
>
> This is a long-term issue for the task to improve code quality, by using the 
> [SonarQube 
> report|https://analysis.apache.org/dashboard/index/org.apache.pdfbox:pdfbox-reactor],
>  hints in different IDEs, the FindBugs tool and other code quality tools.
> This is a follow-up of PDFBOX-2576, which was getting too long.






[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-10-06 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945440#comment-14945440
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1707096 from [~msahyoun] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1707096 ]

PDFBOX-2852: correct javadoc

> Improve code quality (2)
> 
>
> Key: PDFBOX-2852
> URL: https://issues.apache.org/jira/browse/PDFBOX-2852
> Project: PDFBox
>  Issue Type: Task
>Affects Versions: 2.0.0
>Reporter: Tilman Hausherr
> Attachments: winansiencoding.patch, winansiencoding2.patch
>
>
> This is a long-term issue for the task to improve code quality, by using the 
> [SonarQube 
> report|https://analysis.apache.org/dashboard/index/org.apache.pdfbox:pdfbox-reactor],
>  hints in different IDEs, the FindBugs tool and other code quality tools.
> This is a follow-up of PDFBOX-2576, which was getting too long.






[jira] [Updated] (PDFBOX-2252) PDFTextStripper has problem with documents with mixed language directions

2015-10-06 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-2252:

Attachment: pdfs_directionality3.xlsx

Slightly updated full run (without OOM). I selected records with more than 
30 LTR tokens and more than 30 RTL tokens.


> PDFTextStripper has problem with documents with mixed language directions
> -
>
> Key: PDFBOX-2252
> URL: https://issues.apache.org/jira/browse/PDFBOX-2252
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.6, 2.0.0
>Reporter: Amir
>Assignee: Maruan Sahyoun
>Priority: Critical
> Fix For: 2.1.0
>
> Attachments: BidiMirroring.txt, IsMirroredDeviations.txt, 
> PDFTextStripper-201709271718.patch, PDFTextStripper-201709272018.patch, 
> PDFTextStripper.java.patch, PDFTextStripper.java.patch, atest.pdf, 
> bugzilla867751.pdf, overlap.jpg, pdfs_directionality.xlsx, 
> pdfs_directionality3.xlsx, test.pdf, wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper combines right-to-left and 
> left-to-right languages, the output characters of one language are 
> reversed. 
> A sample bilingual PDF document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, 
> which is defined as follows: boolean isRtlDominant = rtlCount > ltrCount;
> This class counts the number of RTL characters and uses that to decide 
> whether the whole content should be reversed. That is not correct: it must 
> operate on each word, not the whole document.
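
A minimal sketch of the per-word decision the reporter asks for, as opposed to one isRtlDominant flag for the whole page. It is not the PDFTextStripper implementation: the class and method names are hypothetical, it assumes words are separated by single spaces and arrive in visual (left-to-right) order, and it ignores combining marks, mirrored characters and the reordering of the words themselves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class PerWordDirection {

    /** Reverses the word if it holds more strong RTL than strong LTR characters. */
    static String normalizeWord(String word) {
        int rtl = 0;
        int ltr = 0;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            byte dir = Character.getDirectionality(cp);
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtl++;
            } else if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltr++;
            }
            i += Character.charCount(cp);
        }
        // The decision is taken per word, so an LTR word on an RTL-heavy page is left alone.
        return rtl > ltr ? new StringBuilder(word).reverse().toString() : word;
    }

    /** Applies the per-word decision to every space-separated word of a line. */
    static String normalizeLine(String line) {
        List<String> out = new ArrayList<>();
        for (String word : line.split(" ", -1)) {
            out.add(normalizeWord(word));
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        // Latin word followed by a Hebrew word stored in visual order.
        System.out.println(normalizeLine("abc \u05DD\u05D5\u05DC\u05E9"));
    }
}
```

A full solution would more likely rely on the Unicode bidirectional algorithm (for example java.text.Bidi) than on counting characters; the sketch only shows why the scope of the decision matters.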



[jira] [Commented] (PDFBOX-2252) PDFTextStripper has problem with documents with mixed language directions

2015-10-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945375#comment-14945375
 ] 

Tim Allison commented on PDFBOX-2252:
-

Sounds good to me.  Given current workload on other stuff, I doubt I'll have a 
chance to finish major regression testing before Friday, and it may have to go 
into next week. :(

> PDFTextStripper has problem with documents with mixed language directions
> -
>
> Key: PDFBOX-2252
> URL: https://issues.apache.org/jira/browse/PDFBOX-2252
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.6, 2.0.0
>Reporter: Amir
>Assignee: Maruan Sahyoun
>Priority: Critical
> Fix For: 2.1.0
>
> Attachments: BidiMirroring.txt, IsMirroredDeviations.txt, 
> PDFTextStripper-201709271718.patch, PDFTextStripper-201709272018.patch, 
> PDFTextStripper.java.patch, PDFTextStripper.java.patch, atest.pdf, 
> bugzilla867751.pdf, overlap.jpg, pdfs_directionality.xlsx, test.pdf, 
> wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper combines right-to-left and 
> left-to-right languages, the output characters of one language are 
> reversed. 
> A sample bilingual PDF document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, 
> which is defined as follows: boolean isRtlDominant = rtlCount > ltrCount;
> This class counts the number of RTL characters and uses that to decide 
> whether the whole content should be reversed. That is not correct: it must 
> operate on each word, not the whole document.



[jira] [Updated] (PDFBOX-3007) Preflight cookbook example is inefficient

2015-10-06 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3007:

Priority: Minor  (was: Major)

> Preflight cookbook example is inefficient
> -
>
> Key: PDFBOX-3007
> URL: https://issues.apache.org/jira/browse/PDFBOX-3007
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.8.10, 1.8.11, 2.0.0
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 1.8.11, 2.0.0
>
>
> The example shown in
> http://pdfbox.apache.org/1.8/cookbook/pdfavalidation.html
> passes a DataSource object. This results in the creation of a temporary file. 
> The constructor with the DataSource only makes sense when working with URLs 
> (and then only if HTTP is cached, because preflight does an openStream() for 
> each PDF stream!).
> It would be better to replace
> {code}
> FileDataSource fd = new FileDataSource(args[0]);
> PreflightParser parser = new PreflightParser(fd);
> {code}
> with
> {code}
> PreflightParser parser = new PreflightParser(args[0]);
> {code}
> When working on that one, the example could also be copied to the 2.0 
> cookbook directory.



[jira] [Updated] (PDFBOX-3007) Preflight cookbook example is inefficient

2015-10-06 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3007:

Issue Type: Improvement  (was: Bug)

> Preflight cookbook example is inefficient
> -
>
> Key: PDFBOX-3007
> URL: https://issues.apache.org/jira/browse/PDFBOX-3007
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.8.10, 1.8.11, 2.0.0
>Reporter: Tilman Hausherr
> Fix For: 1.8.11, 2.0.0
>
>
> The example shown in
> http://pdfbox.apache.org/1.8/cookbook/pdfavalidation.html
> passes a DataSource object. This results in the creation of a temporary file. 
> The constructor with the DataSource only makes sense when working with URLs 
> (and then only if HTTP is cached, because preflight does an openStream() for 
> each PDF stream!).
> It would be better to replace
> {code}
> FileDataSource fd = new FileDataSource(args[0]);
> PreflightParser parser = new PreflightParser(fd);
> {code}
> with
> {code}
> PreflightParser parser = new PreflightParser(args[0]);
> {code}
> When working on that one, the example could also be copied to the 2.0 
> cookbook directory.



[jira] [Created] (PDFBOX-3007) Preflight cookbook example is inefficient

2015-10-06 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created PDFBOX-3007:
---

 Summary: Preflight cookbook example is inefficient
 Key: PDFBOX-3007
 URL: https://issues.apache.org/jira/browse/PDFBOX-3007
 Project: PDFBox
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.8.10, 1.8.11, 2.0.0
Reporter: Tilman Hausherr
 Fix For: 1.8.11, 2.0.0


The example shown in
http://pdfbox.apache.org/1.8/cookbook/pdfavalidation.html
passes a DataSource object. This results in the creation of a temporary file. 
The constructor with the DataSource only makes sense when working with URLs 
(and then only if HTTP is cached, because preflight does an openStream() for 
each PDF stream!).

It would be better to replace
{code}
FileDataSource fd = new FileDataSource(args[0]);
PreflightParser parser = new PreflightParser(fd);
{code}
with
{code}
PreflightParser parser = new PreflightParser(args[0]);
{code}

When working on that one, the example could also be copied to the 2.0 cookbook 
directory.



[jira] [Commented] (PDFBOX-2755) Support filling hybrid PDF forms

2015-10-06 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945289#comment-14945289
 ] 

Maruan Sahyoun commented on PDFBOX-2755:


I've changed the ticket title to better describe the issue. There are 3 types 
of interactive PDF forms:

- classic AcroForms, i.e. the form contains a /Fields array
- dynamic XFA forms, i.e. the form contains no or an empty /Fields array and 
there is an XFA entry
- hybrid forms, i.e. the form contains a /Fields array and an XFA entry

For hybrid forms we currently update only the /Fields but not the data 
contained in the XFA. When an XFA-aware reader such as Adobe Reader opens the 
form, it looks for the current data in the XFA and NOT in the /Fields. An 
XFA-unaware reader looks at the /Fields content. So, to properly support 
hybrid forms for XFA-aware readers, we need to update the XFA data in addition 
to the /Fields values. Currently, XFA handling is only implemented to the 
extent that one can extract the XFA content, update it externally, and set it 
again.

[~lehmi] Many of the XFA PDFs have additional usage rights applied, which means 
that we need the extended incremental update functionality to support them 
properly. Otherwise we can update the document, but the user gets an error 
that the document has been modified, and the usage rights are removed.
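
The three form types listed above can be told apart with two facts about the form: whether /Fields is non-empty and whether an XFA entry exists. The following is only a sketch of that decision table. The class, enum and method names are hypothetical, and how the two inputs are read from a document is deliberately left out.

```java
// Hypothetical helper, not part of PDFBox.
final class FormKind {

    enum Kind { NONE, ACROFORM, DYNAMIC_XFA, HYBRID }

    /**
     * @param fieldCount number of entries in the /Fields array (0 if missing or empty)
     * @param hasXfa     whether the form dictionary has an XFA entry
     */
    static Kind classify(int fieldCount, boolean hasXfa) {
        boolean hasFields = fieldCount > 0;
        if (hasFields && hasXfa) {
            return Kind.HYBRID;
        }
        if (hasXfa) {
            return Kind.DYNAMIC_XFA;
        }
        return hasFields ? Kind.ACROFORM : Kind.NONE;
    }

    /** Hybrid forms need both the /Fields values and the XFA data kept in sync. */
    static boolean needsXfaDataUpdate(Kind kind) {
        return kind == Kind.HYBRID;
    }

    public static void main(String[] args) {
        System.out.println(classify(3, true));   // form with /Fields and XFA
        System.out.println(classify(0, true));   // XFA only
        System.out.println(classify(3, false));  // /Fields only
    }
}
```

With such a check, a form-filling routine could decide up front whether setting the /Fields values alone is enough or whether an XFA-aware reader would still show stale data.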

> Support filling hybrid PDF forms
> 
>
> Key: PDFBOX-2755
> URL: https://issues.apache.org/jira/browse/PDFBOX-2755
> Project: PDFBox
>  Issue Type: Improvement
>  Components: AcroForm
>Affects Versions: 1.8.9, 1.8.10, 2.0.0
>Reporter: hui xu
>Assignee: Maruan Sahyoun
> Fix For: 2.1.0
>
> Attachments: formtestFailed.pdf, formtestOK.pdf
>
>
> Ran SetField.java in package org.apache.pdfbox.examples.fdf; the field value 
> is changed but cannot be saved in the PDF file.
> I added 2 more lines of code:
> List newList = new ArrayList();
> .
> acroForm.setFields(newList);
> The PDF was saved with the change. But after reloading the saved PDF file and 
> trying to reset some values, it throws a NullPointerException; getFields() 
> cannot be used on the file.



[jira] [Commented] (PDFBOX-2755) Support filling hybrid PDF forms

2015-10-06 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945300#comment-14945300
 ] 

Maruan Sahyoun commented on PDFBOX-2755:


As a side note, PDF 2.0 (in the current draft) deprecates XFA. NeedAppearances 
has also changed in that it demands that a form-filling application shall 
update the fields' appearance itself and not depend on the (Adobe) reader to 
construct it by setting the flag. The same applies to annotations.

> Support filling hybrid PDF forms
> 
>
> Key: PDFBOX-2755
> URL: https://issues.apache.org/jira/browse/PDFBOX-2755
> Project: PDFBox
>  Issue Type: Improvement
>  Components: AcroForm
>Affects Versions: 1.8.9, 1.8.10, 2.0.0
>Reporter: hui xu
>Assignee: Maruan Sahyoun
> Fix For: 2.1.0
>
> Attachments: formtestFailed.pdf, formtestOK.pdf
>
>
> Ran SetField.java in package org.apache.pdfbox.examples.fdf; the field value 
> is changed but cannot be saved in the PDF file.
> I added 2 more lines of code:
> List newList = new ArrayList();
> .
> acroForm.setFields(newList);
> The PDF was saved with the change. But after reloading the saved PDF file and 
> trying to reset some values, it throws a NullPointerException; getFields() 
> cannot be used on the file.



[jira] [Updated] (PDFBOX-2755) Support filling hybrid PDF forms

2015-10-06 Thread Maruan Sahyoun (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maruan Sahyoun updated PDFBOX-2755:
---
Summary: Support filling hybrid PDF forms  (was: Can't save the change to 
pdf file)

> Support filling hybrid PDF forms
> 
>
> Key: PDFBOX-2755
> URL: https://issues.apache.org/jira/browse/PDFBOX-2755
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 1.8.9, 1.8.10, 2.0.0
>Reporter: hui xu
>Assignee: Maruan Sahyoun
>Priority: Critical
> Fix For: 2.1.0
>
> Attachments: formtestFailed.pdf, formtestOK.pdf
>
>
> Ran SetField.java in package org.apache.pdfbox.examples.fdf; the field value 
> is changed but cannot be saved in the PDF file.
> I added 2 more lines of code:
> List newList = new ArrayList();
> .
> acroForm.setFields(newList);
> The PDF was saved with the change. But after reloading the saved PDF file and 
> trying to reset some values, it throws a NullPointerException; getFields() 
> cannot be used on the file.



[jira] [Updated] (PDFBOX-2755) Support filling hybrid PDF forms

2015-10-06 Thread Maruan Sahyoun (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maruan Sahyoun updated PDFBOX-2755:
---
Issue Type: Improvement  (was: Bug)

> Support filling hybrid PDF forms
> 
>
> Key: PDFBOX-2755
> URL: https://issues.apache.org/jira/browse/PDFBOX-2755
> Project: PDFBox
>  Issue Type: Improvement
>  Components: AcroForm
>Affects Versions: 1.8.9, 1.8.10, 2.0.0
>Reporter: hui xu
>Assignee: Maruan Sahyoun
>Priority: Critical
> Fix For: 2.1.0
>
> Attachments: formtestFailed.pdf, formtestOK.pdf
>
>
> Ran SetField.java in package org.apache.pdfbox.examples.fdf; the field value 
> is changed but cannot be saved in the PDF file.
> I added 2 more lines of code:
> List newList = new ArrayList();
> .
> acroForm.setFields(newList);
> The PDF was saved with the change. But after reloading the saved PDF file and 
> trying to reset some values, it throws a NullPointerException; getFields() 
> cannot be used on the file.



[jira] [Updated] (PDFBOX-2755) Can't save the change to pdf file

2015-10-06 Thread Maruan Sahyoun (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maruan Sahyoun updated PDFBOX-2755:
---
Affects Version/s: 2.0.0
   1.8.10

> Can't save the change to pdf file
> -
>
> Key: PDFBOX-2755
> URL: https://issues.apache.org/jira/browse/PDFBOX-2755
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 1.8.9, 1.8.10, 2.0.0
>Reporter: hui xu
>Priority: Critical
> Fix For: 2.1.0
>
> Attachments: formtestFailed.pdf, formtestOK.pdf
>
>
> Ran SetField.java in package org.apache.pdfbox.examples.fdf; the field value 
> is changed but cannot be saved in the PDF file.
> I added 2 more lines of code:
> List newList = new ArrayList();
> .
> acroForm.setFields(newList);
> The PDF was saved with the change. But after reloading the saved PDF file and 
> trying to reset some values, it throws a NullPointerException; getFields() 
> cannot be used on the file.



[jira] [Assigned] (PDFBOX-2755) Can't save the change to pdf file

2015-10-06 Thread Maruan Sahyoun (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maruan Sahyoun reassigned PDFBOX-2755:
--

Assignee: Maruan Sahyoun

> Can't save the change to pdf file
> -
>
> Key: PDFBOX-2755
> URL: https://issues.apache.org/jira/browse/PDFBOX-2755
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 1.8.9, 1.8.10, 2.0.0
>Reporter: hui xu
>Assignee: Maruan Sahyoun
>Priority: Critical
> Fix For: 2.1.0
>
> Attachments: formtestFailed.pdf, formtestOK.pdf
>
>
> Ran SetField.java in package org.apache.pdfbox.examples.fdf; the field value 
> is changed but cannot be saved in the PDF file.
> I added 2 more lines of code:
> List newList = new ArrayList();
> .
> acroForm.setFields(newList);
> The PDF was saved with the change. But after reloading the saved PDF file and 
> trying to reset some values, it throws a NullPointerException; getFields() 
> cannot be used on the file.



[jira] [Commented] (PDFBOX-2998) Enhance the text extraction capabilities

2015-10-06 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945260#comment-14945260
 ] 

Maruan Sahyoun commented on PDFBOX-2998:


Thanks for the clarification, and agreed that this is a simple example. OTOH, as 
some characteristics are constant for a single text-showing operator, keeping 
the information that this was a single operation might help to improve 
the text extraction. Whatever changes we look into, we need a good set of 
samples to verify the ideas. The text extraction we have is already at a good 
level.

> Enhance the text extraction capabilities
> 
>
> Key: PDFBOX-2998
> URL: https://issues.apache.org/jira/browse/PDFBOX-2998
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Text extraction
>Affects Versions: 2.0.0
>Reporter: Andreas Meier
> Attachments: TextBehindText.pdf
>
>
> PDFBox will need some -document layout analysis tools- enhancement to the 
> current text extraction to extract text correctly.
> At the moment, the text of a document is extracted using the position of 
> single characters.
> This may lead to wrong results, due to the format of the file.
> There are good tools such as  https://code.google.com/p/lapdftext which we 
> could use to compare our current output.
> Possible enhancements are
> - enhance matching of text to a certain line i.e. don't mix up text from 
> different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets



[jira] [Updated] (PDFBOX-2755) Support filling hybrid PDF forms

2015-10-06 Thread Maruan Sahyoun (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maruan Sahyoun updated PDFBOX-2755:
---
Priority: Major  (was: Critical)

> Support filling hybrid PDF forms
> 
>
> Key: PDFBOX-2755
> URL: https://issues.apache.org/jira/browse/PDFBOX-2755
> Project: PDFBox
>  Issue Type: Improvement
>  Components: AcroForm
>Affects Versions: 1.8.9, 1.8.10, 2.0.0
>Reporter: hui xu
>Assignee: Maruan Sahyoun
> Fix For: 2.1.0
>
> Attachments: formtestFailed.pdf, formtestOK.pdf
>
>
> Ran SetField.java in package org.apache.pdfbox.examples.fdf; the field value 
> is changed but cannot be saved in the PDF file.
> I added 2 more lines of code:
> List newList = new ArrayList();
> .
> acroForm.setFields(newList);
> The PDF was saved with the change. But after reloading the saved PDF file and 
> trying to reset some values, it throws a NullPointerException; getFields() 
> cannot be used on the file.



[jira] [Updated] (PDFBOX-2755) Can't save the change to pdf file

2015-10-06 Thread Maruan Sahyoun (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maruan Sahyoun updated PDFBOX-2755:
---
Fix Version/s: 2.1.0

> Can't save the change to pdf file
> -
>
> Key: PDFBOX-2755
> URL: https://issues.apache.org/jira/browse/PDFBOX-2755
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 1.8.9, 1.8.10, 2.0.0
>Reporter: hui xu
>Priority: Critical
> Fix For: 2.1.0
>
> Attachments: formtestFailed.pdf, formtestOK.pdf
>
>
> Ran SetField.java in package org.apache.pdfbox.examples.fdf; the field value 
> is changed but cannot be saved in the PDF file.
> I added 2 more lines of code:
> List newList = new ArrayList();
> .
> acroForm.setFields(newList);
> The PDF was saved with the change. But after reloading the saved PDF file and 
> trying to reset some values, it throws a NullPointerException; getFields() 
> cannot be used on the file.



[jira] [Commented] (PDFBOX-2998) Enhance the text extraction capabilities

2015-10-06 Thread Timo Boehme (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945139#comment-14945139
 ] 

Timo Boehme commented on PDFBOX-2998:
-

The problem is that when the character spacing is set to fancy values, even a 
{{(oW)}} won't be one word; instead each character belongs to a separate word 
(other example: 'not a word' with (taw) in one chunk). Thus, while in most 
cases applications might group words correctly, there is unfortunately a not 
too small number of 'misuses' which are not easy to detect. So the most general 
solution is to first separate everything into single characters with correct 
positions and then group them using the same algorithm, independent of what 
the provided text chunk was.

Text direction should be respected in every case. Font (name/size) might 
work in most cases, but there are cases with mixed fonts/sizes within the same 
'word' (chemical names using Greek characters etc.).
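
The "split into single characters first, then group by position" approach described above could look roughly like the sketch below. It is not the PDFTextStripper algorithm: the Glyph class, the gapFactor threshold and the restriction to one horizontal line are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class GlyphGrouping {

    /** Stand-in for a positioned character; only x and width are modelled. */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Groups glyphs of a single line into words. A new word starts when the gap
     * to the previous glyph exceeds gapFactor times the previous glyph's width,
     * regardless of which text-showing operator the glyphs came from.
     */
    static List<String> groupIntoWords(List<Glyph> glyphs, float gapFactor) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));

        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : sorted) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > gapFactor * prev.width) {
                    words.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(g.text);
            prev = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // "(oW)" shown by one operator, but with a large character spacing in between.
        List<Glyph> chunk = new ArrayList<>();
        chunk.add(new Glyph("o", 0f, 5f));
        chunk.add(new Glyph("W", 50f, 8f));
        System.out.println(groupIntoWords(chunk, 0.3f));
    }
}
```

Because the grouping only looks at positions, the (oW) chunk with a large character spacing ends up as two words, while two glyphs from different chunks that sit next to each other would still be joined.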

> Enhance the text extraction capabilities
> 
>
> Key: PDFBOX-2998
> URL: https://issues.apache.org/jira/browse/PDFBOX-2998
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Text extraction
>Affects Versions: 2.0.0
>Reporter: Andreas Meier
> Attachments: TextBehindText.pdf
>
>
> PDFBox will need some -document layout analysis tools- enhancement to the 
> current text extraction to extract text correctly.
> At the moment, the text of a document is extracted using the position of 
> single characters.
> This may lead to wrong results, due to the format of the file.
> There are good tools such as  https://code.google.com/p/lapdftext which we 
> could use to compare our current output.
> Possible enhancements are
> - enhance matching of text to a certain line i.e. don't mix up text from 
> different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets



[jira] [Commented] (PDFBOX-2998) Enhance the text extraction capabilities

2015-10-06 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945066#comment-14945066
 ] 

Maruan Sahyoun commented on PDFBOX-2998:


I'd even start a step before that - and people like [~jahewson] or [~lehmi] 
have more insight to answer that - and look at why a simple PDF with the 
following text content

{code}
BT
  /F1 12 Tf
  100 700 Td
  (Hello World) Tj
ET
{code}

produces 11 TextPositions. What are the benefits of that? What are the 
drawbacks?

As a quick note: as we are trying to get PDFBox 2.0.0 out and I don't want to 
create new regressions, it's unlikely that I will make further changes to the 
text extraction at this point in time. The initial Bidi support we have now 
because of your efforts is under test and, if that passes, will be part of 
PDFBox 2.0.0. That doesn't mean that we shouldn't continue the discussion. As 
text extraction is used a lot, it's quite sensitive to changes.

> Enhance the text extraction capabilities
> 
>
> Key: PDFBOX-2998
> URL: https://issues.apache.org/jira/browse/PDFBOX-2998
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Text extraction
>Affects Versions: 2.0.0
>Reporter: Andreas Meier
> Attachments: TextBehindText.pdf
>
>
> PDFBox will need some -document layout analysis tools- enhancement to the 
> current text extraction to extract text correctly.
> At the moment, the text of a document is extracted using the position of 
> single characters.
> This may lead to wrong results, due to the format of the file.
> There are good tools such as  https://code.google.com/p/lapdftext which we 
> could use to compare our current output.
> Possible enhancements are
> - enhance matching of text to a certain line i.e. don't mix up text from 
> different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets



[jira] [Comment Edited] (PDFBOX-2998) Enhance the text extraction capabilities

2015-10-06 Thread Andreas Meier (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944999#comment-14944999
 ] 

Andreas Meier edited comment on PDFBOX-2998 at 10/6/15 1:14 PM:


The question is: when does a group of TextPositions form a word?
My first thought is the location of the TextPosition, but it also depends on 
the font and the size of the font.

In my opinion we can achieve a lot if we just enhance the current code by 
adding some checks for font type and font size. There may be many other 
boundary conditions; if you know of any, let me know.

So one possible enhancement would be to check that in the sorting algorithm:


{code:title=TextPositionComparator.java|borderStyle=solid}
...

public class TextPositionComparator implements Comparator<TextPosition>
{
    @Override
    public int compare(TextPosition pos1, TextPosition pos2)
    {
        // only compare text that is in the same direction
        if (pos1.getDir() < pos2.getDir())
        {
            return -1;
        }
        else if (pos1.getDir() > pos2.getDir())
        {
            return 1;
        }

        // FONT TYPE AND FONT SIZE CHECK START
        if (pos1.getFontSize() != pos2.getFontSize() ||
            !pos1.getFont().getName().equals(pos2.getFont().getName()))
        {
            return -1;
        }
        // FONT TYPE AND FONT SIZE CHECK END

        // get the text direction adjusted coordinates
        float x1 = pos1.getXDirAdj();
        float x2 = pos2.getXDirAdj();

        float pos1YBottom = pos1.getYDirAdj();
        float pos2YBottom = pos2.getYDirAdj();

        ...
{code}

(BTW, if you wonder why the code snippet checks for the font name and not the 
font itself: some blanks will not be represented in the ToUnicode tables of the 
PDF. Therefore the PDFBox fallback solution is used, which uses other fonts for 
the missing characters.)

Please correct me if I am wrong; this is just a simple-minded idea that might 
work for some cases but break others.
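
One caveat with returning -1 whenever font or size differ: the result is then -1 for compare(a, b) and for compare(b, a), which breaks the symmetry that java.util.Comparator requires and can make sorting fail or give unstable results. The sketch below shows one way to keep a font check while staying consistent. It uses a hypothetical stand-in class instead of TextPosition, and whether font should rank before position at all remains the open question of this thread.

```java
// Hypothetical stand-in types, not part of PDFBox.
final class FontAwareOrder {

    /** Minimal stand-in for a text position: direction, font and adjusted coordinates. */
    static final class Pos {
        final float dir;
        final float fontSize;
        final String fontName;
        final float x;
        final float y;

        Pos(float dir, float fontSize, String fontName, float x, float y) {
            this.dir = dir;
            this.fontSize = fontSize;
            this.fontName = fontName;
            this.x = x;
            this.y = y;
        }
    }

    /** Orders by direction, then font size, then font name, then y, then x. */
    static int compare(Pos a, Pos b) {
        int c = Float.compare(a.dir, b.dir);
        if (c != 0) {
            return c;
        }
        // Symmetric font check: swapping the arguments flips the sign.
        c = Float.compare(a.fontSize, b.fontSize);
        if (c != 0) {
            return c;
        }
        c = a.fontName.compareTo(b.fontName);
        if (c != 0) {
            return c;
        }
        c = Float.compare(a.y, b.y);
        if (c != 0) {
            return c;
        }
        return Float.compare(a.x, b.x);
    }

    public static void main(String[] args) {
        Pos a = new Pos(0f, 12f, "Helvetica", 10f, 700f);
        Pos b = new Pos(0f, 10f, "Symbol", 20f, 700f);
        System.out.println(compare(a, b) + " / " + compare(b, a));
    }
}
```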


> Enhance the text extraction capabilities
> 
>
> Key: PDFBOX-2998
> URL: https://issues.apache.org/jira/browse/PDFBOX-2998
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Text extraction
>Affects Versions: 2.0.0
>Reporter: Andreas Meier
> Attachments: TextBehindText.pdf
>
>
> PDFBox will need some -document layout analysis tools- enhancement to the 
> current text extraction to extract text correctly.
> At the moment, the text of a document is extracted using the position of 
> single characters.
> This may lead to wrong results, due to the format of the file.
> There are good tools such as  https://code.google.com/p/lapdftext which we 
> could use to compare our current output.
> Possible enhancements are
> - enhance matching of text to a certain line i.e. don't mix up text from 
> different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets




[jira] [Comment Edited] (PDFBOX-2998) Enhance the text extraction capabilities

2015-10-06 Thread Andreas Meier (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944999#comment-14944999
 ] 

Andreas Meier edited comment on PDFBOX-2998 at 10/6/15 1:09 PM:


The question is: when does a group of TextPositions form a word?
My first thought is the location of the TextPosition, but it also depends on 
the font and the size of the font.

In my opinion we can achieve a lot if we just enhance the current code by 
adding some checks for font type and font size. There may be many other 
boundary conditions; if you know of any, let me know.

So one possible enhancement would be to check that in the sorting algorithm:


{code:title=TextPositionComparator.java|borderStyle=solid}
...

public class TextPositionComparator implements Comparator<TextPosition>
{
    @Override
    public int compare(TextPosition pos1, TextPosition pos2)
    {
        // only compare text that is in the same direction
        if (pos1.getDir() < pos2.getDir())
        {
            return -1;
        }
        else if (pos1.getDir() > pos2.getDir())
        {
            return 1;
        }

        // FONT TYPE AND FONT SIZE CHECK START
        if (pos1.getFontSize() != pos2.getFontSize() ||
            !pos1.getFont().getName().equals(pos2.getFont().getName()))
        {
            return -1;
        }
        // FONT TYPE AND FONT SIZE CHECK END

        // get the text direction adjusted coordinates
        float x1 = pos1.getXDirAdj();
        float x2 = pos2.getXDirAdj();

        float pos1YBottom = pos1.getYDirAdj();
        float pos2YBottom = pos2.getYDirAdj();

        ...
{code}


Please correct me if I am wrong; this is just a simple-minded idea that might 
work for some cases but break others.


> Enhance the text extraction capabilities
> ----------------------------------------
>
> Key: PDFBOX-2998
> URL: https://issues.apache.org/jira/browse/PDFBOX-2998
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Andreas Meier
> Attachments: TextBehindText.pdf
>
>
> PDFBox will need some -document layout analysis tools- enhancement to the 
> current text extraction to extract text correctly.
> At the moment the text of a document is extracted using the position of 
> single characters.
> This may lead to wrong results, due to the format of the file.
> There are good tools such as https://code.google.com/p/lapdftext which we 
> could use to compare our current output.
> Possible enhancements are
> - enhance matching of text to a certain line i.e. don't mix up text from 
> different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-2998) Enhance the text extraction capabilities

2015-10-06 Thread Andreas Meier (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944999#comment-14944999
 ] 

Andreas Meier commented on PDFBOX-2998:
---

The question is: when does a group of text positions form a word?
My first thought is the location of each text position, but it also depends on 
the font and the font size.

In my opinion we can achieve a lot by enhancing the current code with 
some checks for font type and font size.

One possible enhancement would be to add that check to the sorting algorithm:


{code:title=TextPositionComparator.java|borderStyle=solid}
...

public class TextPositionComparator implements Comparator<TextPosition>
{
    @Override
    public int compare(TextPosition pos1, TextPosition pos2)
    {
        // only compare text that is in the same direction
        if (pos1.getDir() < pos2.getDir())
        {
            return -1;
        }
        else if (pos1.getDir() > pos2.getDir())
        {
            return 1;
        }

        // FONT TYPE AND FONT SIZE CHECK START
        // proposed addition: positions that differ in font size or font name
        // never compare as equal; note that -1 is returned for either
        // argument order
        if (pos1.getFontSize() != pos2.getFontSize() ||
            !pos1.getFont().getName().equals(pos2.getFont().getName()))
        {
            return -1;
        }
        // FONT TYPE AND FONT SIZE CHECK END

        // get the text direction adjusted coordinates
        float x1 = pos1.getXDirAdj();
        float x2 = pos2.getXDirAdj();

        float pos1YBottom = pos1.getYDirAdj();
        float pos2YBottom = pos2.getYDirAdj();

        ...
{code}


Please correct me if I am wrong; this is just a simple-minded idea that might 
work for some cases but break others.
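
The question at the top of this comment (when does a group of text positions
form a word?) could also be handled after sorting, as a separate grouping
pass. A rough sketch, assuming the glyphs of one line are already in x order;
the nested Glyph class and the maxGap threshold are hypothetical and not
PDFBox API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: splits one line of glyphs into words by gap, font name and font size.
final class WordGrouper
{
    // Hypothetical stand-in for PDFBox's TextPosition (only the fields used here).
    static final class Glyph
    {
        final String text;
        final String fontName;
        final float fontSize;
        final float x;
        final float width;

        Glyph(String text, String fontName, float fontSize, float x, float width)
        {
            this.text = text;
            this.fontName = fontName;
            this.fontSize = fontSize;
            this.x = x;
            this.width = width;
        }
    }

    // A glyph continues the current word if it is close enough to the
    // previous glyph and uses the same font name and font size.
    static boolean sameWord(Glyph prev, Glyph next, float maxGap)
    {
        float gap = next.x - (prev.x + prev.width);
        return gap <= maxGap
                && prev.fontName.equals(next.fontName)
                && Float.compare(prev.fontSize, next.fontSize) == 0;
    }

    // Expects the glyphs of a single line, already sorted by x.
    static List<String> group(List<Glyph> line, float maxGap)
    {
        List<String> words = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line)
        {
            if (prev != null && !sameWord(prev, g, maxGap))
            {
                // close the word collected so far and start a new one
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(g.text);
            prev = g;
        }
        if (current.length() > 0)
        {
            words.add(current.toString());
        }
        return words;
    }
}
```

With this rule a subscript or superscript set in a smaller font size starts a
new word even when it directly follows the previous glyph, which is one of
the cases where a strict font check would break things.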




[jira] [Commented] (PDFBOX-2755) Can't save the change to pdf file

2015-10-06 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944850#comment-14944850
 ] 

Andreas Lehmkühler commented on PDFBOX-2755:


Is there any TODO left for us here?

> Can't save the change to pdf file
> ---------------------------------
>
> Key: PDFBOX-2755
> URL: https://issues.apache.org/jira/browse/PDFBOX-2755
> Project: PDFBox
> Issue Type: Bug
> Components: AcroForm
> Affects Versions: 1.8.9
> Reporter: hui xu
> Priority: Critical
> Attachments: formtestFailed.pdf, formtestOK.pdf
>
>
> I ran SetField.java in package org.apache.pdfbox.examples.fdf; the field value 
> is changed but cannot be saved in the PDF file.
> I added two more lines of code:
> List newList = new ArrayList();
> .
> acroForm.setFields(newList);
> The PDF was then saved with the change. But when I reload the saved PDF file and 
> try to reset some values, it throws a NullPointerException; getFields() fails on the file.






Re: Apache PDFBox October 2015 board report due

2015-10-06 Thread Andreas Lehmkühler


> On 6 October 2015 at 08:18, Maruan Sahyoun wrote:
> 
> 
> Hi,
> > On 06.10.2015 at 08:07, Andreas Lehmkuehler wrote:
> > 
> > On 05.10.2015 at 23:02, Maruan Sahyoun wrote:
> >> Hi,
> >> 
> >> + 1
> >> 
> >> One thing we might want to address is the large number of emails to dev
> >> caused by the commit notifications etc.
> > Hmmm, I'm not sure that I've got your point. Do you want to explain the high
> > number of mails on dev@ compared to users@?
> 
> yes - once the commit messages are removed from dev, the traffic on users is higher
> (which is good)
Many other projects use the dev list the same way, so we don't
have to explain that explicitly.

BR
Andreas

> Maruan
> 
> > 
> >> 
> >> Maruan
> >> 
> >>> On 05.10.2015 at 19:47, Andreas Lehmkuehler wrote:
> >>> 
> >>> Hi,
> >>> 
> >>> find attached a quick draft of the board report we're expected to submit
> >>> this
> >>> month. It's based upon the report template which can be found at [1]
> >>> 
> >>> 
> >>> Any further comments, objections or additions?
> >>> 
> >>> 
> >>> 
> >>> 
> >>> Report from the Apache PDFBox committee [Andreas Lehmkühler]
> >>> 
> >>> ## Description:
> >>>   The Apache PDFBox library is an open source Java tool for working with
> >>>   PDF documents.
> >>> 
> >>> ## Activity:
> >>> - after a long time of hard work we decided to cut a release candidate for
> >>>   2.0.0 this October. As we are down to 6 open tickets I'm quite optimistic
> >>>   that it'll really come true
> >>> - we joined forces with Tim Allison from Apache TIKA to run some bulk tests
> >>>   from time to time to avoid regressions
> >>> 
> >>> ## Health report:
> >>> - there is a steady stream of contributions, bug reports and questions on
> >>>   the mailing lists
> >>> - the core team consists of 4 - 5 active developers
> >>> - we expect to attract more people once our new major release is out of the
> >>>   door
> >>> 
> >>> ## Issues:
> >>> - there are no issues requiring board attention at this time
> >>> 
> >>> ## PMC changes:
> >>> 
> >>> - Currently 16 PMC members.
> >>> - No new PMC members added in the last 3 months
> >>> - Last PMC addition was John Hewson at Thu Feb 06 2014
> >>> 
> >>> ## LDAP changes:
> >>> 
> >>> - Currently 16 committers and 16 committee group members.
> >>> - No new committee group members added in the last 3 months
> >>> - No new committers added in the last 3 months
> >>> - Last committer addition was John Hewson at Fri Feb 07 2014
> >>> 
> >>> ## Releases:
> >>> 
> >>> - 1.8.10 was released on Wed Jul 22 2015
> >>> 
> >>> ## Mailing list activity:
> >>> 
> >>> - us...@pdfbox.apache.org:
> >>>- 497 subscribers (up 6 in the last 3 months):
> >>>- 519 emails sent to list (578 in previous quarter)
> >>> 
> >>> - dev@pdfbox.apache.org:
> >>>- 145 subscribers (down -4 in the last 3 months):
> >>>- 2932 emails sent to list (2594 in previous quarter)
> >>> 
> >>> 
> >>> ## JIRA activity:
> >>> 
> >>> - 151 JIRA tickets created in the last 3 months
> >>> - 143 JIRA tickets closed/resolved in the last 3 months
> >>> 
> >>> 
> >>> 
> >>> 
> >>> BR
> >>> Andreas Lehmkühler
> >>> 
> >>> [1] https://reporter.apache.org/?pdfbox
> >>> 



[jira] [Commented] (PDFBOX-3005) Incorrect property names for lists

2015-10-06 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944721#comment-14944721
 ] 

Maruan Sahyoun commented on PDFBOX-3005:


Thanks for your feedback. The check for regressions will take a little longer 
as the list of files is huge.

> Incorrect property names for lists
> ----------------------------------
>
> Key: PDFBOX-3005
> URL: https://issues.apache.org/jira/browse/PDFBOX-3005
> Project: PDFBox
> Issue Type: Bug
> Components: Writing, XmpBox
> Affects Versions: 2.0.0
> Reporter: Evgeniy Muravitskiy
> Assignee: Maruan Sahyoun
> Fix For: 2.0.0
>
> Attachments: isis09_04.pdf, restest.xml
>
>
> When I write code as follows:
> {code}
> PDDocument document = PDDocument.load(new File(FILE_PATH));
> DomXmpParser parser = new DomXmpParser();
> XMPMetadata metadata = 
>     parser.parse(document.getDocumentCatalog().getMetadata().getStream().getUnfilteredStream());
> metadata.removeSchema(metadata.getPDFIdentificationSchema());
> OutputStream res = new FileOutputStream(RESULT_XML);
> new XmpSerializer().serialize(metadata, res, true);
> {code}
> I get XML which contains the following tag:
> {code}
> 
> 
> <rdf:creator>Tomioka, Satoshi</rdf:creator>
> 
> 
> {code}
> but instead of {{rdf:creator}} it must be {{rdf:li}}. This problem is also 
> reproducible for other DublinCoreSchema properties which contain lists.






[jira] [Commented] (PDFBOX-3005) Incorrect property names for lists

2015-10-06 Thread Evgeniy Muravitskiy (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944671#comment-14944671
 ] 

Evgeniy Muravitskiy commented on PDFBOX-3005:
-

Yes, your changes fix the problem. Thanks.




Re: Apache PDFBox October 2015 board report due

2015-10-06 Thread Timo Boehme

Hi,

+1

Best,
Timo




--
Timo Boehme
OntoChem IT Solutions GmbH
Blücherstraße 24
06120 Halle (Saale)
Germany

phone: +49 345 478 047 4  | fax: +49 345 478 047 1
email: ulf.la...@ontochem.com | web: www.ontochem.com
HRB 21962 Amtsgericht Stendal | USt-IdNr.: DE815563824
managing director : Lutz Weber

