date:20130327

Re: Overhaul PDFBox site

2013-03-27 Thread Andreas Lehmkuehler


Hi,

On 27.03.2013 07:23, Maruan Sahyoun wrote:

> thx for the offer to help. I think it's needed :-)

I'm happy that someone is eager enough to do some of the not so cool things
like documentation. ;-)

> I already read about the Apache CMS and svnpubsub and think that the CMS is
> the way to go although it's initially a little more effort. One of the major
> benefits of the CMS is that non-technical users can use it (the web UI) and
> it's easier for non-committers to contribute [1].

+1, nothing to add here


> As soon as I get a go to move forward I'll open a ticket on Jira to track the
> status of the move. The initial step will be to get myself familiar with the
> tools to build the site as described in [2].

It looks like people have gathered a lot of CMS-related information, so you
should find everything you need. If some piece is missing, try asking for it
on one of the mentioned mailing lists.

Maybe some of the steps will need PMC-chair power (I saw a form to request the
svn space for the CMS which is limited to chairs), so I'll have to step in.

IMHO the following steps have to be done (this list is not exhaustive):

- migrate the existing xdoc files to markdown (maybe there is a script doing
  that?)
- create a new content structure following the CMS needs (there is a minimal
  example on the cms-wiki-page)
- get in touch with infra to request all the things needed for the transition
  (AFAIU several steps are needed)



> I propose the migration to reuse
> the current content and most of the current navigation and optimize at a later
> stage but making a clearer distinction between users of and developers for
> pdfbox.

I'm not sure which way is the best, but I'm sure you'll find out. :-)

> Maruan Sahyoun
>
> [1] http://www.apache.org/dev/cmsref.html#non-committer
> [2] http://www.apache.org/dev/cmsref.html


BR
Andreas Lehmkühler

[jira] [Commented] (PDFBOX-1542) Whitespaces between words are not created

2013-03-27 Thread Vitalie Bureanu (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614995#comment-13614995
 ] 

Vitalie Bureanu commented on PDFBOX-1542:
-

Thank you very much, Andreas! We will try to use PDFTextStripper to insert
whitespace!

> Whitespaces between words are not created
> -
>
> Key: PDFBOX-1542
> URL: https://issues.apache.org/jira/browse/PDFBOX-1542
> Project: PDFBox
>  Issue Type: Wish
>  Components: Text extraction
>Affects Versions: 1.7.1
>Reporter: Vitalie Bureanu
>Priority: Minor
> Attachments: Parser.java
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Hello, I extract text from PDF files with PDFBox. I noticed that text
> extraction from some PDF files is not as good as expected. I have a
> series of PDF invoices from which I try to extract the text with coordinates,
> and the result is pretty good, but I noticed a very strange thing: when I
> extract text, the words are extracted without whitespace between them.
> Example: if I try to extract "Unit Price" the result is "UnitPrice".
> But if I open the invoice in Adobe Reader and copy/paste into
> Notepad... I get "Unit Price" with whitespace!
> I think the whitespace is not present in the original PDF document... but
> Adobe Reader somehow "inserts" whitespace between words when it shows the
> content of the PDF.
>
> Guys, can you please suggest how I can get the strings with spaces after
> parsing?
> See example of invoice here: http://www.cloudforpeople.com/Invoice1.pdf
> PS: I want to try version 1.8.0 of PDFBox - how can I download it?
> Many thanks,
> Vitalie
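
Editor's sketch: when a PDF contains no explicit space characters, a common
approach is to infer word breaks from the horizontal gap between neighbouring
glyphs. The Java below is only an illustration of that idea under stated
assumptions. The class GapSpacing, the Glyph type and the minGap threshold
are hypothetical names chosen for this example; this is not PDFBox code and
not the attached Parser.java.

```java
import java.util.ArrayList;
import java.util.List;

public class GapSpacing {

    /** One extracted piece of text with its left edge and width (same units). Hypothetical type. */
    public static final class Glyph {
        final String text;
        final float x;
        final float width;

        public Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins glyphs that are already sorted left to right on one line.
     * A space is inserted whenever the gap between the right edge of the
     * previous glyph and the left edge of the next one exceeds minGap.
     */
    public static String joinWithSpaces(List<Glyph> line, float minGap) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null && g.x - (prev.x + prev.width) > minGap) {
                sb.append(' ');
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<Glyph>();
        line.add(new Glyph("Unit", 0f, 20f));   // ends at x = 20
        line.add(new Glyph("Price", 23f, 25f)); // starts at x = 23, gap of 3
        System.out.println(joinWithSpaces(line, 1.5f));
    }
}
```

The threshold is the sensitive part: too small and letters inside a word get
split, too large and words run together. A fraction of the font's space width
is a typical starting point, but that choice is an assumption here.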

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: Overhaul PDFBox site

2013-03-27 Thread Timo Boehme


Hi,

moving to the CMS is IMHO a good idea.
To speed up the process of getting consensus here are my
+1
for doing the transition. Thanks Maruan for volunteering for this task.


Best,
Timo


On 27.03.2013 00:10, Andreas Lehmkuehler wrote:

Hi,

On 26.03.2013 23:04, Maruan Sahyoun wrote:

would be happy to handle that

Cool! I'll try to help whenever possible.

OK, I guess we don't need a formal vote on moving our site to the CMS, but
let's wait a couple of days so that everybody has a chance to object.

@Maruan
Once we have lazy consensus we/you can start with the preparations. Please try
to find out how we should start/proceed. I hope you'll find all you need
using the pointers I gave in my earlier post.



Maruan Sahyoun

On 26.03.2013 at 22:35, Andreas Lehmkuehler wrote:


Hi,

On 26.03.2013 17:00, Maruan Sahyoun wrote:

well - the navigation is similar, also hidden behind drop-downs on ode
compared to cloudstack. Both are using the same css framework [1] and the
navigation can even be combined - that should give us enough freedom (and
is an implementation detail). Both seem to be using the Apache CMS [2].

I guess we all know that we have to overhaul the content itself. :-)

But first of all we have to decide how to manage the content. We have to use
either svnpubsub or the Apache CMS [1]; the latter is recommended. IMHO
we should use the CMS [2] as it would be more flexible and it is easier to
maintain the content.

As a good starting point I've changed the maven skin of our site to the
bootstrap like fluendo skin [3]. Maybe it is a good idea to freshen up the
layout a little bit in preparation for a possible transition to the CMS.

WDYT? And the more interesting question: any volunteer to handle the
transition?

BR
Andreas Lehmkühler

[1] http://www.apache.org/dev/project-site.html
[2] http://www.apache.org/dev/cmsref.html
[3] http://people.apache.org/~lehmi/pdfbox_fluendo/index.html


BR
Andreas Lehmkühler




--

 Timo Boehme
 OntoChem GmbH
 H.-Damerow-Str. 4
 06120 Halle/Saale
 T: +49 345 4780474
 F: +49 345 4780471
 timo.boe...@ontochem.com

_

 OntoChem GmbH
 Geschäftsführer: Dr. Lutz Weber
 Sitz: Halle / Saale
 Registergericht: Stendal
 Registernummer: HRB 215461
_

[jira] [Created] (PDFBOX-1553) Offset of extracted coordinates

2013-03-27 Thread Vitalie Bureanu (JIRA)

Vitalie Bureanu created PDFBOX-1553:
---

 Summary: Offset of extracted coordinates
 Key: PDFBOX-1553
 URL: https://issues.apache.org/jira/browse/PDFBOX-1553
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 1.8.0
 Environment: Linux Ubuntu 64 bit, Java
Reporter: Vitalie Bureanu


Hello,

Preamble: We are glad to use PDFBox and I am personally grateful to all
developers who sustain this project. It is good work, guys!

We have one problem. For our application we extract text from PDFs "char by
char" with the respective coordinates for each char (see attached Parser).
After this we group the chars into words. We noticed that for some PDF
documents we get a strange "offset" in the extracted coordinates (see screens).

The offset is incremental: at the top left corner of the document the
extracted coordinates are close to the real coordinates of the character, but
at the bottom right corner the offset is nearly 0.5 cm.
If I make a selection in Adobe Reader, everything seems OK.

I attached two PDF files with the offset to this post.
If you want to see the offset "in action" you can use our service at
http://pdf2data.cloudforpeople.com/ (Please do not consider it as advertising)





[jira] [Updated] (PDFBOX-1553) Offset of extracted coordinates

2013-03-27 Thread Vitalie Bureanu (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1553:


Priority: Minor  (was: Major)

> Offset of extracted coordinates
> ---
>
> Key: PDFBOX-1553
> URL: https://issues.apache.org/jira/browse/PDFBOX-1553
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 1.8.0
> Environment: Linux Ubuntu 64 bit, Java
>Reporter: Vitalie Bureanu
>Priority: Minor
>  Labels: offset
> Attachments: EnSt10_offset.pdf, EnSt11_offset.pdf, Extracted 
> coordinates of rects.jpg, Parser.java, Selection in Adobe Reader.png
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hello,
> Preamble: We are glad to use PDFBox and I am personally grateful to all
> developers who sustain this project. It is good work, guys!
> We have one problem. For our application we extract text from PDFs "char by
> char" with the respective coordinates for each char (see attached Parser).
> After this we group the chars into words. We noticed that for some PDF
> documents we get a strange "offset" in the extracted coordinates (see screens).
> The offset is incremental: at the top left corner of the document the
> extracted coordinates are close to the real coordinates of the character, but
> at the bottom right corner the offset is nearly 0.5 cm.
> If I make a selection in Adobe Reader, everything seems OK.
> I attached two PDF files with the offset to this post.
> If you want to see the offset "in action" you can use our service at
> http://pdf2data.cloudforpeople.com/ (Please do not consider it as advertising)


[jira] [Updated] (PDFBOX-1553) Offset of extracted coordinates

2013-03-27 Thread Vitalie Bureanu (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1553:


Attachment: Selection in Adobe Reader.png
Extracted coordinates of rects.jpg
Parser.java
EnSt11_offset.pdf
EnSt10_offset.pdf

> Offset of extracted coordinates
> ---
>
> Key: PDFBOX-1553
> URL: https://issues.apache.org/jira/browse/PDFBOX-1553
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 1.8.0
> Environment: Linux Ubuntu 64 bit, Java
>Reporter: Vitalie Bureanu
>  Labels: offset
> Attachments: EnSt10_offset.pdf, EnSt11_offset.pdf, Extracted 
> coordinates of rects.jpg, Parser.java, Selection in Adobe Reader.png
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hello,
> Preamble: We are glad to use PDFBox and I am personally grateful to all
> developers who sustain this project. It is good work, guys!
> We have one problem. For our application we extract text from PDFs "char by
> char" with the respective coordinates for each char (see attached Parser).
> After this we group the chars into words. We noticed that for some PDF
> documents we get a strange "offset" in the extracted coordinates (see screens).
> The offset is incremental: at the top left corner of the document the
> extracted coordinates are close to the real coordinates of the character, but
> at the bottom right corner the offset is nearly 0.5 cm.
> If I make a selection in Adobe Reader, everything seems OK.
> I attached two PDF files with the offset to this post.
> If you want to see the offset "in action" you can use our service at
> http://pdf2data.cloudforpeople.com/ (Please do not consider it as advertising)


[jira] [Updated] (PDFBOX-1553) Offset of extracted coordinates

2013-03-27 Thread Vitalie Bureanu (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1553:


Description: 
Hello,

Preamble: We are glad to use PDFBox and I am personally grateful to all
developers who sustain this project. It is good work, guys!

We have one problem. For our application we extract text from PDFs "char by
char" with the respective coordinates for each char (see attached Parser).
After this we group the chars into words. We noticed that for some PDF
documents we get a strange "offset" in the extracted rect coordinates (see
screens).

The offset seems to be incremental (not sure): at the top left corner of the
document the extracted coordinates are close to the real coordinates of the
character, but at the bottom right corner the offset is nearly 0.5 cm.
If I make a selection in Adobe Reader, everything seems OK.

I attached two PDF files with the offset to this post.
If you want to see the offset "in action" you can use our service at
http://pdf2data.cloudforpeople.com/ (Please do not consider it as advertising)

Please can you test these files and tell me if it is really a bug?


  was:
Hello,

Preamble: We are glad to use PDFBox and I personally grateful to all developers 
who sustain this project. It is good work, guys!

We have one problem. For our application purposes we extract from pdf "char by 
char" with rispective coordinates for each char. (see attached Parser)
After this we group chars into the words. We noticed that for some pdf 
documents we have a strange "offset" for extracted coordinates. (see screens)

The offset is incremental - at left top corner of document is near to real 
coordinates of charcater, but at right bottom corner is near to 0.5 cm..
If I make selection in Adobe Reader - it seems all ok.

I attached two pdf files with offset to this post.
If you want to see the offset "in action" you can use our service to do it at 
http://pdf2data.cloudforpeople.com/ (Please do not consider it as advertising)





> Offset of extracted coordinates
> ---
>
> Key: PDFBOX-1553
> URL: https://issues.apache.org/jira/browse/PDFBOX-1553
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 1.8.0
> Environment: Linux Ubuntu 64 bit, Java
>Reporter: Vitalie Bureanu
>Priority: Minor
>  Labels: offset
> Attachments: EnSt10_offset.pdf, EnSt11_offset.pdf, Extracted 
> coordinates of rects.jpg, Parser.java, Selection in Adobe Reader.png
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hello,
> Preamble: We are glad to use PDFBox and I am personally grateful to all
> developers who sustain this project. It is good work, guys!
> We have one problem. For our application we extract text from PDFs "char by
> char" with the respective coordinates for each char (see attached Parser).
> After this we group the chars into words. We noticed that for some PDF
> documents we get a strange "offset" in the extracted rect coordinates (see
> screens).
> The offset seems to be incremental (not sure): at the top left corner of the
> document the extracted coordinates are close to the real coordinates of the
> character, but at the bottom right corner the offset is nearly 0.5 cm.
> If I make a selection in Adobe Reader, everything seems OK.
> I attached two PDF files with the offset to this post.
> If you want to see the offset "in action" you can use our service at
> http://pdf2data.cloudforpeople.com/ (Please do not consider it as advertising)
> Please can you test these files and tell me if it is really a bug?


[jira] [Updated] (PDFBOX-1553) Offset of extracted coordinates

2013-03-27 Thread Vitalie Bureanu (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1553:


Description: 
Hello,

Preamble: We are glad to use PDFBox and I am personally grateful to all
developers who sustain this project. It is good work, guys!

We have one problem. For our application we extract text from PDFs "char by
char" with the respective coordinates for each char (see attached Parser).
After this we group the chars into words. We noticed that for some PDF
documents we get a strange "offset" in the extracted rect coordinates (see
screens).

The offset seems to be incremental (not sure): at the top left corner of the
document the extracted coordinates are close to the real coordinates of the
character, but at the bottom right corner the offset is nearly 0.5 cm.
If I make a selection in Adobe Reader, everything seems OK.

I attached two PDF files with the offset to this post.
If you want to see the offset "in action" you can use our service at
http://pdf2data.cloudforpeople.com/ (Please do not consider it as advertising)

Please can you test these files and tell me if it is really a bug?
How can we resolve it?

Thanks,
Vitalie


  was:
Hello,

Preamble: We are glad to use PDFBox and I personally grateful to all developers 
who sustain this project. It is good work, guys!

We have one problem. For our application purposes we extract from pdf "char by 
char" with rispective coordinates for each char. (see attached Parser)
After this we group chars into the words. We noticed that for some pdf 
documents we have a strange "offset" for extracted rect coordinates. (see 
screens)

The offset is seems to be incremental (not sure) - at left top corner of 
document is near to real coordinates of character, but at right bottom corner 
is near to 0.5 cm..
If I make selection in Adobe Reader - it seems all ok.

I attached two pdf files with offset to this post.
If you want to see the offset "in action" you can use our service to do it at 
http://pdf2data.cloudforpeople.com/ (Please do not consider it as advertising)

Please can you test these files and tell me if it is a really bug?



> Offset of extracted coordinates
> ---
>
> Key: PDFBOX-1553
> URL: https://issues.apache.org/jira/browse/PDFBOX-1553
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 1.8.0
> Environment: Linux Ubuntu 64 bit, Java
>Reporter: Vitalie Bureanu
>Priority: Minor
>  Labels: offset
> Attachments: EnSt10_offset.pdf, EnSt11_offset.pdf, Extracted 
> coordinates of rects.jpg, Parser.java, Selection in Adobe Reader.png
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hello,
> Preamble: We are glad to use PDFBox and I am personally grateful to all
> developers who sustain this project. It is good work, guys!
> We have one problem. For our application we extract text from PDFs "char by
> char" with the respective coordinates for each char (see attached Parser).
> After this we group the chars into words. We noticed that for some PDF
> documents we get a strange "offset" in the extracted rect coordinates (see
> screens).
> The offset seems to be incremental (not sure): at the top left corner of the
> document the extracted coordinates are close to the real coordinates of the
> character, but at the bottom right corner the offset is nearly 0.5 cm.
> If I make a selection in Adobe Reader, everything seems OK.
> I attached two PDF files with the offset to this post.
> If you want to see the offset "in action" you can use our service at
> http://pdf2data.cloudforpeople.com/ (Please do not consider it as advertising)
> Please can you test these files and tell me if it is really a bug?
> How can we resolve it?
> Thanks,
> Vitalie


[jira] [Resolved] (PDFBOX-1547) TextPosition.getX() and getY() do not work properly with CropBox

2013-03-27 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-1547.


   Resolution: Fixed
Fix Version/s: 1.9.0
 Assignee: Andreas Lehmkühler

You are right, we have to use the cropbox instead of the mediabox. The 
PDGraphicsState already uses the cropbox. I've fixed that in revision 1461796.

Thanks for the report and the pointer!

> TextPosition.getX() and getY() do not work properly with CropBox
> 
>
> Key: PDFBOX-1547
> URL: https://issues.apache.org/jira/browse/PDFBOX-1547
> Project: PDFBox
>  Issue Type: Bug
>Reporter: Juraj Lonc
>Assignee: Andreas Lehmkühler
> Fix For: 1.9.0
>
> Attachments: redig_test_crop3.pdf
>
>
> TextPosition.getX() and getY() are supposed to calculate the position relative
> to the upper left corner of the page.
> When a PDF contains a CropBox, these functions return incorrect values: the
> CropBox is ignored.
> Text is relative to CropBox coordinates, but the calculations are made only
> with pageWidth and pageHeight, and that is wrong.
> Does "page" in the function description mean MediaBox or CropBox?
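
Editor's sketch: the arithmetic behind this report can be shown in a few lines
of Java. PDF user space has its origin at the bottom left with y pointing up,
so a position "relative to the upper left corner" depends on which box supplies
that corner. The class CropBoxCoords, the Box type and the sample numbers are
hypothetical and only illustrate the idea; this is not the change made in
revision 1461796, and page rotation is ignored.

```java
public class CropBoxCoords {

    /** Axis-aligned box in PDF user space (origin bottom left, y pointing up). Hypothetical type. */
    public static final class Box {
        final float lowerLeftX;
        final float lowerLeftY;
        final float upperRightX;
        final float upperRightY;

        public Box(float lowerLeftX, float lowerLeftY, float upperRightX, float upperRightY) {
            this.lowerLeftX = lowerLeftX;
            this.lowerLeftY = lowerLeftY;
            this.upperRightX = upperRightX;
            this.upperRightY = upperRightY;
        }
    }

    /**
     * Converts a user-space point into coordinates measured from the upper
     * left corner of the given box, with y pointing down.
     */
    public static float[] toUpperLeftOrigin(float x, float y, Box box) {
        return new float[] { x - box.lowerLeftX, box.upperRightY - y };
    }

    public static void main(String[] args) {
        Box mediaBox = new Box(0f, 0f, 612f, 792f);    // sample values only
        Box cropBox = new Box(50f, 100f, 562f, 692f);  // sample values only

        float[] viaMedia = toUpperLeftOrigin(100f, 600f, mediaBox);
        float[] viaCrop = toUpperLeftOrigin(100f, 600f, cropBox);

        System.out.println("relative to MediaBox: " + viaMedia[0] + ", " + viaMedia[1]);
        System.out.println("relative to CropBox:  " + viaCrop[0] + ", " + viaCrop[1]);
    }
}
```

With these sample boxes the same point lands at (100, 192) measured from the
MediaBox corner but at (50, 92) measured from the CropBox corner, which is the
kind of discrepancy the report describes.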


[jira] [Commented] (PDFBOX-1553) Offset of extracted coordinates

2013-03-27 Thread JIRA


[ 
https://issues.apache.org/jira/browse/PDFBOX-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615638#comment-13615638
 ] 

Andreas Lehmkühler commented on PDFBOX-1553:


Maybe your issue is related to PDFBOX-1547 as your pdf has a cropbox too. Can 
you check this?

> Offset of extracted coordinates
> ---
>
> Key: PDFBOX-1553
> URL: https://issues.apache.org/jira/browse/PDFBOX-1553
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 1.8.0
> Environment: Linux Ubuntu 64 bit, Java
>Reporter: Vitalie Bureanu
>Priority: Minor
>  Labels: offset
> Attachments: EnSt10_offset.pdf, EnSt11_offset.pdf, Extracted 
> coordinates of rects.jpg, Parser.java, Selection in Adobe Reader.png
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hello,
> Preamble: We are glad to use PDFBox and I am personally grateful to all
> developers who sustain this project. It is good work, guys!
> We have one problem. For our application we extract text from PDFs "char by
> char" with the respective coordinates for each char (see attached Parser).
> After this we group the chars into words. We noticed that for some PDF
> documents we get a strange "offset" in the extracted rect coordinates (see
> screens).
> The offset seems to be incremental (not sure): at the top left corner of the
> document the extracted coordinates are close to the real coordinates of the
> character, but at the bottom right corner the offset is nearly 0.5 cm.
> If I make a selection in Adobe Reader, everything seems OK.
> I attached two PDF files with the offset to this post.
> If you want to see the offset "in action" you can use our service at
> http://pdf2data.cloudforpeople.com/ (Please do not consider it as advertising)
> Please can you test these files and tell me if it is really a bug?
> How can we resolve it?
> Thanks,
> Vitalie


[jira] [Updated] (PDFBOX-1402) Improve handling of multiline text boxes

2013-03-27 Thread Will May (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will May updated PDFBOX-1402:
-

Attachment: (was: PDAppearanceTest.java)

> Improve handling of multiline text boxes
> 
>
> Key: PDFBOX-1402
> URL: https://issues.apache.org/jira/browse/PDFBOX-1402
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 1.7.1
>Reporter: Will May
> Attachments: multiLine.patch
>
>
> The current implementation for setting the appearance of content that is 
> added to a multiline text box is incorrect in a number of ways:
> * Doesn't position the start of the text in the correct location
> * Incorrectly uses font size '0' instead of auto-sizing the font
> * Doesn't break up very long lines
> * If the font size is very large, then the next line is started too close to 
> the previous line.
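
Editor's sketch: two of the points above (breaking long lines, and spacing
lines in proportion to the font size) can be illustrated with a greedy word
wrap. The Java below is not the attached multiLine.patch and does not use
PDFBox. The class name, the AVG_GLYPH_WIDTH constant (a stand-in for real font
metrics) and the 1.2 leading factor are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class MultilineSketch {

    // Assumed average glyph width as a fraction of the font size.
    // A real implementation would ask the font for per-glyph widths.
    static final float AVG_GLYPH_WIDTH = 0.5f;

    /** Rough width estimate of a string at the given font size. */
    static float textWidth(String s, float fontSize) {
        return s.length() * AVG_GLYPH_WIDTH * fontSize;
    }

    /**
     * Greedy wrap: keep adding words to the current line until the next word
     * would exceed boxWidth, then start a new line. A single word wider than
     * the box is left on its own line rather than split.
     */
    public static List<String> wrap(String text, float fontSize, float boxWidth) {
        List<String> lines = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            String candidate = current.length() == 0 ? word : current + " " + word;
            if (current.length() > 0 && textWidth(candidate, fontSize) > boxWidth) {
                lines.add(current.toString());
                current = new StringBuilder(word);
            } else {
                current = new StringBuilder(candidate);
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    /** Distance between baselines, scaled with the font size (assumed factor 1.2). */
    public static float leading(float fontSize) {
        return 1.2f * fontSize;
    }

    public static void main(String[] args) {
        for (String line : wrap("the quick brown fox", 10f, 50f)) {
            System.out.println(line);
        }
        System.out.println("leading at 10pt: " + leading(10f));
    }
}
```

Because the leading is derived from the font size instead of being a fixed
value, a large font automatically pushes the next line further down.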


[jira] [Updated] (PDFBOX-1402) Improve handling of multiline text boxes

2013-03-27 Thread Will May (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will May updated PDFBOX-1402:
-

Attachment: (was: test.pdf)

> Improve handling of multiline text boxes
> 
>
> Key: PDFBOX-1402
> URL: https://issues.apache.org/jira/browse/PDFBOX-1402
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 1.7.1
>Reporter: Will May
> Attachments: multiLine.patch
>
>
> The current implementation for setting the appearance of content that is 
> added to a multiline text box is incorrect in a number of ways:
> * Doesn't position the start of the text in the correct location
> * Incorrectly uses font size '0' instead of auto-sizing the font
> * Doesn't break up very long lines
> * If the font size is very large, then the next line is started too close to 
> the previous line.


[jira] [Commented] (PDFBOX-1402) Improve handling of multiline text boxes

2013-03-27 Thread Will May (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615790#comment-13615790
 ] 

Will May commented on PDFBOX-1402:
--

Just realised that the example I attached isn't a very good one, and the PDF I
used to test the change against cannot be freely distributed.
The patch still applies okay. Is there anyone who would be able to apply this
patch?

> Improve handling of multiline text boxes
> 
>
> Key: PDFBOX-1402
> URL: https://issues.apache.org/jira/browse/PDFBOX-1402
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 1.7.1
>Reporter: Will May
> Attachments: multiLine.patch
>
>
> The current implementation for setting the appearance of content that is 
> added to a multiline text box is incorrect in a number of ways:
> * Doesn't position the start of the text in the correct location
> * Incorrectly uses font size '0' instead of auto-sizing the font
> * Doesn't break up very long lines
> * If the font size is very large, then the next line is started too close to 
> the previous line.


[jira] [Commented] (PDFBOX-1273) java.io.IOException: Error: Unknown annotation type null

2013-03-27 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615797#comment-13615797
 ] 

Michael McCandless commented on PDFBOX-1273:


Looks like this is the same issue as TIKA-1098.

> java.io.IOException: Error: Unknown annotation type null
> 
>
> Key: PDFBOX-1273
> URL: https://issues.apache.org/jira/browse/PDFBOX-1273
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 1.7.0
>Reporter: William
>Priority: Minor
> Attachments: PDPageQuickFix.patch
>
>
> Hi,
> I've come across the following exception on a very small number of documents:
> org.apache.tika.exception.TikaException: Unable to extract PDF content
>     at org.apache.pdfbox.tika.PDF2XHTML.process(PDF2XHTML.java:80) ~[extractor.jar:na]
>     at org.apache.pdfbox.tika.PDFParser.parse(PDFParser.java:116) ~[extractor.jar:na]
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) ~[extractor.jar:na]
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) ~[extractor.jar:na]
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) ~[extractor.jar:na]
> Caused by: java.io.IOException: Error: Unknown annotation type null
>     at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:165) ~[extractor.jar:na]
>     at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:785) ~[extractor.jar:na]
>     at org.apache.pdfbox.tika.PDF2XHTML.endPage(PDF2XHTML.java:142) ~[extractor.jar:na]
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:450) ~[extractor.jar:na]
>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:372) ~[extractor.jar:na]
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:328) ~[extractor.jar:na]
>     at org.apache.pdfbox.tika.PDF2XHTML.process(PDF2XHTML.java:63) ~[extractor.jar:na]
> Here are a few examples:
> http://www.jdsupra.com/documents/01ece854-a961-4184-8de7-f6d5311d6a48.pdf
> http://www.jdsupra.com/documents/0aabecb4-094a-40e4-a507-8b49ecb90a3e.pdf
> http://www.jdsupra.com/documents/0d74ccf8-2d57-487d-88c2-98eee26f8236.pdf
> Thanks

