[
https://issues.apache.org/jira/browse/PDFBOX-448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069005#comment-13069005
]
Mel Martinez commented on PDFBOX-448:
-------------------------------------
By default, PDFTextStripper has its "shouldSeparateByBeads" attribute set to
"true", which means that it tries to extract text that flows from one column
to another as contiguous text. It therefore extracts the text from column 1
first, followed by the text from column 2.
If you set that flag to 'false', the stripper will try to extract the beads in
rendered order, 'rendering' the vertically correlated lines from each column
side by side, i.e., on the same line.
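
To make the difference between the two settings concrete, here is a toy model
in Java. It is only a sketch and does not call PDFBox: the class name
BeadOrderSketch and both method names are made up for illustration, and each
row simply holds the text that the two columns contribute to one visual line.
In real code the flag is toggled with
PDFTextStripper.setShouldSeparateByBeads(boolean).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: this does not call PDFBox. rows[i][0] is the text of
// column 1 on visual line i, rows[i][1] is the text of column 2.
final class BeadOrderSketch {

    // Roughly what shouldSeparateByBeads == true aims for:
    // all of column 1 first, then all of column 2.
    static List<String> byBeads(String[][] rows) {
        List<String> out = new ArrayList<>();
        for (int col = 0; col < 2; col++) {
            for (String[] row : rows) {
                out.add(row[col]);
            }
        }
        return out;
    }

    // Roughly what shouldSeparateByBeads == false produces:
    // both columns side by side on each output line.
    static List<String> byRenderedLine(String[][] rows, String wordSeparator) {
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {
            out.add(row[0] + wordSeparator + row[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] rows = { {"A1", "B1"}, {"A2", "B2"} };
        System.out.println(byBeads(rows));             // [A1, A2, B1, B2]
        System.out.println(byRenderedLine(rows, " ")); // [A1 B1, A2 B2]
    }
}
```

In the second output nothing distinguishes "A1" from "B1" other than an
ordinary word separator.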
However, the text extraction does not currently mark the point where the text
in a line stops coming from the first bead and starts coming from the second.
So currently it is not possible to tell which words in the line came from
which column.
The writePage() code detects a gap in a line of words and inserts the singleton
WordSeparator object between the words. When the text is 'rendered', that
object is replaced with the return value of the 'getWordSeparator()' method
(which can be modified using the 'setWordSeparator(String)' method). It may be
possible to do something similar to detect the bead change.
That is, if we detect that the bead count has been incremented since the last
insert of a WordSeparator, we could also insert a 'BeadSeparator'. We could
then similarly add the ability to customize which string is used to render the
BeadSeparator (it would default to an empty string to maintain the current
behavior).
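
The idea could be sketched roughly as follows. This is a hypothetical,
self-contained illustration and not PDFBox code: BeadSeparatorSketch,
BEAD_SEPARATOR, setBeadSeparator, assemble and render are all invented names,
and the sketch assumes that every pair of adjacent words gets a word separator
and that the bead index of each word is already known.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names exist in PDFBox.
final class BeadSeparatorSketch {
    // Singleton markers placed in the line while it is assembled.
    static final Object WORD_SEPARATOR = new Object();
    static final Object BEAD_SEPARATOR = new Object();

    private String wordSeparator = " ";
    private String beadSeparator = ""; // empty default keeps the existing output

    void setBeadSeparator(String separator) {
        beadSeparator = separator;
    }

    // words[i] comes from bead beads[i]; together they describe one rendered line.
    List<Object> assemble(String[] words, int[] beads) {
        List<Object> line = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                line.add(WORD_SEPARATOR);
                // The bead changed since the previous word: mark it as well.
                if (beads[i] != beads[i - 1]) {
                    line.add(BEAD_SEPARATOR);
                }
            }
            line.add(words[i]);
        }
        return line;
    }

    // Replace each marker with its configured string when rendering.
    String render(List<Object> line) {
        StringBuilder sb = new StringBuilder();
        for (Object item : line) {
            if (item == WORD_SEPARATOR) {
                sb.append(wordSeparator);
            } else if (item == BEAD_SEPARATOR) {
                sb.append(beadSeparator);
            } else {
                sb.append(item);
            }
        }
        return sb.toString();
    }
}
```

With the default empty bead separator the rendered line is unchanged; after
something like setBeadSeparator("|"), the point where the line crosses from
one column into the next becomes visible in the output.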
I unfortunately do not have time to work on this myself right now. If someone
else wants to run with this idea and try to implement it, that would be cool.
For most users, the default behavior of 'shouldSeparateByBeads==true'
accomplishes what is needed because it tries to keep the text logically
contiguous. Are you sure this isn't what you want?
> Columns in text not extracted separately.
> -------------------------------------------
>
> Key: PDFBOX-448
> URL: https://issues.apache.org/jira/browse/PDFBOX-448
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Reporter: Brian Carrier
> Attachments: WBPaper00003120.pdf
>
>
> The paper that is attached to PDFBOX-80 has two columns of text, but the
> extracted text is not separated by column. Instead it combines the text in
> each column on each line.
> PDFTextStripper has a notion of columns and "articles / beads", but they are
> not being used with this file.