[jira] [Commented] (PDFBOX-448) Columns in text not extracted separately.

Mel Martinez (JIRA) Thu, 21 Jul 2011 07:42:23 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069005#comment-13069005
 ]


Mel Martinez commented on PDFBOX-448:
-------------------------------------

By default, PDFTextStripper has its "shouldSeparateByBeads" attribute set to 
"true", which means that it will try to extract the text flowing from one column 
to another as contiguous text. Thus it will extract/render the text from 
column 1 first, followed by the text from column 2. 

If you set that flag to 'false', the stripper will try to extract the beads in 
rendered order, 'rendering' the vertically correlated lines from each column 
side by side, i.e., on the same line.
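
To make the difference between the two orderings concrete, here is a toy 
model in plain Java. It does not call PDFBox at all: the class 
ColumnOrderSketch and its methods are hypothetical names invented for this 
illustration, and a page is assumed to be a simple grid of rows[line][column] 
strings, which is far simpler than what the stripper actually works with.

```java
/** Toy model only; nothing here is part of PDFBox. */
final class ColumnOrderSketch {

    /** Bead order: every line of column 0, then every line of column 1, etc. */
    static String beadOrder(String[][] rows) {
        StringBuilder out = new StringBuilder();
        int columns = rows.length == 0 ? 0 : rows[0].length;
        for (int c = 0; c < columns; c++) {
            for (String[] row : rows) {
                out.append(row[c]).append('\n');
            }
        }
        return out.toString();
    }

    /** Rendered order: each visual line with its columns side by side. */
    static String renderedOrder(String[][] rows) {
        StringBuilder out = new StringBuilder();
        for (String[] row : rows) {
            out.append(String.join(" ", row)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two lines of text, two columns (A = left column, B = right column).
        String[][] rows = { {"A1", "B1"}, {"A2", "B2"} };
        System.out.print(beadOrder(rows));     // A1, A2, B1, B2 on separate lines
        System.out.print(renderedOrder(rows)); // "A1 B1" then "A2 B2"
    }
}
```

The first method roughly corresponds to the 'true' setting described above, 
the second to the 'false' setting.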

However, the text extraction does not currently mark where the text in a line 
stops coming from the first bead and starts coming from the second. So it is 
currently not possible to tell which words in the line came from which column.

The writePage() code detects a gap in a line of words and inserts the singleton 
WordSeparator object between words. When the text is 'rendered', that object is 
replaced with the return value of the 'getWordSeparator()' method (which can be 
modified using the 'setWordSeparator(String)' method). It may be possible to 
do something similar to detect the bead change.

That is, if we detect that the bead count has been incremented since the last 
WordSeparator was inserted, we could also insert a 'BeadSeparator'. We could 
then similarly add a way to customize the string used to render the 
BeadSeparator (it would default to an empty string to maintain the current 
behavior).
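
A minimal sketch of that idea, written as standalone Java rather than as a 
patch to writePage(). Everything in it is an assumption made for 
illustration: the class BeadSeparatorSketch, the Word type carrying a bead 
index, and the renderLine() method do not exist in PDFBox, and the real 
stripper tracks positions and beads differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the proposal; none of these types exist in PDFBox. */
final class BeadSeparatorSketch {

    /** A word plus the index of the bead (column) it was taken from. */
    static final class Word {
        final String text;
        final int bead;

        Word(String text, int bead) {
            this.text = text;
            this.bead = bead;
        }
    }

    /**
     * Joins one rendered line. A word separator goes between every pair of
     * words; a bead separator is added as well whenever the bead index
     * differs from that of the previous word.
     */
    static String renderLine(List<Word> line, String wordSeparator, String beadSeparator) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < line.size(); i++) {
            Word current = line.get(i);
            if (i > 0) {
                if (current.bead != line.get(i - 1).bead) {
                    out.append(beadSeparator);
                }
                out.append(wordSeparator);
            }
            out.append(current.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Word> line = new ArrayList<>();
        line.add(new Word("left", 0));
        line.add(new Word("column", 0));
        line.add(new Word("right", 1));
        line.add(new Word("column", 1));

        // Empty bead separator: output is unchanged from the current behavior.
        System.out.println(renderLine(line, " ", ""));
        // Custom bead separator: the column boundary becomes visible.
        System.out.println(renderLine(line, " ", " |"));
    }
}
```

With an empty bead separator the line comes out exactly as it does today, 
which is why that would be a safe default.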

I unfortunately do not have time to work on this myself right now.   If someone 
else wants to run with this idea and try to implement it, that would be cool.

For most users, the default behavior of 'shouldSeparateByBeads==true' 
accomplishes what is needed because it tries to keep the text logically 
contiguous.  Are you sure this isn't what you want?


> Columns in text not extracted separately.  
> -------------------------------------------
>
>                 Key: PDFBOX-448
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-448
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Brian Carrier
>         Attachments: WBPaper00003120.pdf
>
>
> The paper that is attached to PDFBOX-80 has two columns of text, but the 
> extracted text is not separated by column.  Instead it combines the text in 
> each column on each line. 
> PDFTextStripper has a notion of columns and "articles / beads", but they are 
> not being used with this file.  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
