Re: PDFTextStripperByArea extracts text only from 1 region, despite several regions being defined

Ismael Hasan Tue, 21 Jul 2009 08:50:14 -0700

Hi Andreas,

thanks for answering. I am now loading the document as you suggested.


I will restate the question, since I have been doing some testing: I
understand that the code in my first message should return the same
result for both regions, since they are defined with the same
parameters, but it does not. When I test with more regions, text is
retrieved from only one of them:

I divide a page into 4 regions and add the regions to the stripper in
the following order:
1-upper left, 2-upper right, 3-lower left, 4-lower right.

After calling the "extractRegions" function, only the text for the
third one is retrieved.
If I do not add the third region, only the text for region 2 is retrieved.
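
For reference, here is a minimal sketch of how the four regions could be
computed. The class name "QuadrantRegions", the helper "quadrants" and the
612 x 792 page size are made up for illustration, and I am assuming that y
grows downward from the top of the page. It uses only java.awt.geom; the
call that would hand each rectangle to the stripper is shown as a comment.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

public class QuadrantRegions {

    // Hypothetical helper: splits a page into four equal quadrants.
    // The map keeps insertion order: upper left, upper right,
    // lower left, lower right.
    // Assumption: the origin is the top-left corner and y grows downward.
    static Map<String, Rectangle2D> quadrants(double pageWidth, double pageHeight) {
        double w = pageWidth / 2;
        double h = pageHeight / 2;
        Map<String, Rectangle2D> regions = new LinkedHashMap<String, Rectangle2D>();
        regions.put("upperLeft", new Rectangle2D.Double(0, 0, w, h));
        regions.put("upperRight", new Rectangle2D.Double(w, 0, w, h));
        regions.put("lowerLeft", new Rectangle2D.Double(0, h, w, h));
        regions.put("lowerRight", new Rectangle2D.Double(w, h, w, h));
        return regions;
    }

    public static void main(String[] args) {
        // 612 x 792 points is an assumed page size (US Letter).
        for (Map.Entry<String, Rectangle2D> e : quadrants(612, 792).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
            // With PDFBox, each region would then be registered with:
            // areaStripper.addRegion(e.getKey(), e.getValue());
        }
    }
}
```

Each region gets a distinct name, so getTextForRegion can be called once
per name after extractRegions.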


I think this behaviour is strange, and it may not be what is expected.
In the example you suggested,
'org.apache.pdfbox.examples.util.ExtractTextByArea', only one region
is defined, so maybe the tool is not intended to extract several
regions at a time.

Any answer will be appreciated,

thanks in advance,

Ismael

2009/7/21 Andreas Lehmkühler <[email protected]>:
> Hi Ismael,
>
> first of all, try to load the PDF with PDDocument doc = PDDocument.load(file).
> You don't have to parse the doc on your own. See
> org.apache.pdfbox.examples.util.ExtractTextByArea as an example for
> extracting text areas.
> Why do you try to extract the same region twice? Wouldn't it be easier to
> just copy the result string?
>
> BR
> Andreas Lehmkühler
>
> ----- Original Message --------
>
> Subject: PDFTextStripperByArea extracts text only from 1 region, despite
> several regions being defined
> Sent: Tue, 21 Jul 2009
> From: Ismael Hasan<[email protected]>
>
>> Hello. I have a problem with the class
>> "org.apache.pdfbox.util.PDFTextStripperByArea":
>>
>> If I add several regions to this class to extract the text from, it is
>> only retrieved from one of them. The example I built creates two
>> regions with the same values (with different names), adds them to the
>> text stripper, and uses the "extractRegions" function.
>>
>> I would really appreciate it if someone could tell me what I am doing
>> wrong, or whether this is a bug in the tool.
>>
>> Please see the code that produces this issue at the end of the
>> message; the final result buffers (localResult1 and localResult2) have
>> different content (one of them is empty). If you need a PDF document
>> to reproduce this, please ask me for it.
>>
>> Thanks in advance,
>> Ismael
>>
>>
>>
>> // Opening the document and getting the page
>> PDFParser parser = new PDFParser(new ByteArrayInputStream(documentInBytes));
>> parser.parse();
>> PDDocument doc = parser.getPDDocument();
>> PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(pageNumber);
>>
>> // Creating the stripper
>> PDFTextStripperByArea areaStripper = new PDFTextStripperByArea();
>>
>> // Creation and addition of the regions to the stripper
>> Rectangle2D rectangle = new Rectangle2D.Float();
>> rectangle.setRect(0, 0, 500, 100);
>> areaStripper.addRegion("1", rectangle);
>>
>> Rectangle2D rectangle2 = new Rectangle2D.Float();
>> rectangle2.setRect(0, 0, 500, 100);
>> areaStripper.addRegion("2", rectangle2);
>>
>> // Extracting the regions and getting the results
>> areaStripper.extractRegions(page);
>> String localResult1 = areaStripper.getTextForRegion("1");
>> String localResult2 = areaStripper.getTextForRegion("2");
>>
>
> --- End of Original Message ----
>
>
