Hello Daniel,

 

The text ("2007") we need to extract is written in the CLRDingbats font. Can you 
please give us any pointers so that we don't get a garbage value when extracting 
it from the PDF (attached for your reference)?

 

Thanks!

 

From: Daniel Manzke [mailto:[email protected]] 
Sent: Monday, December 29, 2008 5:36 PM
To: Duseja, Sushil; [email protected]
Cc: Rally, Menka
Subject: Re: Garbage Output

 

Sorry, this would be a job for one of the PDFBox developers. So far I have only 
been doing some support on the list and don't have much know-how about this.

 

So I can just have a look in the evening and maybe I will find a solution. ;)

 

 

Daniel

 

2008/12/29 Duseja, Sushil <[email protected]>

If possible, can you please let us know your contact number to discuss this 
issue?

 

Thanks!

 

From: Daniel Manzke [mailto:[email protected]] 
Sent: Monday, December 29, 2008 5:12 PM
To: Duseja, Sushil; [email protected]
Cc: Rally, Menka
Subject: Re: Garbage Output

 

Hi,

 

I've just added this line:

 

//after stripper.extractRegions();

stripper.getText(document);

 

After doing this I got some text for the regions, but it seems that this text 
comes from page 1. Have you found an example of how to use the Stripper? 
Maybe someone else could help you, since I don't have any knowledge of the 
Stripper.

 

If I have some time in the evening I will give it another test. 

 

 

Bye,

Daniel

2008/12/29 Duseja, Sushil <[email protected]>

Hello Daniel,

 

I tried the compiled version you sent, with no luck.

 

I tried running a Java program (for text extraction) with PDFBox 0.7.3 and 0.8 
in the classpath separately. With 0.8, I am not able to fetch anything. With 
0.7.3, however, I could extract all values apart from "Year of Form", whose 
value is garbage (À¾´»), which is why you recommended using 0.8.
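
A small diagnostic may help narrow this down. The sketch below is not taken 
from the attached programs; it only lists the code points of the garbage 
string next to those of the expected "2007". The class and method names 
(GarbageCodePoints, codePoints, describe) are made up for illustration. The 
two sets of codes differ, and one possible explanation is that the font uses 
its own encoding and the raw codes are passed through unmapped, but that is an 
assumption that would need to be checked against the PDF itself.

```java
import java.util.Locale;

// Diagnostic sketch (hypothetical helper): list the Unicode code points
// of an extracted string so they can be compared with the expected text.
class GarbageCodePoints {

    // Returns the code points of the given text, in order.
    static int[] codePoints(String text) {
        return text.codePoints().toArray();
    }

    // Formats each code point as "U+XXXX (decimal)", comma separated.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        for (int cp : codePoints(text)) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(String.format(Locale.ROOT, "U+%04X (%d)", cp, cp));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The garbage value reported in this thread, written with escapes
        // so the source file encoding does not matter.
        String garbage = "\u00C0\u00BE\u00B4\u00BB";
        String expected = "2007";

        // extracted: U+00C0 (192), U+00BE (190), U+00B4 (180), U+00BB (187)
        System.out.println("extracted: " + describe(garbage));
        // expected:  U+0032 (50), U+0030 (48), U+0030 (48), U+0037 (55)
        System.out.println("expected:  " + describe(expected));
    }
}
```

Running the same helper on the text returned for the "Year of Form" region 
would show whether 0.7.3 and a newer build return the same codes.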

 

Note: the Java programs and my PDF are attached for your reference. The names 
of the Java files are self-explanatory and indicate which version they use. 
The contents of these Java files are exactly the same.

 

Please advise.

 

Thanks!

 

From: Daniel Manzke [mailto:[email protected]] 
Sent: Monday, December 29, 2008 2:45 PM


To: Duseja, Sushil
Cc: [email protected]; Rally, Menka
Subject: Re: Garbage Output

 

Just check out the latest source code and run Maven.

 

I will send you a compiled version.

 

 

Bye

2008/12/29 Duseja, Sushil <[email protected]>

Thanks Daniel.

 

Do you mean that I need to fetch the latest source code from the trunk of the 
Subversion repository? If not, how can I get the source code for 0.8?

 

I would really appreciate it if you could build me a compiled version. I hope I 
am not bothering you.

 

Thanking you in anticipation.

 

From: Daniel Manzke [mailto:[email protected]] 
Sent: Monday, December 29, 2008 1:41 PM


To: Duseja, Sushil
Cc: [email protected]; Rally, Menka
Subject: Re: Garbage Output

 

PDFBox is still in incubation and there is no 0.8 distribution yet. What you 
could do is download the source code and build it on your own. Then you could 
look at the code and debug where the garbage is produced. Or ask me and I will 
build you a compiled version.

 

 

Daniel

2008/12/29 Duseja, Sushil <[email protected]>

Thanks again for responding.

 

Can you please point me to the URL/location from which 0.8 version can be 
downloaded? 

 

I looked at http://sourceforge.net/project/showfiles.php?group_id=78314; 
however, it shows that the latest version is 0.7.3.

 

Thanks for your time.

 

From: Daniel Manzke [mailto:[email protected]] 
Sent: Monday, December 29, 2008 1:29 PM
To: Duseja, Sushil
Cc: [email protected]; Rally, Menka
Subject: Re: Garbage Output

 

Try checking out the latest development build, because 0.7.3 is outdated 
(from 2006). A lot of issues have been fixed in 0.8.

 

 

Bye,

daniel

2008/12/29 Duseja, Sushil <[email protected]>

Hello Daniel,

Thanks for the response.

I am using version 0.7.3.

Thanks!


-----Original Message-----
From: Daniel Manzke [mailto:[email protected]]
Sent: Friday, December 26, 2008 9:11 PM
To: [email protected]
Subject: Re: Garbage Output

Hi,
standard question. ;) Which version are you using?


Daniel

2008/12/26 Duseja, Sushil <[email protected]>

>  Hello,
>
>
>
> While extracting text from a PDF file (attached for your kind reference)
> using PDFBox, I get garbage output (*À¾´»*) for a special text value, "*2007*"
> (please see page 2); I can fetch other values correctly though.
>
> Is this an *encoding issue*? If so, can anyone please let me know how to
> fix it? If possible, please point me to some working examples.
>
>
>
> Thanks in advance.
>



--
Mit freundlichen Grüßen

Daniel Manzke



