Dear Sir/Madam,
I'm a Chinese student. I want to use PDFBox for some research on PDF 
extraction.
Right now the most important thing for me is to extract structural information 
from PDFs. I know PDFBox is very powerful, but I do not know how to get this 
information out of a PDF. I have already extracted the plain text from a PDF 
using PDFBox, but plain text alone does not meet my needs. For natural language 
processing I need to parse the PDF, so I need not only the text but also the 
PDF's structure, which means reading the operators (tags) such as Tj and Tm in 
the PDF. PDFBox has a lot of APIs, and I don't know how to get the values that 
belong to each operator in each PDF object. I would like to get structural 
information for every PDF object, such as the font and font size, so that I can 
derive patterns such as the largest font size and use them to find the "title" 
of each paper. For objects that have a content stream, I would like to decode 
the stream. Finally, I could obtain the patterns of the objects that have 
content streams and classify those objects to find the category I need.
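
To make the idea concrete, here is a minimal sketch of the heuristic I have in 
mind. It does not use PDFBox at all. It assumes the content stream has already 
been decoded into plain operator text, and the names TitleGuess, tokenize and 
largestText are my own placeholders, not part of any library. It only looks at 
the size operand of Tf and the string operand of Tj, so it ignores the scaling 
in Tm, hex strings, TJ arrays, escaped parentheses and font encodings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of PDFBox.
final class TitleGuess {

    // Splits an already-decoded content stream into tokens, keeping each
    // "(...)" string literal as one token. Escaped or nested parentheses
    // are not handled in this sketch.
    static List<String> tokenize(String stream) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < stream.length()) {
            char c = stream.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '(') {
                int end = stream.indexOf(')', i);
                if (end < 0) {
                    end = stream.length() - 1; // unterminated literal
                }
                tokens.add(stream.substring(i, end + 1));
                i = end + 1;
            } else {
                int start = i;
                while (i < stream.length()
                        && !Character.isWhitespace(stream.charAt(i))
                        && stream.charAt(i) != '(') {
                    i++;
                }
                tokens.add(stream.substring(start, i));
            }
        }
        return tokens;
    }

    // Returns the string shown by the Tj that was drawn at the largest
    // Tf font size, or null if the stream shows no text.
    static String largestText(String stream) {
        List<String> tokens = tokenize(stream);
        double currentSize = 0;
        double bestSize = -1;
        String best = null;
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals("Tf") && i >= 2) {
                // Operands are: /FontName size Tf
                currentSize = Double.parseDouble(tokens.get(i - 1));
            } else if (t.equals("Tj") && i >= 1
                    && tokens.get(i - 1).startsWith("(")) {
                if (currentSize > bestSize) {
                    bestSize = currentSize;
                    String s = tokens.get(i - 1);
                    int end = s.endsWith(")") ? s.length() - 1 : s.length();
                    best = s.substring(1, end);
                }
            }
        }
        return best;
    }
}
```

For a stream such as "BT /F1 12 Tf (body) Tj ET BT /F2 24 Tf (A Title) Tj ET" 
I would expect largestText to pick the string drawn at size 24. What I do not 
know is how to get this kind of operator and operand information out of PDFBox 
for a real PDF.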
Do you think it's possible?
Could you give me some examples of extracting from a PDF, especially extracting 
the objects that have a stream, finding the object with the largest font size, 
and decoding the stream? I hope you can provide me with some source code for 
extracting from PDFs using PDFBox, not just stripper.getText().
Thanks a billion! I hope to hear from you soon!
Sincerely,
 
dock CHEN
