Dear Sir/Madam,
I'm a Chinese student, and I want to use PDFBox for research on PDF
extraction.
Right now the most important task for me is to extract structural information
from PDFs. I know PDFBox is very powerful, but I do not know how to get this
information out of a PDF with it. I have already extracted the plain text from
a PDF using PDFBox, but plain text alone does not meet my needs. For natural
language processing I need to parse the PDF, so I need not only the text but
also the PDF's structure; that is, I need the operators in the content stream,
such as Tj and Tm, together with their values. PDFBox has a lot of APIs, and I
do not know which ones return the values for each operator in each PDF object.
I would like to get structural information such as the font and font size for
every PDF object, so that I can derive patterns such as the maximum font size
and use them to find the "title" of each paper. For objects that have a
content stream, I would like to decode the stream. Once I have the patterns of
the objects with content streams, I can classify the objects and find the
categories I need.
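To make the title heuristic concrete, here is a minimal sketch of what I have
in mind, written in plain Java with no PDFBox calls. TextRun, maxFontSize and
guessTitle are hypothetical names I made up for illustration, and the sample
values in main are invented. The part I do not know how to do is filling the
list of runs from a real PDF with PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

public class TitleHeuristic {

    /** Hypothetical container: one run of text plus the font it was drawn in. */
    static final class TextRun {
        final String text;
        final String fontName;
        final float fontSize;

        TextRun(String text, String fontName, float fontSize) {
            this.text = text;
            this.fontName = fontName;
            this.fontSize = fontSize;
        }
    }

    /** Returns the largest font size among the runs, or 0 for an empty list. */
    static float maxFontSize(List<TextRun> runs) {
        float max = 0f;
        for (TextRun run : runs) {
            max = Math.max(max, run.fontSize);
        }
        return max;
    }

    /** Joins, in input order, every run drawn in the largest font size. */
    static String guessTitle(List<TextRun> runs) {
        float max = maxFontSize(runs);
        List<String> parts = new ArrayList<>();
        for (TextRun run : runs) {
            // Small tolerance so that nearly equal sizes count as the same size.
            if (Math.abs(run.fontSize - max) < 0.01f) {
                parts.add(run.text);
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        // Invented sample data standing in for what PDFBox would provide.
        List<TextRun> runs = new ArrayList<>();
        runs.add(new TextRun("An Example", "Times-Bold", 18f));
        runs.add(new TextRun("Paper Title", "Times-Bold", 18f));
        runs.add(new TextRun("Some Author", "Times-Roman", 12f));
        runs.add(new TextRun("Abstract", "Times-Bold", 10f));
        System.out.println(guessTitle(runs));
    }
}
```

This only treats the largest font as the title, which is surely too simple for
real papers, but it shows the kind of per-object font information I need.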
Do you think this is possible?
Could you give me some examples of this kind of extraction, especially
extracting the objects that have a stream, finding the object with the largest
font size, and decoding the stream? I hope you can provide some source code
that extracts this from PDFs using PDFBox, beyond just stripper.getText().
Thanks a billion! I hope to hear from you soon.
Sincerely,
dock CHEN