RE: [PDFdev] unexpected stream content

Aandi Inston Sat, 25 Oct 2003 05:03:36 -0700

PDFdev is a service provided by PDFzone.com | http://www.pdfzone.com
_____________________________________________________________


> Sorry about the lack of understanding but I expected something
> more like: [(some text)ddd]. Have I had my assumptions totally wrong?

You are assuming way too much. Sometimes, often in fact, you will see
that. But that is just good luck.

You need to take each byte value, look up the font, and process
the encoding values.  You also have to deal with split strings and
out-of-order text. You will need to be very familiar with the
chapter on text/fonts in the PDF Reference in particular, but you'll
need to have read everything up to that point in detail too.
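
To make the per-byte lookup concrete, here is a rough Python sketch of
decoding the operand of a TJ operator. It is only an illustration under
heavy assumptions: FONT_MAPS, decode_string, decode_tj and the gap
threshold are hypothetical names and values, the font is assumed to be a
simple single-byte font, and the code-to-character table is assumed to
have been built already from the font's encoding information (which a
real file may not supply).

```python
# Hypothetical code-to-Unicode tables, one per font resource name.
# In a real PDF these would be derived from the font dictionary
# (/Encoding, /Differences, /ToUnicode) and may be missing entirely.
FONT_MAPS = {
    "F1": {0x01: "I", 0x02: "t", 0x03: "'", 0x04: "s"},
}


def decode_string(raw: bytes, font: str) -> str:
    """Map each byte of a string operand through the font's table."""
    table = FONT_MAPS.get(font, {})
    # Bytes with no known mapping become U+FFFD instead of being guessed.
    return "".join(table.get(code, "\ufffd") for code in raw)


def decode_tj(items, font: str, gap: float = -200) -> str:
    """Decode a TJ array given as a list of bytes objects and numbers.

    Numbers are position adjustments in thousandths of a text space
    unit; a negative number moves the next glyph further along the
    line. Treating a large negative adjustment as a word space is a
    heuristic (the threshold is an assumption), not something the file
    states.
    """
    out = []
    for item in items:
        if isinstance(item, bytes):
            out.append(decode_string(item, font))
        elif item <= gap:
            out.append(" ")
    return "".join(out)


if __name__ == "__main__":
    # One word split across two strings with a small kerning adjustment.
    print(decode_tj([b"\x01\x02", -20, b"\x03\x04"], "F1"))
```

Note that the bytes inside the parentheses only look like readable text
when the font's codes happen to coincide with ASCII, which is the "good
luck" case above.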

Believe me when I say extracting text from a PDF is one of the more
difficult problems, and may represent many months of work, even once
you have a full grasp of the problem. Bear in mind that you cannot
produce a perfect solution, partly because the encodings are not always
present, and partly because deducing reading order is guesswork.
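
To show why reading order is guesswork, here is a minimal Python sketch
of one common heuristic: group text fragments into lines by baseline,
then sort each line left to right. The function name reading_order, the
(x, y, text) fragment tuples and the line_tol tolerance are assumptions
made for this illustration, and the approach is known to fail on
multi-column pages, rotated text and tables.

```python
def reading_order(fragments, line_tol=2.0):
    """Guess a reading order for (x, y, text) fragments.

    Fragments whose baselines lie within line_tol of the first fragment
    of the current line are treated as one line. PDF user space
    normally has y increasing upwards, so lines are emitted from the
    highest y to the lowest.
    """
    lines = []  # list of (baseline_y, [(x, text), ...])
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tol:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    # Within each line, order fragments by their x position.
    return [
        "".join(text for _, text in sorted(items, key=lambda item: item[0]))
        for _, items in lines
    ]


if __name__ == "__main__":
    # Fragments deliberately listed out of order, as they may be in a file.
    page = [
        (100, 700, "century"),
        (10, 700, "It's the "),
        (60, 701, "21st "),
        (10, 680, "next line"),
    ]
    for line in reading_order(page):
        print(line)
```

The tolerance is what makes this a guess: too small and a superscript
starts its own line, too large and two close lines merge.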

Hence, unless you really have to do it yourself, or really want to
solve the issues (it's an interesting challenge), I would recommend
looking into

> I mean I do not know what all this text line should look like. I believe
> this should read "It's the 21st century" although again, I am not 100% sure.

> May I add it is my first attempt to decode PDF stuff,
> and I do feel like I am stepping into something new to put it mildly...

Not a good choice for a first attempt, I suspect...

Aandi


