I suppose I should clarify: I was not bemoaning a lack of support or
generosity. I was stating the situation as it stood at that point: one
gentleman had offered support and, through no fault of his own, raised more
questions than he answered. I did not mean to come off as impatient. I am always
thankful and never question the delivery of free knowledge.
________________________________________
From: Andreas Lehmkuehler [[email protected]]
Sent: Sunday, May 20, 2012 7:02 AM
To: [email protected]
Subject: Re: PDFBox View Post-Read, Pre-Conversion Stream

Hi,

On 20.05.2012 01:31, Hawkins, Thomas A. - Student wrote:
> I've asked this question a couple of times and I really need help - no one
> has really given me any type of answer that I can use. I've had answers but
> they point me in no positive direction.
No offense, but you have to be more patient; we are all volunteers ...

> I am converting pdf files to txt files (of course I lose the formatting),
> but I get horrible results converting to html and even worse to XML.
>
> So what I want to do is have the program either place a space between
> superscript exponents or place exponents in brackets.
>
> Is there any way for me to access the stream of data after the pdf is read,
> but before it is converted to a string? If I can find a way to do this
> then I can figure out how to edit the data to return the txt file I want.
It is not that easy.

- the information you are looking for is part of the so-called content stream
- that stream is processed within PDFStreamEngine#processStream [1]
- the main text processing is done in PDFStreamEngine#processEncodedText
- the PDF operator -> OperatorProcessor mapping can be found here [2]
- the class TextPosition doesn't have any information about text features like
  superscript
- you might have a look at the PDF specs [3]
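
Since TextPosition carries no superscript flag, one possible workaround is to
infer superscripts from glyph geometry: a glyph that is noticeably smaller than
the preceding normal text and sits above its baseline is probably an exponent.
The following is only a minimal sketch of that heuristic, not PDFBox code. The
classes Glyph and SuperscriptMarker and both thresholds are hypothetical, and
the sketch assumes the y coordinate grows upward (flip the comparison if your
coordinates grow downward).

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the per-glyph data a text extractor exposes. */
final class Glyph {
    final String text;      // decoded character(s)
    final float baselineY;  // baseline y coordinate; assumed to grow upward
    final float fontSize;   // font size in points

    Glyph(String text, float baselineY, float fontSize) {
        this.text = text;
        this.baselineY = baselineY;
        this.fontSize = fontSize;
    }
}

/** Hypothetical helper: wraps runs of likely superscript glyphs in brackets. */
final class SuperscriptMarker {
    // Assumed thresholds; tune them for your documents.
    private static final float SIZE_RATIO = 0.85f;     // superscript is smaller than this share of base size
    private static final float MIN_RISE_RATIO = 0.2f;  // and raised by at least this share of base size

    static String mark(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        Glyph base = null;        // last glyph classified as normal text
        boolean inSuper = false;  // true while inside a bracketed run
        for (Glyph g : line) {
            boolean isSuper = base != null
                    && g.fontSize < base.fontSize * SIZE_RATIO
                    && g.baselineY - base.baselineY > base.fontSize * MIN_RISE_RATIO;
            if (isSuper && !inSuper) {
                out.append('[');  // a superscript run starts
            }
            if (!isSuper && inSuper) {
                out.append(']');  // a superscript run ends
            }
            if (!isSuper) {
                base = g;         // normal text becomes the new reference
            }
            inSuper = isSuper;
            out.append(g.text);
        }
        if (inSuper) {
            out.append(']');      // close a run that reaches the end of the line
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph("x", 100f, 12f),
                new Glyph("2", 105f, 8f),
                new Glyph("+", 100f, 12f),
                new Glyph("y", 100f, 12f),
                new Glyph("1", 105f, 8f),
                new Glyph("0", 105f, 8f));
        System.out.println(mark(line)); // prints x[2]+y[10]
    }
}
```

In a real extractor the Glyph values would have to be filled from whatever
position and font size data the text-processing step provides, one line of
text at a time.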


> I am using the .NET port of pdfBox and I would appreciate some
> examples (preferably VB or C#) but Java was my first language and
> I'm sure I can knock the dust off of my knowledge.
As it is complicated enough to implement this stuff in Java, I guess
there won't be any approaches in VB or C#.

BR
Andreas Lehmkühler

[1]
http://svn.apache.org/repos/asf/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/util/PDFStreamEngine.java
[2]
http://svn.apache.org/repos/asf/pdfbox/trunk/pdfbox/src/main/resources/org/apache/pdfbox/resources/PDFTextStripper.properties
[3] http://www.adobe.com/de/devnet/pdf/pdf_reference.html
