I suppose I should clarify: I was not bemoaning a lack of support or generosity. I was stating the situation as it stood at that point: one gentleman had offered support and, through no fault of his own, added more questions than answers. I did not mean to come off as impatient. I am always thankful and never question the delivery of free knowledge.

________________________________________
From: Andreas Lehmkuehler [[email protected]]
Sent: Sunday, May 20, 2012 7:02 AM
To: [email protected]
Subject: Re: PDFBox View Post-Read, Pre-Conversion Stream
Hi,

On 20.05.2012 01:31, Hawkins, Thomas A. - Student wrote:
> I've asked this question a couple of times and I really need help - no one
> has really given me any type of answer that I can use. I've had answers but
> they point me in no positive direction.

No offense, but you have to be more patient, we are all volunteers ...

> I am converting pdf files to txt files (of course I lose the formatting),
> but I get horrible results converting to html and even worse to XML.
>
> So what I want to do is have the program either place a space between
> superscript exponents, or place exponents in brackets.
>
> Is there any way for me to access the stream of data after the pdf is read,
> but before it is converted to a string? If I can find a way to do this
> then I can figure out how to edit the data to return the txt file I want.

It is not that easy.

- the information you are looking for is part of the so-called content stream
- that stream is processed within PDFStreamEngine#processStream [1]
- the main text processing is done in PDFStreamEngine#processEncodedText
- the PDF operator -> OperatorProcessor mapping can be found here [2]
- the class TextPosition doesn't have any information about text features like superscript
- you might have a look at the PDF specs [3]

> I am using the .NET port of pdfBox and I would appreciate some
> examples (preferably VB or C#) but Java was my first language and
> I'm sure I can knock the dust off of my knowledge.

As it is complicated enough to implement this stuff in Java, I guess there won't be any approaches in VB or C#.

BR
Andreas Lehmkühler

[1] http://svn.apache.org/repos/asf/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/util/PDFStreamEngine.java
[2] http://svn.apache.org/repos/asf/pdfbox/trunk/pdfbox/src/main/resources/org/apache/pdfbox/resources/PDFTextStripper.properties
[3] http://www.adobe.com/de/devnet/pdf/pdf_reference.html
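
A possible starting point for the bracket idea, offered as a rough sketch and not as a PDFBox solution: since no superscript flag is available, one could guess it from geometry, treating a glyph as raised when it is noticeably smaller than the preceding normal glyph and sits higher on the page. The `Glyph` class below is a hypothetical stand-in for whatever per-character data the extraction step provides (the assumption is that a text, a vertical position and a font size can be obtained for each character), and the 0.8 and 0.2 thresholds are arbitrary guesses that would need tuning against real documents.

```java
import java.util.ArrayList;
import java.util.List;

public class SuperscriptMarker {

    /** Hypothetical per-character data; not a PDFBox class. */
    public static final class Glyph {
        final String text;
        final float baselineY; // measured from the top of the page: smaller means higher
        final float fontSize;

        public Glyph(String text, float baselineY, float fontSize) {
            this.text = text;
            this.baselineY = baselineY;
            this.fontSize = fontSize;
        }
    }

    /** Joins the glyphs of one line, wrapping runs that look raised in brackets. */
    public static String mark(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        Glyph base = null;       // most recent glyph considered to be on the normal baseline
        boolean inSuper = false; // true while inside an open bracket

        for (Glyph g : line) {
            // Heuristic (assumed thresholds): clearly smaller AND clearly higher than the base glyph.
            boolean raised = base != null
                    && g.fontSize < base.fontSize * 0.8f
                    && g.baselineY < base.baselineY - base.fontSize * 0.2f;

            if (raised && !inSuper) {
                out.append('[');
                inSuper = true;
            }
            if (!raised) {
                if (inSuper) {
                    out.append(']');
                    inSuper = false;
                }
                base = g;
            }
            out.append(g.text);
        }
        if (inSuper) {
            out.append(']');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("x", 100f, 12f));
        line.add(new Glyph("2", 95f, 7f)); // smaller and higher: treated as an exponent
        line.add(new Glyph("+", 100f, 12f));
        line.add(new Glyph("y", 100f, 12f));
        System.out.println(mark(line)); // prints x[2]+y
    }
}
```

The heuristic would misfire on footnote markers, small caps and similar layouts, so it is only meant to show where such a check could sit once the per-character positions are accessible.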

