Re: [iText-questions] NPE while Extracting text

2010-06-21 Thread Leonard Rosenthol
There are two ways to handle Type 3 encodings.

1) It's a newer Type3 and has an associated ToUnicode table - that's easy ;).

2) Look up the name of the glyph (the key in the CharProcs table) in the Adobe 
Glyph List (<http://en.wikipedia.org/wiki/Adobe_Glyph_List>), which maps 
standard names to Unicode values.

Leonard
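
A rough sketch of the two approaches above, in plain Java with no iText
dependency. The class name Type3Decoder, the decode method, and the
three-entry glyph table are made up for illustration. A real implementation
would load the full Adobe Glyph List and would read the ToUnicode CMap and
the Encoding /Differences array from the font dictionary; here both are
simply passed in as maps.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of iText.
class Type3Decoder {
    // Tiny excerpt of the Adobe Glyph List: glyph name -> Unicode code point.
    static final Map<String, Integer> AGL_EXCERPT = new HashMap<>();
    static {
        AGL_EXCERPT.put("space", 0x0020);
        AGL_EXCERPT.put("zero", 0x0030);
        AGL_EXCERPT.put("A", 0x0041);
    }

    // toUnicode: character code -> text (way 1), may be null.
    // codeToGlyphName: character code -> glyph name from the encoding (way 2), may be null.
    static String decode(byte[] codes,
                         Map<Integer, String> toUnicode,
                         Map<Integer, String> codeToGlyphName) {
        StringBuilder out = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF;
            // Way 1: a ToUnicode entry wins when there is one.
            String mapped = toUnicode == null ? null : toUnicode.get(code);
            if (mapped != null) {
                out.append(mapped);
                continue;
            }
            // Way 2: glyph name looked up in the glyph list.
            String glyphName = codeToGlyphName == null ? null : codeToGlyphName.get(code);
            Integer codePoint = glyphName == null ? null : AGL_EXCERPT.get(glyphName);
            // U+FFFD marks codes that cannot be mapped either way.
            out.appendCodePoint(codePoint != null ? codePoint : 0xFFFD);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> names = new HashMap<>();
        names.put(1, "A");
        names.put(2, "space");
        names.put(3, "zero");
        // Code 4 has no glyph name, so it comes out as U+FFFD.
        System.out.println(decode(new byte[] {1, 2, 3, 4}, null, names));
    }
}
```

Codes that cannot be mapped are emitted as U+FFFD rather than dropped, so the
output keeps the same number of characters as the input string.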

--
ThinkGeek and WIRED's GeekDad team up for the Ultimate 
GeekDad Father's Day Giveaway. ONE MASSIVE PRIZE to the 
lucky parental unit.  See the prize list and enter to win: 
http://p.sf.net/sfu/thinkgeek-promo
___
iText-questions mailing list
iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.itextpdf.com/book/
Check the site with examples before you ask questions: 
http://www.1t3xt.info/examples/
You can also search the keywords list: http://1t3xt.info/tutorials/keywords/


Re: [iText-questions] NPE while Extracting text

2010-06-21 Thread Kevin Day

The trick here is obtaining a mapping between the type 3 font glyphs and some
sort of encoded text.  There are several ways that this can be done, and
they are fairly well supported by the text parser - but type 3 fonts, as has
been mentioned, don't *usually* have this sort of mapping information.

I know a lot of the PDF specification, but I don't know all of it - and it's
quite possible that there is some mechanism for obtaining this sort of
mapping.  I guess the first thing to do is to ask whether Acrobat can figure
the text out for these fonts (can you highlight the text, then copy and paste
it into a text editor?).  If it can, then it's time to dig into the PDF spec
and figure out if there is some mapping strategy that isn't being handled by
CMapAwareDocumentFont.

What it sounds like to me is that the string that is passed into decode() is
actually correct.  Interestingly, looking at the font definition that you
provide, there is a dictionary entry for Encoding.  I think that this is
where careful reading of the PDF spec is going to be required - so here are
some resources to get you started:

Here's the spec:  http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf

Section 9.6.5 discusses type 3 font dictionaries.


I note that Type 3 fonts *can* have a ToUnicode entry.  And they have an
Encoding entry.  So these sure sound an awful lot like Type 1 fonts as far
as text extraction is concerned.  From a debugging perspective, I think that
the next step is to do a debug walkthrough with a document containing a
normal Type 1 font, and compare that with the walkthrough of your document
with the Type 3 font.  You may find that there's something subtle that can be
tweaked to make this work.

Please let me know what you find!
-- 
View this message in context: 
http://itext-general.2136553.n4.nabble.com/NPE-while-Extracting-text-tp2256512p2262853.html
Sent from the iText - General mailing list archive at Nabble.com.


Re: [iText-questions] NPE while Extracting text

2010-06-21 Thread Mike Marchywka






> Date: Mon, 21 Jun 2010 09:49:44 +0100
> From: b...@benshort.co.uk
> To: itext-questions@lists.sourceforge.net
> Subject: Re: [iText-questions] NPE while Extracting text
>
> Thanks very much for this information.
>
> Maybe you could offer me some direction on how to solve my problem?
>
> I need to parse PDF mobile phone bills. The information I require is
> the itemized data that is in a table format. Is this possible with
> itextpdf?


I know this won't help you, but let's be clear: PDF is NOT the format
of choice for DATA or INFORMATION. It is generally about
human readability, and while a document often has a describable structure,
everyone here tells me it is too complicated to include that in the
PDF file. If you have a choice, and have a cooperative relationship
with the source of the documents, you want an INFORMATION
format, not a bunch of pixels. "Scraping" HTML or PDF is
often done by people trying to extract information from artwork,
but you always need to make assumptions about the document
structure. If you want a robust means to do this,
at least work out some conventions with the document authors.

The great leap in information representation in going from
pictures to an alphabet is that fonts don't matter. You
probably want to extract the text and scrap the font
stuff. If text cannot be extracted easily from the PDF itself,
you need to reduce it to pixels and then extract with
OCR software. Or, get the document author to only include
the important stuff to begin with.


Re: [iText-questions] NPE while Extracting text

2010-06-21 Thread Ben Short
Thanks very much for this information.

Maybe you could offer me some direction on how to solve my problem?

I need to parse PDF mobile phone bills. The information I require is
the itemized data that is in a table format. Is this possible with
itextpdf?



Re: [iText-questions] NPE while Extracting text

2010-06-19 Thread 1T3XT info
Ben Short wrote:
> subType is /Type3
> 
> Does this help identify the problem?

Yes, but it doesn't bring us closer to a solution.

Type 3 fonts are "user defined fonts".

See for instance:
http://itextpdf.com/examples/index.php?page=example&id=200
In that example, 'delta' and 'sigma' shaped glyphs were defined, 
corresponding with the characters 'D' and 'S'. However, the example 
would also have worked if we'd used any other character.

Another example: we could define a glyph that looks like the symbol for 
'The Artist Formerly Known As Prince' to correspond with the character 
'P'. That's what Type 3 fonts are about: they can be used when a user 
needs a glyph that isn't provided in any other font.
Therefore it's very hard to extract that content: how are you going to 
know that the glyph corresponding with 'P' needs to be 'translated' to 
'The Artist Formerly Known As Prince'? I don't think there's a Unicode 
code point for that glyph.

I think you've hit a limitation regarding text extraction in general.
-- 
This answer is provided by 1T3XT BVBA
http://www.1t3xt.com/ - http://www.1t3xt.info



Re: [iText-questions] NPE while Extracting text

2010-06-18 Thread Ben Short
Hi,

I have debugged and found that in the displayPdfString method of the
PdfContentStreamProcessor class the string parameter is valid but it
is decoded to a string of the same length but all bytes are set to 0.

private void displayPdfString(PdfString string){

String unicode = decode(string);

Drilling down deeper the CMapAwareDocumentFont has no toUnicodeCmap so
in the decodeSingleCID method the cidbyte2uni is used. The cidbyte2uni
has a length of 255 chars which are all set to int 0.

cidbyte2uni is not populated because the uni2byte hashtable is empty.

I can then see that neither the fillEncoding method nor the doType1TT
method is called.

In the DocumentFont constructor the font variable has the following
in its hash map:

{/FontBBox=[-2, -9, 38, 40], /LastChar=121, /FontMatrix=[0.24, 0,
0, 0.24, 0, 0], /Type=/Font, /Resources=Dictionary, /CharProcs=134
0 R, /Encoding=72 0 R, /Subtype=/Type3, /Name=/C0HN2000T1X005000,
/Widths=135 0 R, /FirstChar=32}

baseFont is null

fontName is "Unspecified Font Name"

subType is /Type3

Does this help identify the problem?

Regards

Ben
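
The symptom described above (output of the same length as the input, but
every character 0) is what a byte-to-Unicode lookup produces when its table
was never filled in. The following is a minimal stand-alone sketch of that
effect, not the iText code: the class EmptyTableDemo, its decode method and
the 256-entry char table are stand-ins assumed for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a byte -> Unicode table lookup; not the iText implementation.
class EmptyTableDemo {
    // Maps each input byte through the table, one output char per input byte.
    static String decode(byte[] codes, char[] table) {
        StringBuilder out = new StringBuilder(codes.length);
        for (byte b : codes) {
            out.append(table[b & 0xFF]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Java zero-fills new arrays, so a table that is never populated maps everything to 0.
        char[] emptyTable = new char[256];
        byte[] input = "Total".getBytes(StandardCharsets.US_ASCII);
        String decoded = decode(input, emptyTable);
        System.out.println(decoded.length() == input.length);       // same length as the input
        System.out.println(decoded.chars().allMatch(c -> c == 0));  // but every char is 0
    }
}
```

If that is what is happening here, the input string can be perfectly valid
and the fix lies in populating the table from the font's Encoding entry.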





Re: [iText-questions] NPE while Extracting text

2010-06-18 Thread Ben Short
Hi Kevin,

I'm happy to dig in to the code. Can you point me to a place to start debugging?

Ben



Re: [iText-questions] NPE while Extracting text

2010-06-17 Thread Kevin Day

ok - most likely the font is using an encoding that we just don't have
support for yet.  The encodings are a bit of a hack right now, so these
unusual cases are tough to deal with.

If you are willing to dig in to the code, I can provide assistance.

- K
-- 
View this message in context: 
http://itext-general.2136553.n4.nabble.com/NPE-while-Extracting-text-tp2256512p2259568.html
Sent from the iText - General mailing list archive at Nabble.com.



Re: [iText-questions] NPE while Extracting text

2010-06-17 Thread Ben Short
OK so I changed the code to write the output of the Text Extraction to
a ByteArrayOutputStream. Looking at the contents of the
ByteArrayOutputStream I can see that most bytes have an int value of 0
and some have an int value of 32.
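
One quick way to make this kind of check is to count how often each byte
value occurs in the captured output. The sketch below assumes nothing about
iText: ByteHistogram and histogram are made-up names, and the four bytes
written to the stream are a stand-in for the real extractor output.

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical diagnostic helper for illustration only.
class ByteHistogram {
    // Counts occurrences of each unsigned byte value, sorted by value.
    static Map<Integer, Integer> histogram(byte[] data) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (byte b : data) {
            counts.merge(b & 0xFF, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Stand-in for the text extraction output: three NUL bytes and one space.
        out.write(new byte[] {0, 0, 32, 0}, 0, 4);
        System.out.println(histogram(out.toByteArray())); // prints {0=3, 32=1}
    }
}
```

A histogram dominated by 0 with a scattering of 32 would suggest that only
the space character is being mapped and every other glyph is decoded to NUL.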





Re: [iText-questions] NPE while Extracting text

2010-06-17 Thread Mark Storer
Err... why WordPad?  Is that another way of saying "all the bytes are 0x00" or 
what?  Did you try... opening it with Reader?  Or a PdfReader for that matter.

--Mark Storer
  Senior Software Engineer
  Cardiff.com
 
import legalese.Disclaimer;
Disclaimer DisCard = null;
 



Re: [iText-questions] NPE while Extracting text

2010-06-17 Thread Ben Short
Hi,

I downloaded and built the latest source code and the exception is no
longer thrown. Now I'm left with a file that's 101KB in size but shows
no content when opened in WordPad.

Am I missing something?

Ben



Re: [iText-questions] NPE while Extracting text

2010-06-17 Thread Ben Short
Hi Kevin,

Thanks for this. I'll give it a go tonight.

Ben



Re: [iText-questions] NPE while Extracting text

2010-06-16 Thread Kevin Day

Mark - FYI, basefont isn't required for Type3 fonts (or TrueType for that
matter).  I had the same reaction when I first ran into this issue, but the
spec never lies, right?  It just injects ambiguity and confusion.
-- 
View this message in context: 
http://itext-general.2136553.n4.nabble.com/NPE-while-Extracting-text-tp2256512p2258064.html
Sent from the iText - General mailing list archive at Nabble.com.


Re: [iText-questions] NPE while Extracting text

2010-06-16 Thread Kevin Day

ok - I ran into this issue myself a month or so ago.  It's been fixed in the
5.0.3 codebase (which is the current HEAD in SVN).

    /** Creates a new instance of DocumentFont */
    DocumentFont(PRIndirectReference refFont) {
        encoding = "";
        fontSpecific = false;
        this.refFont = refFont;
        fontType = FONT_TYPE_DOCUMENT;
        font = (PdfDictionary)PdfReader.getPdfObject(refFont);
        PdfName baseFont = font.getAsName(PdfName.BASEFONT);
        // * this is the fix: use a placeholder name when /BaseFont is absent
        fontName = baseFont != null ?
            PdfName.decodeName(baseFont.toString()) : "Unspecified Font Name";
        PdfName subType = font.getAsName(PdfName.SUBTYPE);

The statement marked with the * comment above is the fix.
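
To see the fallback in isolation, here is a minimal sketch in plain Java. It is not iText code: a Map<String, String> stands in for PdfDictionary, the names FontNameFallback and resolveFontName are made up for illustration, and the name decoding is reduced to stripping the leading slash (the real PdfName.decodeName does more than that, as far as I know).

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in sketch: a plain Map plays the role of iText's PdfDictionary.
// FontNameFallback and resolveFontName are hypothetical names, not iText API.
class FontNameFallback {
    static final String UNSPECIFIED = "Unspecified Font Name";

    /** Returns the /BaseFont value without its leading slash, or a placeholder if the key is absent. */
    static String resolveFontName(Map<String, String> fontDict) {
        String baseFont = fontDict.get("/BaseFont");
        if (baseFont == null) {
            return UNSPECIFIED; // Type 3 fonts may legitimately have no /BaseFont
        }
        // Simplified decoding: only the leading slash is removed.
        return baseFont.startsWith("/") ? baseFont.substring(1) : baseFont;
    }

    public static void main(String[] args) {
        Map<String, String> type3 = new HashMap<>();
        type3.put("/Subtype", "/Type3");
        type3.put("/Name", "/C0HN2000T1X005000");
        System.out.println(resolveFontName(type3));

        Map<String, String> type1 = new HashMap<>();
        type1.put("/Subtype", "/Type1");
        type1.put("/BaseFont", "/Helvetica");
        System.out.println(resolveFontName(type1));
    }
}
```

The point of the design is simply that the lookup result is tested before anything is called on it, so a missing key yields a usable name instead of an exception.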

- K
-- 
View this message in context: 
http://itext-general.2136553.n4.nabble.com/NPE-while-Extracting-text-tp2256512p2258059.html
Sent from the iText - General mailing list archive at Nabble.com.


Re: [iText-questions] NPE while Extracting text

2010-06-16 Thread Mark Storer
According to the PDF Reference (ISO 32000), BaseFont is a required
field.  So where did this PDF come from?

It's possible that the font's FontDescriptor has a /FontName entry
(also required, and required to match BaseFont).  I suspect Adobe has
bullet-proofed their applications to the point where either will
suffice.  Heck, they could even query the font program directly if it
was available.

So while I wouldn't go so far as to call it a bug in iText, we certainly
could be more durable in the face of malformed PDFs.  At the very least,
we could throw something more meaningful than an NPE.
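
As a rough illustration of what "more meaningful than an NPE" could look like, here is a sketch that assumes a plain Map in place of PdfDictionary. DictionaryChecks, requireEntry and MalformedFontException are hypothetical names chosen for this example, not anything that exists in iText.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not iText API: fail with a descriptive message
// when a dictionary lacks a key the caller depends on.
class DictionaryChecks {
    /** Hypothetical exception type for this sketch. */
    static class MalformedFontException extends RuntimeException {
        MalformedFontException(String message) {
            super(message);
        }
    }

    /** Returns the value for key, or throws an exception naming the key and listing the keys present. */
    static String requireEntry(Map<String, String> dict, String key) {
        String value = dict.get(key);
        if (value == null) {
            // TreeMap gives a stable, sorted key listing in the message.
            throw new MalformedFontException(
                "Font dictionary has no " + key + " entry; keys present: "
                + new TreeMap<>(dict).keySet());
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> font = new HashMap<>();
        font.put("/Type", "/Font");
        font.put("/Subtype", "/Type3");
        try {
            requireEntry(font, "/BaseFont");
        } catch (MalformedFontException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

An error like this tells the user which entry is missing and what the dictionary does contain, which is most of what the debugging in this thread had to dig out by hand.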

Ben?  Do you have Acrobat 9 Pro?  It has a PDF syntax check that would
(hopefully) help you reach this sort of conclusion much faster, and
wouldn't require you to fold/spindle/mutilate your PDF for public
consumption.

--Mark Storer
  Senior Software Engineer
  Cardiff.com
 
import legalese.Disclaimer;
Disclaimer DisCard = null;
 

> -----Original Message-----
> From: Ben Short [mailto:b...@benshort.co.uk]
> Sent: Wednesday, June 16, 2010 3:12 PM
> To: Post all your questions about iText here
> Subject: Re: [iText-questions] NPE while Extracting text
> 
> On Mark's advice I downloaded the source code from the 5.0.2 branch
> and dug a little deeper...
> 
> The NPE is thrown on the following line of the DocumentFont
> constructor.
> 
> fontName =
> PdfName.decodeName(font.getAsName(PdfName.BASEFONT).toString());
> 
> It turns out that font.getAsName(PdfName.BASEFONT) returns null.
> 
> font, which is a PdfDictionary, has the following values in its hash
> map...
> 
> {/FontBBox=[-2, -9, 38, 40], /LastChar=121, /FontMatrix=[0.24, 0,
> 0, 0.24, 0, 0], /Type=/Font, /Resources=Dictionary, /CharProcs=134
> 0 R, /Encoding=72 0 R, /Subtype=/Type3, /Name=/C0HN2000T1X005000,
> /Widths=135 0 R, /FirstChar=32}
> 
> You'll notice that there is no key for /BaseFont.
> 
> I'm not sure I can post the whole pdf to a public news group... I will
> see if I can cut it down to a page or so of non-sensitive data first.
> 
> Ben
> 


Re: [iText-questions] NPE while Extracting text

2010-06-16 Thread Ben Short
On Mark's advice I downloaded the source code from the 5.0.2 branch
and dug a little deeper...

The NPE is thrown on the following line of the DocumentFont constructor.

fontName = PdfName.decodeName(font.getAsName(PdfName.BASEFONT).toString());

It turns out that font.getAsName(PdfName.BASEFONT) returns null.

font, which is a PdfDictionary, has the following values in its hash map...

{/FontBBox=[-2, -9, 38, 40], /LastChar=121, /FontMatrix=[0.24, 0,
0, 0.24, 0, 0], /Type=/Font, /Resources=Dictionary, /CharProcs=134
0 R, /Encoding=72 0 R, /Subtype=/Type3, /Name=/C0HN2000T1X005000,
/Widths=135 0 R, /FirstChar=32}

You'll notice that there is no key for /BaseFont.
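
A quick way to sanity-check a dictionary like this one is to compare its keys with the entries the spec marks as required for Type 3 fonts. The sketch below does that in plain Java. The REQUIRED list is my reading of Table 112 in ISO 32000-1 (section 9.6.5), so treat it as an assumption to verify against the spec, and Type3KeyCheck and missingRequired are made-up names.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical checker, not iText API.
class Type3KeyCheck {
    // Keys marked "Required" for Type 3 font dictionaries, as I read
    // Table 112 of ISO 32000-1. Note that /BaseFont is not among them.
    static final Set<String> REQUIRED = new LinkedHashSet<>(Arrays.asList(
        "/Type", "/Subtype", "/FontBBox", "/FontMatrix", "/CharProcs",
        "/Encoding", "/FirstChar", "/LastChar", "/Widths"));

    /** Returns the required keys that are absent from the given key set, sorted. */
    static Set<String> missingRequired(Set<String> present) {
        Set<String> missing = new TreeSet<>(REQUIRED);
        missing.removeAll(present);
        return missing;
    }

    public static void main(String[] args) {
        // The keys from the dictionary dump above.
        Set<String> present = new LinkedHashSet<>(Arrays.asList(
            "/FontBBox", "/LastChar", "/FontMatrix", "/Type", "/Resources",
            "/CharProcs", "/Encoding", "/Subtype", "/Name", "/Widths", "/FirstChar"));
        System.out.println("missing required keys: " + missingRequired(present));
        System.out.println("/BaseFont required for Type 3? " + REQUIRED.contains("/BaseFont"));
    }
}
```

If that reading of the table is right, the dictionary is complete for a Type 3 font, and code that assumes /BaseFont is always present is the part that needs to change.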

I'm not sure I can post the whole pdf to a public news group... I will
see if I can cut it down to a page or so of non-sensitive data first.

Ben


On 16 June 2010 16:31, Kevin Day  wrote:
>
> I will add to Mark's (excellent) stream of consciousness analysis:
>
> The next step is to see what the name of the font resource is that is
> causing the problem.  Then, load RUPS and dig into the page dictionary and
> find the entry for that font resource - given what Mark is showing in the
> source, most likely the font resource isn't defined.
>
> There's always a question with this sort of thing about 'why does Acrobat
> show the file OK' - the answer is that Acrobat is very permissive - there
> are all sorts of problems like this that it may silently ignore.  Older
> versions of Acrobat often show problems that newer versions ignore.
>
> And then, of course, there's always the possibility that there's a problem
> with iText, and the strategy for looking up font resources isn't quite in
> sync with the PDF spec.  I don't *think* that is the case here, but it's
> always possible.
>
> If you do wind up providing the PDF so we can take a look, be sure to also
> provide the font file that you think may be involved.
>
> - K
> --
> View this message in context: 
> http://itext-general.2136553.n4.nabble.com/NPE-while-Extracting-text-tp2256512p2257485.html
> Sent from the iText - General mailing list archive at Nabble.com.
>


Re: [iText-questions] NPE while Extracting text

2010-06-16 Thread Kevin Day

I will add to Mark's (excellent) stream of consciousness analysis:

The next step is to see what the name of the font resource is that is
causing the problem.  Then, load RUPS and dig into the page dictionary and
find the entry for that font resource - given what Mark is showing in the
source, most likely the font resource isn't defined.

There's always a question with this sort of thing about 'why does Acrobat
show the file OK' - the answer is that Acrobat is very permissive - there
are all sorts of problems like this that it may silently ignore.  Older
versions of Acrobat often show problems that newer versions ignore.

And then, of course, there's always the possibility that there's a problem
with iText, and the strategy for looking up font resources isn't quite in
sync with the PDF spec.  I don't *think* that is the case here, but it's
always possible.

If you do wind up providing the PDF so we can take a look, be sure to also
provide the font file that you think may be involved.

- K
-- 
View this message in context: 
http://itext-general.2136553.n4.nabble.com/NPE-while-Extracting-text-tp2256512p2257485.html
Sent from the iText - General mailing list archive at Nabble.com.


Re: [iText-questions] NPE while Extracting text

2010-06-15 Thread Mark Storer
http://itext.svn.sourceforge.net/viewvc/itext/trunk/src/core/com/itextpdf/text/pdf/DocumentFont.java?revision=4515&view=markup


> java.lang.NullPointerException
>   at com.itextpdf.text.pdf.DocumentFont.<init>(DocumentFont.java:114)

108  DocumentFont(PRIndirectReference refFont) {
109      encoding = "";
110      fontSpecific = false;
111      this.refFont = refFont;
112      fontType = FONT_TYPE_DOCUMENT;
113      font = (PdfDictionary)PdfReader.getPdfObject(refFont);
114      PdfName baseFont = font.getAsName(PdfName.BASEFONT);  <-- boom.


That means PdfReader.getPdfObject(refFont) returned null.  Having a look
over there...

http://itext.svn.sourceforge.net/viewvc/itext/trunk/src/core/com/itextpdf/text/pdf/PdfReader.java?revision=4507&view=markup

(lines 812 - 846, I'll let you look it up)

If refFont == null, you get a null, and if
ref.getReader().getPdfObject(ref.getNumber) returns null, you get null
(but that Should Not Happen).

So someone passed in null.  Which leads us up the call stack:

> at com.itextpdf.text.pdf.CMapAwareDocumentFont.<init>(CMapAwareDocumentFont.java:79)

78   public CMapAwareDocumentFont(PRIndirectReference refFont) {
79       super(refFont);

(I'll let you figure out the links too.  Start at
http://itext.svn.sourceforge.net/viewvc/itext/trunk/src/core/com/itextpdf/text/pdf/
and work your way down from there).

Again, looks like someone passed in a null to the constructor.

> at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$SetTextFont.invoke(PdfContentStreamProcessor.java:591)

(and I had to look at a previous revision of the file to get the line
numbers to make sense:
http://itext.svn.sourceforge.net/viewvc/itext/trunk/src/core/com/itextpdf/text/pdf/parser/PdfContentStreamProcessor.java?revision=4410&view=markup)

585  private static class SetTextFont implements ContentOperator{
586      public void invoke(PdfContentStreamProcessor processor, PdfLiteral operator, ArrayList operands) {
587          PdfName fontResourceName = (PdfName)operands.get(0);
588          float size = ((PdfNumber)operands.get(1)).floatValue();
589
590          PdfDictionary fontsDictionary = processor.resources.getAsDict(PdfName.FONT);
591          CMapAwareDocumentFont font = new CMapAwareDocumentFont((PRIndirectReference)fontsDictionary.get(fontResourceName));

So fontsDictionary.get(fontResourceName) (in all probability) returned a
null.  Smells like a Bad PDF to me.  May we see it?
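
To make the two places where that lookup can come back empty explicit, here is a small sketch in plain Java. Nested Maps stand in for the resources and font dictionaries, and FontResourceLookup and describeLookup are hypothetical names for illustration only, not iText API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not iText API: report which step of a
// resources -> /Font -> font-name lookup failed instead of passing null along.
class FontResourceLookup {
    static String describeLookup(Map<String, Map<String, String>> resources,
                                 String fontResourceName) {
        Map<String, String> fonts = resources.get("/Font");
        if (fonts == null) {
            return "resources have no /Font dictionary";
        }
        String reference = fonts.get(fontResourceName);
        if (reference == null) {
            return "/Font dictionary has no entry " + fontResourceName;
        }
        return fontResourceName + " -> " + reference;
    }

    public static void main(String[] args) {
        Map<String, String> fonts = new HashMap<>();
        fonts.put("/F1", "12 0 R");
        Map<String, Map<String, String>> resources = new HashMap<>();
        resources.put("/Font", fonts);

        System.out.println(describeLookup(resources, "/F1"));
        System.out.println(describeLookup(resources, "/F2"));
        System.out.println(describeLookup(new HashMap<>(), "/F1"));
    }
}
```

Checking each step separately turns an educated guess about what was null into a definite answer, which is the same thing stepping through in a debugger would give you.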


And hopefully folks will learn something from this
byte-array-output-stream-of-consciousness debug session.  In particular:

1) Use the Source, Luke.  Reach out with your browser.
2) iText's source is available on the web at
http://itext.svn.sourceforge.net/viewvc/itext/trunk/...
2.1) But adding the source to your classpath in Eclipse et al is better
while you're debugging.  Stepping into the code will TELL YOU whether
something is null or not, instead of making an educated guess as I have
here.


--Mark Storer
  Senior Software Engineer
  Cardiff.com
 
import legalese.Disclaimer;
Disclaimer DisCard = null;
 

> -----Original Message-----
> From: Ben Short [mailto:b...@benshort.co.uk]
> Sent: Tuesday, June 15, 2010 1:36 PM
> To: itext-questions@lists.sourceforge.net
> Subject: [iText-questions] NPE while Extracting text
> 
> Hi,
> 
> I'm trying to use iText 5.0.2 to extract the text from a pdf file
> using the following code...
> 
> PdfReader reader = new PdfReader("C:/development/May.pdf");
> PdfReaderContentParser parser = new PdfReaderContentParser(reader);
> PrintWriter out = new PrintWriter(System.out);
> TextExtractionStrategy strategy;
> for (int i = 1; i <= reader.getNumberOfPages(); i++) {
>     strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
>     out.println(strategy.getResultantText());
> }
> 
> When I run this code I get the following exception.
> 
> java.lang.NullPointerException
>   at com.itextpdf.text.pdf.DocumentFont.<init>(DocumentFont.java:114)
>   at com.itextpdf.text.pdf.CMapAwareDocumentFont.<init>(CMapAwareDocumentFont.java:79)
>   at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$SetTextFont.invoke(PdfContentStreamProcessor.java:591)
>   at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:226)
>   at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:380)
>   at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:41)
> 
> I believe that this is something to do with the font not being
> available?
> 
> I have used www.identifont.com and thin

[iText-questions] NPE while Extracting text

2010-06-15 Thread Ben Short
Hi,

I'm trying to use iText 5.0.2 to extract the text from a pdf file
using the following code...

    PdfReader reader = new PdfReader("C:/development/May.pdf");
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    PrintWriter out = new PrintWriter(System.out);
    TextExtractionStrategy strategy;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
        out.println(strategy.getResultantText());
    }

When I run this code I get the following exception.

java.lang.NullPointerException
    at com.itextpdf.text.pdf.DocumentFont.<init>(DocumentFont.java:114)
    at com.itextpdf.text.pdf.CMapAwareDocumentFont.<init>(CMapAwareDocumentFont.java:79)
    at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$SetTextFont.invoke(PdfContentStreamProcessor.java:591)
    at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:226)
    at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:380)
    at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:41)

I believe that this is something to do with the font not being available?

I have used www.identifont.com and think that the font is Heldustry.
Should this not be available on my machine if Acrobat Reader can read
the file?

Can anyone give me some help making this text extraction work?

Kind Regards

Ben Short
