Re: [iText-questions] NPE while Extracting text
There are two ways to handle Type 3 encodings:

1) It's a newer Type3 font and has an associated ToUnicode table - that's easy ;).
2) Use the name of the glyph (the key in the CharProcs table) against the Adobe Glyph List (<http://en.wikipedia.org/wiki/Adobe_Glyph_List>), which maps standard names to Unicode values.

Leonard

-----Original Message-----
From: Kevin Day [mailto:ke...@trumpetinc.com]
Sent: Monday, June 21, 2010 5:52 PM
To: itext-questions@lists.sourceforge.net
Subject: Re: [iText-questions] NPE while Extracting text

> The trick here is obtaining a mapping between the type 3 font glyphs
> and some sort of encoded text. [...]
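Leonard's second approach can be sketched roughly as follows. This is only an illustration under stated assumptions: GlyphNameDecoder and its decode method are hypothetical names, not iText API; the map passed in stands for the font's /Encoding /Differences array; and the glyph list is a tiny hand-picked subset of the Adobe Glyph List rather than the full table.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: decode Type 3 character codes through glyph names (hypothetical helper, not iText API). */
final class GlyphNameDecoder {
    // A tiny hand-picked subset of the Adobe Glyph List; the real list has thousands of entries.
    private static final Map<String, Character> AGL_SUBSET = new HashMap<>();
    static {
        AGL_SUBSET.put("space", ' ');
        AGL_SUBSET.put("zero", '0');
        AGL_SUBSET.put("one", '1');
        AGL_SUBSET.put("period", '.');
        AGL_SUBSET.put("A", 'A');
        AGL_SUBSET.put("sterling", '\u00A3');
    }

    /**
     * codeToGlyphName stands in for the font's /Encoding /Differences array:
     * character code -> glyph name (the same names that key /CharProcs).
     */
    static String decode(byte[] codes, Map<Integer, String> codeToGlyphName) {
        StringBuilder out = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF;                     // PDF string bytes are unsigned codes
            String glyphName = codeToGlyphName.get(code);
            Character ch = glyphName == null ? null : AGL_SUBSET.get(glyphName);
            out.append(ch != null ? ch : '\uFFFD');  // U+FFFD marks codes we cannot map
        }
        return out.toString();
    }
}
```

A font whose glyphs carry non-standard names (for example "g12") would still fall through to U+FFFD here, which is the limitation discussed elsewhere in this thread.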
Re: [iText-questions] NPE while Extracting text
The trick here is obtaining a mapping between the type 3 font glyphs and some sort of encoded text. There are several ways that this can be done, and they are fairly well supported by the text parser - but type 3 fonts, as has been mentioned, don't *usually* have this sort of mapping information.

I know a lot of the PDF specification, but I don't know all of it - and it's quite possible that there is some mechanism for obtaining this sort of mapping.

I guess the first thing to do is to ask whether Acrobat can figure the text out for these fonts (can you highlight the text, then copy and paste it into a text editor?). If it can, then it's time to dig into the PDF spec and figure out if there is some mapping strategy that isn't being handled by CMapAwareDocumentFont.

What it sounds like to me is that the string that is passed into decode() is actually correct. Interestingly, looking at the font definition that you provide, there is a dictionary entry for Encoding.

I think that this is where careful reading of the PDF spec is going to be required - so here are some resources to get you started:

Here's the spec: http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf

Section 9.6.5 discusses type 3 font dictionaries. I note that Type 3 fonts *can* have a ToUnicode entry. And they have an Encoding entry. So these sure sound an awful lot like Type 1 fonts as far as text extraction is concerned.

From a debugging perspective, I think that the next step is to do a debug walkthrough with a document containing a normal Type 1 font, and compare that with the walkthrough of your document with the Type 3 font. You may find that there's something subtle that can be tweaked to make this work.

Please let me know what you find!

--
View this message in context: http://itext-general.2136553.n4.nabble.com/NPE-while-Extracting-text-tp2256512p2262853.html
Sent from the iText - General mailing list archive at Nabble.com.
-- ThinkGeek and WIRED's GeekDad team up for the Ultimate GeekDad Father's Day Giveaway. ONE MASSIVE PRIZE to the lucky parental unit. See the prize list and enter to win: http://p.sf.net/sfu/thinkgeek-promo ___ iText-questions mailing list iText-questions@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/itext-questions Buy the iText book: http://www.itextpdf.com/book/ Check the site with examples before you ask questions: http://www.1t3xt.info/examples/ You can also search the keywords list: http://1t3xt.info/tutorials/keywords/
Re: [iText-questions] NPE while Extracting text
> Date: Mon, 21 Jun 2010 09:49:44 +0100
> From: b...@benshort.co.uk
> To: itext-questions@lists.sourceforge.net
> Subject: Re: [iText-questions] NPE while Extracting text
>
> Thanks very much for this information.
>
> Maybe you could offer me some direction of how to solve my problem?
>
> I need to parse pdf mobile phone bills. the information i require is
> the itemized data that is in a table format. Is this possible with
> itextpdf?

I know this won't help you, but let's be clear: PDF is NOT the format of choice for DATA or INFORMATION. It is generally about human readability, and while this often has a describable structure, everyone here tells me it is too complicated to include that in the PDF file. If you have a choice, and have a cooperative relationship with the source of the documents, you want an INFORMATION format, not a bunch of pixels.

"Scraping" HTML or PDF is often done by people trying to extract information from artwork, but you always need to make assumptions about the document structure. If you want a robust means to do this, at least work out some conventions with the document authors.

The great leap in information representation in going from pictures to an alphabet is that fonts don't matter. You probably want to extract the text and scrap the font stuff. If text cannot be extracted easily from the PDF itself, you need to reduce it to pixels and then extract with OCR software. Or, get the document author to only include the important stuff to begin with.
Re: [iText-questions] NPE while Extracting text
Thanks very much for this information.

Maybe you could offer me some direction on how to solve my problem?

I need to parse PDF mobile phone bills. The information I require is the itemized data that is in a table format. Is this possible with itextpdf?

On 19 June 2010 08:44, 1T3XT info wrote:
> Type 3 fonts are "user defined fonts". [...]
> I think you've hit a limitation regarding text extraction in general.
Re: [iText-questions] NPE while Extracting text
Ben Short wrote:
> subType is /Type3
>
> Does this help identify the problem?

Yes, but it doesn't bring us closer to a solution.

Type 3 fonts are "user defined fonts".

See for instance:
http://itextpdf.com/examples/index.php?page=example&id=200
In that example, a 'delta' and 'sigma' shaped glyph was defined, corresponding with the characters 'D' and 'S'. However, the example would also have worked if we'd used any other character.

Another example: we could define a glyph that looks like the symbol for 'The Artist Formerly Known As Prince' to correspond with the character 'P'. That's what Type 3 fonts are about: they can be used when a user needs a glyph that isn't provided in any other font.

Therefore it's very hard to extract that content: how are you going to know that the glyph corresponding with 'P' needs to be 'translated' to 'The Artist Formerly Known As Prince'? I don't think there's a Unicode code point for that glyph.

I think you've hit a limitation regarding text extraction in general.
--
This answer is provided by 1T3XT BVBA
http://www.1t3xt.com/ - http://www.1t3xt.info
Re: [iText-questions] NPE while Extracting text
Hi,

I have debugged and found that in the displayPdfString method of the PdfContentStreamProcessor class the string parameter is valid, but it is decoded to a string of the same length with all bytes set to 0.

    private void displayPdfString(PdfString string){
        String unicode = decode(string);

Drilling down deeper, the CMapAwareDocumentFont has no toUnicodeCmap, so in the decodeSingleCID method the cidbyte2uni array is used. cidbyte2uni has a length of 255 chars, which are all set to int 0.

cidbyte2uni is not populated because the uni2byte hashtable is empty.

I can then see that the fillEncoding method is not called, and nor is doType1TT.

In the DocumentFont constructor the font variable has the following in its hash map:

    {/FontBBox=[-2, -9, 38, 40], /LastChar=121, /FontMatrix=[0.24, 0, 0, 0.24, 0, 0], /Type=/Font, /Resources=Dictionary, /CharProcs=134 0 R, /Encoding=72 0 R, /Subtype=/Type3, /Name=/C0HN2000T1X005000, /Widths=135 0 R, /FirstChar=32}

baseFont is null
fontName is "Unspecified Font Name"
subType is /Type3

Does this help identify the problem?

Regards

Ben

On 18 June 2010 11:01, Ben Short wrote:
> Hi Kevin,
>
> I'm happy to dig in to the code. Can you point me to a place to start
> debugging?
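The symptom Ben describes (output of the same length as the input, but every character 0) is what a table-driven single-byte decode produces when its lookup table was never filled. The sketch below only illustrates that mechanism; ByteToUnicodeTable and its methods are made-up names, not the iText source, and the table size of 256 is an assumption chosen so that every unsigned byte has a slot.

```java
/** Sketch of a single-byte lookup decode; names are illustrative, not the iText source. */
final class ByteToUnicodeTable {
    // Java zero-fills new arrays, so every entry starts out as U+0000.
    private final char[] table = new char[256];

    /** Registers one mapping, as a populated encoding would do for each character code. */
    void map(int code, char unicode) {
        table[code & 0xFF] = unicode;
    }

    /** Looks up every byte; codes that were never mapped come back as U+0000. */
    String decode(byte[] codes) {
        StringBuilder out = new StringBuilder(codes.length);
        for (byte b : codes) {
            out.append(table[b & 0xFF]);
        }
        return out.toString();
    }
}
```

Decoding the bytes {72, 105} with an empty table gives two NUL characters; after map(72, 'H') and map(105, 'i') the same bytes decode to "Hi". That is why the interesting question is which step was supposed to populate the table for this font.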
Re: [iText-questions] NPE while Extracting text
Hi Kevin,

I'm happy to dig in to the code. Can you point me to a place to start debugging?

Ben

On 18 June 2010 00:04, Kevin Day wrote:
> ok - most likely the font is using an encoding that we just don't have
> support for yet. [...]
>
> If you are willing to dig in to the code, I can provide assistance.
Re: [iText-questions] NPE while Extracting text
ok - most likely the font is using an encoding that we just don't have support for yet. The encodings are a bit of a hack right now, so these unusual cases are tough to deal with.

If you are willing to dig in to the code, I can provide assistance.

- K
--
View this message in context: http://itext-general.2136553.n4.nabble.com/NPE-while-Extracting-text-tp2256512p2259568.html
Re: [iText-questions] NPE while Extracting text
OK, so I changed the code to write the output of the text extraction to a ByteArrayOutputStream. Looking at the contents of the ByteArrayOutputStream, I can see that most bytes have an int value of 0 and some have an int value of 32.

On 17 June 2010 23:12, Mark Storer wrote:
> Err... why wordpad? Is that another way of saying "all the bytes are 0x00"
> or what? Did you try... opening it with Reader? Or a PdfReader for that
> matter.
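A quick way to make the check Ben describes repeatable is to count how often each byte value occurs in the captured output. This is a minimal sketch using only the JDK; ByteHistogram is a made-up helper name, and how the extraction output gets into the stream is left out because that code is not shown in the thread.

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: tally byte values in captured extraction output (hypothetical helper). */
final class ByteHistogram {
    /** Returns unsigned byte value -> number of occurrences, sorted by value. */
    static Map<Integer, Integer> count(ByteArrayOutputStream out) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (byte b : out.toByteArray()) {
            counts.merge(b & 0xFF, 1, Integer::sum);  // treat bytes as unsigned 0..255
        }
        return counts;
    }
}
```

If the result contains only the keys 0 and 32, as reported here, the spaces are coming through while every other character code is being decoded to NUL.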
Re: [iText-questions] NPE while Extracting text
Err... why wordpad? Is that another way of saying "all the bytes are 0x00" or what? Did you try... opening it with Reader? Or a PdfReader for that matter.

--Mark Storer
Senior Software Engineer
Cardiff.com

import legalese.Disclaimer;
Disclaimer DisCard = null;

> -----Original Message-----
> From: Ben Short [mailto:b...@benshort.co.uk]
> Sent: Thursday, June 17, 2010 2:47 PM
> To: Post all your questions about iText here
> Subject: Re: [iText-questions] NPE while Extracting text
>
> Hi,
>
> I downloaded and built the latest source code and the exception is no
> longer thrown. Now I'm left with a file that's 101KB in size but shows
> no content when opened in wordpad.
>
> Am I missing something?
Re: [iText-questions] NPE while Extracting text
Hi,

I downloaded and built the latest source code and the exception is no longer thrown. Now I'm left with a file that's 101KB in size but shows no content when opened in wordpad.

Am I missing something?

Ben

On 17 June 2010 09:08, Ben Short wrote:
> Hi Kevin,
>
> Thats for this. I'll give it a go tonight.
Re: [iText-questions] NPE while Extracting text
Hi Kevin,

Thanks for this. I'll give it a go tonight.

Ben

On 17 June 2010 01:17, Kevin Day wrote:
> Mark - FYI, basefont isn't required for Type3 fonts (or TrueType for that
> matter). [...]
Re: [iText-questions] NPE while Extracting text
Mark - FYI, basefont isn't required for Type3 fonts (or TrueType for that matter). I had the same reaction when I first ran into this issue, but the spec never lies, right? It just injects ambiguity and confusion.
--
View this message in context: http://itext-general.2136553.n4.nabble.com/NPE-while-Extracting-text-tp2256512p2258064.html
Re: [iText-questions] NPE while Extracting text
ok - I ran into this issue myself a month or so ago. It's been fixed in the 5.0.3 codebase (which is the current HEAD in SVN).

    /** Creates a new instance of DocumentFont */
    DocumentFont(PRIndirectReference refFont) {
        encoding = "";
        fontSpecific = false;
        this.refFont = refFont;
        fontType = FONT_TYPE_DOCUMENT;
        font = (PdfDictionary)PdfReader.getPdfObject(refFont);
        PdfName baseFont = font.getAsName(PdfName.BASEFONT);
        fontName = baseFont != null ? PdfName.decodeName(baseFont.toString()) : "Unspecified Font Name"; // * this is the line with the fix
        PdfName subType = font.getAsName(PdfName.SUBTYPE);

The line marked with the comment above is the fix: when /BaseFont is absent, fontName falls back to a placeholder instead of dereferencing null.

- K
Re: [iText-questions] NPE while Extracting text
According to the PDF Reference (ISO 32000), BaseFont is a required field. So where did this PDF come from?

It's possible that the font's FontDescriptor has a /FontName entry (also required, and required to match BaseFont). I suspect Adobe has bullet-proofed their applications to the point where either will suffice. Heck, they could even query the font program directly if it was available.

So while I wouldn't go so far as to call it a bug in iText, we certainly could be more durable in the face of malformed PDFs. At the very least, we could throw something more meaningful than an NPE.

Ben? Do you have Acrobat 9 Pro? It has a PDF syntax check that would (hopefully) help you reach this sort of conclusion much faster, and wouldn't require you to fold/spindle/mutilate your PDF for public consumption.

--Mark Storer
Senior Software Engineer
Cardiff.com

import legalese.Disclaimer;
Disclaimer DisCard = null;

> -----Original Message-----
> From: Ben Short [mailto:b...@benshort.co.uk]
> Sent: Wednesday, June 16, 2010 3:12 PM
> To: Post all your questions about iText here
> Subject: Re: [iText-questions] NPE while Extracting text
>
> On Mark's advice I downloaded the source code from the 5.0.2 branch
> and dug a little deeper...
>
> The NPE is thrown on the following line of the DocumentFont constructor.
>
>     fontName = PdfName.decodeName(font.getAsName(PdfName.BASEFONT).toString());
>
> It turns out that font.getAsName(PdfName.BASEFONT) returns null.
>
> font, which is a PdfDictionary, has the following values in its hash
> map...
>
>     {/FontBBox=[-2, -9, 38, 40], /LastChar=121, /FontMatrix=[0.24, 0,
>      0, 0.24, 0, 0], /Type=/Font, /Resources=Dictionary, /CharProcs=134
>      0 R, /Encoding=72 0 R, /Subtype=/Type3, /Name=/C0HN2000T1X005000,
>      /Widths=135 0 R, /FirstChar=32}
>
> You'll notice that there is no key for /BaseFont.
>
> I'm not sure I can post the whole pdf to a public news group... I will
> see if I can cut it down to a page or so of non-sensitive data first.
>
> Ben
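Mark's idea of falling back to the FontDescriptor's /FontName when /BaseFont is missing could look roughly like the sketch below. It is only an illustration under stated assumptions: the font dictionary is modelled as a plain java.util.Map rather than iText's PdfDictionary, the names FontNameFallback and resolveFontName are made up for this example, and trying /Name as a third option is an assumption that nobody in the thread proposed. The final placeholder string is the one used in the 5.0.3 fix.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in: font dictionaries are plain Maps here, not iText PdfDictionary objects. */
final class FontNameFallback {

    static final String PLACEHOLDER = "Unspecified Font Name";

    /**
     * Picks a display name for a font dictionary, trying in order:
     * /BaseFont, then /FontDescriptor -> /FontName, then /Name (assumption),
     * and finally a placeholder so the caller never sees null.
     */
    static String resolveFontName(Map<String, Object> font) {
        Object baseFont = font.get("BaseFont");
        if (baseFont instanceof String) {
            return (String) baseFont;
        }
        Object descriptor = font.get("FontDescriptor");
        if (descriptor instanceof Map) {
            Object fontName = ((Map<?, ?>) descriptor).get("FontName");
            if (fontName instanceof String) {
                return (String) fontName;
            }
        }
        Object name = font.get("Name");
        if (name instanceof String) {
            return (String) name;
        }
        return PLACEHOLDER;
    }

    public static void main(String[] args) {
        // Mirrors the relevant keys of the Type 3 dictionary from the report: no /BaseFont, but a /Name.
        Map<String, Object> type3 = new LinkedHashMap<>();
        type3.put("Subtype", "Type3");
        type3.put("Name", "C0HN2000T1X005000");
        System.out.println(resolveFontName(type3));
    }
}
```

Whether a real implementation should prefer /Name over the placeholder for Type 3 fonts is a design choice; the sketch only shows that each step can be checked for null before it is dereferenced.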
Re: [iText-questions] NPE while Extracting text
On Mark's advice I downloaded the source code from the 5.0.2 branch and dug a little deeper...

The NPE is thrown on the following line of the DocumentFont constructor.

    fontName = PdfName.decodeName(font.getAsName(PdfName.BASEFONT).toString());

It turns out that font.getAsName(PdfName.BASEFONT) returns null.

font, which is a PdfDictionary, has the following values in its hash map...

    {/FontBBox=[-2, -9, 38, 40], /LastChar=121, /FontMatrix=[0.24, 0, 0, 0.24, 0, 0],
     /Type=/Font, /Resources=Dictionary, /CharProcs=134 0 R, /Encoding=72 0 R,
     /Subtype=/Type3, /Name=/C0HN2000T1X005000, /Widths=135 0 R, /FirstChar=32}

You'll notice that there is no key for /BaseFont.

I'm not sure I can post the whole pdf to a public news group... I will see if I can cut it down to a page or so of non-sensitive data first.

Ben

On 16 June 2010 16:31, Kevin Day wrote:
> I will add to Mark's (excellent) stream of consciousness analysis:
>
> The next step is to see what the name of the font resource is that is
> causing the problem. Then, load RUPS and dig into the page dictionary and
> find the entry for that font resource - given what Mark is showing in the
> source, most likely the font resource isn't defined.
>
> There's always a question with this sort of thing about 'why does Acrobat
> show the file OK' - the answer is that Acrobat is very permissive - there
> are all sorts of problems like this that it may silently ignore. Older
> versions of Acrobat often show problems that newer versions ignore.
>
> And then, of course, there's always the possibility that there's a problem
> with iText, and the strategy for looking up font resources isn't quite in
> sync with the PDF spec. I don't *think* that is the case here, but it's
> always possible.
>
> If you do wind up providing the PDF so we can take a look, be sure to also
> provide the font file that you think may be involved.
>
> - K
Re: [iText-questions] NPE while Extracting text
I will add to Mark's (excellent) stream of consciousness analysis:

The next step is to see what the name of the font resource is that is causing the problem. Then, load RUPS and dig into the page dictionary and find the entry for that font resource - given what Mark is showing in the source, most likely the font resource isn't defined.

There's always a question with this sort of thing about 'why does Acrobat show the file OK' - the answer is that Acrobat is very permissive - there are all sorts of problems like this that it may silently ignore. Older versions of Acrobat often show problems that newer versions ignore.

And then, of course, there's always the possibility that there's a problem with iText, and the strategy for looking up font resources isn't quite in sync with the PDF spec. I don't *think* that is the case here, but it's always possible.

If you do wind up providing the PDF so we can take a look, be sure to also provide the font file that you think may be involved.

- K
Re: [iText-questions] NPE while Extracting text
http://itext.svn.sourceforge.net/viewvc/itext/trunk/src/core/com/itextpdf/text/pdf/DocumentFont.java?revision=4515&view=markup

> java.lang.NullPointerException
>     at com.itextpdf.text.pdf.DocumentFont.<init>(DocumentFont.java:114)

    108 DocumentFont(PRIndirectReference refFont) {
    109     encoding = "";
    110     fontSpecific = false;
    111     this.refFont = refFont;
    112     fontType = FONT_TYPE_DOCUMENT;
    113     font = (PdfDictionary)PdfReader.getPdfObject(refFont);
    114     PdfName baseFont = font.getAsName(PdfName.BASEFONT);   <-- boom.

That means PdfReader.getPdfObject(refFont) returned null. Having a look over there...

http://itext.svn.sourceforge.net/viewvc/itext/trunk/src/core/com/itextpdf/text/pdf/PdfReader.java?revision=4507&view=markup
(lines 812 - 846, I'll let you look it up)

If refFont == null, you get a null, and if ref.getReader().getPdfObject(ref.getNumber()) returns null, you get null (but that Should Not Happen). So someone passed in null. Which leads us up the call stack:

>     at com.itextpdf.text.pdf.CMapAwareDocumentFont.<init>(CMapAwareDocumentFont.java:79)

    78 public CMapAwareDocumentFont(PRIndirectReference refFont) {
    79     super(refFont);

(I'll let you figure out the links too. Start at http://itext.svn.sourceforge.net/viewvc/itext/trunk/src/core/com/itextpdf/text/pdf/ and work your way down from there).

Again, looks like someone passed in a null to the constructor.
>     at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$SetTextFont.invoke(PdfContentStreamProcessor.java:591)

(and I had to look at a previous revision of the file to get the line numbers to make sense: http://itext.svn.sourceforge.net/viewvc/itext/trunk/src/core/com/itextpdf/text/pdf/parser/PdfContentStreamProcessor.java?revision=4410&view=markup)

    585 private static class SetTextFont implements ContentOperator{
    586     public void invoke(PdfContentStreamProcessor processor, PdfLiteral operator, ArrayList operands) {
    587         PdfName fontResourceName = (PdfName)operands.get(0);
    588         float size = ((PdfNumber)operands.get(1)).floatValue();
    589
    590         PdfDictionary fontsDictionary = processor.resources.getAsDict(PdfName.FONT);
    591         CMapAwareDocumentFont font = new CMapAwareDocumentFont((PRIndirectReference)fontsDictionary.get(fontResourceName));

So fontsDictionary.get(fontResourceName) (in all probability) returned a null.

Smells like a Bad PDF to me. May we see it?

And hopefully folks will learn something from this byte-array-output-stream-of-consciousness debug session. In particular:

1) Use the Source, Luke. Reach out with your browser.
2) iText's source is available on the web at http://itext.svn.sourceforge.net/viewvc/itext/trunk/...
2.1) But adding the source to your classpath in Eclipse et al is better while you're debugging. Stepping into the code will TELL YOU whether something is null or not, instead of making an educated guess as I have here.

--Mark Storer
Senior Software Engineer
Cardiff.com

import legalese.Disclaimer;
Disclaimer DisCard = null;

> -----Original Message-----
> From: Ben Short [mailto:b...@benshort.co.uk]
> Sent: Tuesday, June 15, 2010 1:36 PM
> To: itext-questions@lists.sourceforge.net
> Subject: [iText-questions] NPE while Extracting text
>
> Hi,
>
> I'm trying to use iText 5.0.2 to extract the text from a pdf file
> using the following code...
>
>     PdfReader reader = new PdfReader("C:/development/May.pdf");
>     PdfReaderContentParser parser = new PdfReaderContentParser(reader);
>     PrintWriter out = new PrintWriter(System.out);
>     TextExtractionStrategy strategy;
>     for (int i = 1; i <= reader.getNumberOfPages(); i++) {
>         strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
>         out.println(strategy.getResultantText());
>     }
>
> When I run this code I get the following exception.
>
> java.lang.NullPointerException
>     at com.itextpdf.text.pdf.DocumentFont.<init>(DocumentFont.java:114)
>     at com.itextpdf.text.pdf.CMapAwareDocumentFont.<init>(CMapAwareDocumentFont.java:79)
>     at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$SetTextFont.invoke(PdfContentStreamProcessor.java:591)
>     at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:226)
>     at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:380)
>     at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:41)
>
> I believe that this is something to do with the font not being available?
>
> I have used www.identifont.com and thin
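The walk-through above ends at a dictionary lookup that can return null. A defensive version of such a lookup, one that names the missing font resource instead of failing later with a bare NPE, might look like the following sketch. The maps stand in for iText's PdfDictionary, and FontResourceLookup, requireFont and MissingFontResourceException are hypothetical names invented for this example, not iText API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: plain Maps stand in for the page's /Font resource dictionary. */
final class FontResourceLookup {

    /** Hypothetical exception type; not part of iText. */
    static final class MissingFontResourceException extends RuntimeException {
        MissingFontResourceException(String message) {
            super(message);
        }
    }

    /**
     * Returns the font dictionary registered under fontResourceName, or throws
     * an exception that says which resource name could not be resolved.
     */
    static Map<String, Object> requireFont(Map<String, Map<String, Object>> fontsDictionary,
                                           String fontResourceName) {
        if (fontsDictionary == null) {
            throw new MissingFontResourceException(
                    "Page resources have no /Font dictionary; cannot resolve /" + fontResourceName);
        }
        Map<String, Object> font = fontsDictionary.get(fontResourceName);
        if (font == null) {
            throw new MissingFontResourceException(
                    "Font resource /" + fontResourceName
                            + " is not defined in the page's /Font dictionary (defined: "
                            + fontsDictionary.keySet() + ")");
        }
        return font;
    }

    public static void main(String[] args) {
        // TreeMap keeps the key order stable in the error message.
        Map<String, Map<String, Object>> fonts = new TreeMap<>();
        fonts.put("F1", new HashMap<>());
        try {
            requireFont(fonts, "F2"); // F2 was never registered
        } catch (MissingFontResourceException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A message of this kind would have pointed at the resource name straight away, which is the information Kevin asks for before opening the file in RUPS.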
[iText-questions] NPE while Extracting text
Hi,

I'm trying to use iText 5.0.2 to extract the text from a pdf file using the following code...

    PdfReader reader = new PdfReader("C:/development/May.pdf");
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    PrintWriter out = new PrintWriter(System.out);
    TextExtractionStrategy strategy;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
        out.println(strategy.getResultantText());
    }

When I run this code I get the following exception.

    java.lang.NullPointerException
        at com.itextpdf.text.pdf.DocumentFont.<init>(DocumentFont.java:114)
        at com.itextpdf.text.pdf.CMapAwareDocumentFont.<init>(CMapAwareDocumentFont.java:79)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$SetTextFont.invoke(PdfContentStreamProcessor.java:591)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:226)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:380)
        at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:41)

I believe that this is something to do with the font not being available?

I have used www.identifont.com and think that the font is Heldustry. Should this not be available on my machine if Acrobat Reader can read the file?

Can anyone give me some help making this text extraction work?

Kind Regards

Ben Short