Re: [iText-questions] extracting text from pdfs with japanese data

Paulo Soares Wed, 17 Dec 2008 08:37:24 -0800

Your PDF has inline images and PdfContentParser doesn't support them.

Paulo
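
To check whether a given page is affected, a rough sketch like the one below may help. It scans an already decoded content stream (for example the bytes returned by PdfReader.getPageContent) for the BI ... ID operator pair that introduces an inline image in PDF syntax. The class and method names are hypothetical, and the whitespace tokenizing is a naive heuristic, not how PdfContentParser works: a string or binary run that happens to contain these tokens would give a false positive.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of iText.
class InlineImageCheck {

    /**
     * Returns true if the decoded content stream appears to contain an
     * inline image, i.e. a standalone BI token later followed by ID.
     */
    public static boolean hasInlineImage(byte[] decodedContent) {
        // ISO-8859-1 maps every byte to one char, so no bytes are lost.
        String text = new String(decodedContent, StandardCharsets.ISO_8859_1);
        // Split on the PDF whitespace characters (NUL, TAB, LF, FF, CR, SP).
        String[] tokens = text.split("[\\x00\\t\\n\\f\\r ]+");
        boolean sawBeginImage = false;
        for (String token : tokens) {
            if (token.equals("BI")) {
                sawBeginImage = true;
            } else if (sawBeginImage && token.equals("ID")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] withImage = "q BI /W 1 /H 1 /BPC 8 /CS /G ID x EI Q"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] textOnly = "BT /F1 12 Tf (Hello) Tj ET"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(hasInlineImage(withImage)); // true
        System.out.println(hasInlineImage(textOnly));  // false
    }
}
```

If this reports true for a page, the binary image data between ID and EI is a plausible cause of a parse error such as the one reported for content1.pdf.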


> -----Original Message-----
> From: Hoppe, Michael [mailto:[email protected]] 
> Sent: Wednesday, December 17, 2008 4:13 PM
> To: Post all your questions about iText here
> Subject: Re: [iText-questions] extracting text from pdfs with 
> japanese data
> 
> Hi all,
> 
> Attached are the PDFs I had problems with (I sent them 
> once before):
> 
> content1.pdf gives: java.io.IOException: '>' not expected at 
> file pointer 39040
> 
> tic_dogu2.pdf gives: java.lang.NullPointerException, because the 
> font is not embedded in the PDF
> 
> Text from content1.pdf can be extracted with the Adobe 
> viewer bean (another open source library that we don't want 
> to use for our project for various reasons), so I don't think 
> there is anything wrong with the file itself.
> 
> Greetings
> 
> Michael
> 
> Dr. Michael Hoppe
> ePublishing & eScience
> Development & Applied Research
> Phone +49 7247 808-251
> Fax +49 7247 808-133
> [email protected]
> 
> 
> FIZ Karlsruhe
> Hermann-von-Helmholtz-Platz 1
> 76344 Eggenstein-Leopoldshafen, Germany
> 
> www.fiz-karlsruhe.de <http://www.fiz-karlsruhe.de/> 
> 
> From: Kevin Day [mailto:[email protected]] 
> Sent: Wednesday, December 17, 2008 15:31
> To: IText Questions
> Subject: Re: [iText-questions] extracting text from pdfs with 
> japanese data
> 
> CMapAwareDocumentFont handles this parsing via the CMap class, 
> which encapsulates the parsing behind an object and makes it 
> a lot easier to deal with.
> 
> I think the biggest issue here is actually finding the 
> appropriate CMap data byte stream (either from embedded data 
> in the PDF, or from the file system). Right now, locating 
> the CMap information is a weak point in the content parser.
> 
> If the CMap data is included in a jar on the classpath, then 
> the CMap could absolutely be read from the jar.
> 
> Can the OP please send a PDF that demonstrates the issue? 
> I'll take a look at the font information and see how tough it 
> would be to add this type of lookup if ToUnicode isn't available.
> 
> - K
> 
> ----------------------- Original Message -----------------------
> 
> From: "Paulo Soares" <[email protected]>
> To: "Post all your questions about iText here" <[email protected]>
> Date: Tue, 16 Dec 2008 09:55:36 -0000
> Subject: Re: [iText-questions] extracting text from pdfs with 
> japanese data
> 
> 
> There's code in PdfEncodings to parse the CMaps and convert 
> to/from Unicode. 
> The font contains the CMap name.
> 
> Paulo
> 
> ----- Original Message ----- 
> From: "1T3XT info" <[email protected]>
> To: "Post all your questions about iText here" <[email protected]>
> Sent: Tuesday, December 16, 2008 9:19 AM
> Subject: Re: [iText-questions] extracting text from pdfs with 
> japanese data
> 
> 
> Hoppe, Michael wrote:
> > The CMap files are included in the iTextAsianCmaps.jar. So couldn't they
> > be read from that jar in case there is no font information in the PDF?
> 
> I'm just thinking out loud here, I didn't dive into the problem yet,
> but: do you think it's possible for iText to find which CMap file is to
> be inspected based on the font information available in the PDF?
> 
> As Kevin already said: this part of iText is pretty new. We're all
> excited about it, but for the moment it's all highly experimental.
> -- 
> This answer is provided by 1T3XT BVBA
> http://www.1t3xt.com/ - http://www.1t3xt.info
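
The idea raised in the quoted thread, reading a CMap from a jar on the classpath when the PDF itself carries no usable mapping, could be sketched roughly as follows. This uses only the standard Java class loader. The class name, the method name, and in particular the resource directory are assumptions for illustration; check the actual layout of iTextAsianCmaps.jar before relying on the path. The CMap name (here UniJIS-UCS2-H, one of the predefined Adobe CMaps) would come from the font's encoding entry, as Paulo notes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of iText.
class CMapResourceLookup {

    // ASSUMPTION: directory of the CMap files inside the jar.
    static final String CMAP_DIR = "com/lowagie/text/pdf/fonts/cmaps/";

    /**
     * Returns the raw bytes of the named CMap resource, or null if no
     * such resource is visible on the classpath.
     */
    public static byte[] loadCMap(String cmapName) {
        ClassLoader loader = CMapResourceLookup.class.getClassLoader();
        try (InputStream in = loader.getResourceAsStream(CMAP_DIR + cmapName)) {
            if (in == null) {
                return null; // jar missing, or name not found in it
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = loadCMap("UniJIS-UCS2-H");
        System.out.println(data == null
                ? "CMap not found on classpath"
                : "Loaded " + data.length + " bytes");
    }
}
```

Returning null rather than throwing lets a caller try the classpath only as a fallback, after embedded data such as a ToUnicode entry has been ruled out.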


