Hi Kevin,
 
Sorry for the delay on my side as well; I was on vacation until today.
The txt file you attached to your last mail does not show any Japanese 
characters, only gibberish (I am using a Unicode editor, so it should show 
up correctly).
The output should look like the txt file I attached to this mail.
Or did I misunderstand you?
 
Thanks + greetings
 
Michael
 
Dr. Michael Hoppe
ePublishing & eScience
Development & Applied Research
Phone +49 7247 808-251
Fax +49 7247 808-133
[email protected]


FIZ Karlsruhe
Hermann-von-Helmholtz-Platz 1
76344 Eggenstein-Leopoldshafen, Germany

www.fiz-karlsruhe.de <http://www.fiz-karlsruhe.de/> 

________________________________

From: Kevin Day [mailto:[email protected]]
Sent: Mon 05.01.2009 17:46
To: IText Questions
Subject: Re: [iText-questions] extracting text from pdfs with japanese data


Sorry for the delay.  Because this file is relatively large, it is highly 
unlikely that my extraction will exactly equal yours.  The best I can do is 
generate a Unicode file of my own; then you can take a look at it and see how 
it compares to yours.
 
I've attached the result of the PdfContentReaderTool run against tic_dougu2.pdf 
- this contains more information than just the text extraction.
 
Remember:  The objective at this point is to make sure that the characters are 
getting decoded properly - I have not done anything with spatial analysis here, 
so the words are all going to be merged together.  I just need you to confirm 
that the actual characters being exported are correct.  If they are correct, 
then the character mapping strategy I added to DocumentFont is working, and we 
can look into what it will take to derive character widths, etc...
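
The check described above (are the same characters coming out, regardless of how the words are spaced?) could be sketched roughly as follows. This is not iText code; the class name ExtractionCompare and both method names are made up for illustration, and the sketch assumes both files are saved as UTF-8. It strips all whitespace from both texts and reports the position of the first character that differs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of iText: compares two extracted texts
// character by character while ignoring whitespace differences.
public class ExtractionCompare {

    /** Drops whitespace, space separators and a byte order mark (U+FEFF). */
    public static String stripWhitespace(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .filter(cp -> !Character.isWhitespace(cp)
                    && !Character.isSpaceChar(cp)
                    && cp != 0xFEFF)
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    /**
     * Returns the index (counted in code points, after whitespace removal)
     * of the first difference, or -1 if both texts contain the same characters.
     */
    public static int firstMismatch(String expected, String actual) {
        int[] e = stripWhitespace(expected).codePoints().toArray();
        int[] a = stripWhitespace(actual).codePoints().toArray();
        int n = Math.min(e.length, a.length);
        for (int i = 0; i < n; i++) {
            if (e[i] != a[i]) {
                return i;
            }
        }
        // Same prefix: equal only if the lengths match as well.
        return e.length == a.length ? -1 : n;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is the reference file, args[1] the new extraction,
        // both encoded as UTF-8.
        String expected = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        String actual = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);
        int pos = firstMismatch(expected, actual);
        System.out.println(pos < 0 ? "characters match" : "first difference at character " + pos);
    }
}
```

Counting in code points rather than Java chars keeps the reported position meaningful for characters outside the Basic Multilingual Plane. If the reference file is UTF-16 instead, the charset passed to the String constructor would need to change accordingly.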
 
- K

 
----------------------- Original Message -----------------------
  
From: "Hoppe, Michael" <[email protected]>
To: "Post all your questions about iText here" 
<[email protected]>
Cc: 
Date: Fri, 19 Dec 2008 08:08:08 +0100
Subject: Re: [iText-questions] extracting text from pdfs with japanese data
  

Kevin,

 

Unfortunately I cannot send you a shorter PDF. I got the PDF from people in 
Japan who use our software and complained that iText is not working. They said 
their PDFs are generated by some software, so I cannot recreate a shorter PDF.

But I have attached the Unicode file for the tic_dogu2 PDF (extracted with 
PDFlib, a commercial product).

 

Thanks + Greetings

 

Michael 

 


From: Kevin Day [mailto:[email protected]] 
Sent: Friday, 19 December 2008 01:43
To: IText Questions
Subject: Re: [iText-questions] extracting text from pdfs with japanese data

 

Michael-

 

Can you please send a PDF that uses the font in question, but is *simple* - 
maybe containing 2 lines with 3 or 4 words in each?

 

Also, please send a unicode file that has the text for those files.  I can't 
look at the fonts themselves and figure out whether the decoding I'm doing is 
actually working, but I can compare the results to a unicode file that has what 
the results should be.

 

- K

 

>      
>     ----------------------- Original Message -----------------------
>       
>     From: "Hoppe, Michael" <[email protected]>
>     To: "Post all your questions about iText here"
> <[email protected]>
>     Cc:
>     Date: Wed, 17 Dec 2008 17:12:58 +0100
>     Subject: Re: [iText-questions] extracting text from
> pdfs with japanese data
>       
>     Hi all,
>      
>     Attached are the PDFs I had problems with (I sent
> them once before):
>     content1.pdf gives: java.io.IOException: '>' not
> expected at file pointer 39040
>     tic_dogu2.pdf gives: java.lang.NullPointerException,
> because the font is not embedded in the PDF
>
>     Text can be extracted from content1.pdf with the Adobe
> viewer bean (another open source library that we don't want
> to use for our project for various reasons), so I don't think
> there is anything wrong with the file itself.
>      
>     Greetings
>      
>     Michael
>      
>     From: Kevin Day [mailto:[email protected]]
>     Sent: Wednesday, 17 December 2008 15:31
>     To: IText Questions
>     Subject: Re: [iText-questions] extracting text from
> pdfs with japanese data
>      
>     CMapAwareDocumentFont has this parsing via the CMap 
> class - this encapsulates the parsing behind an object, and 
> makes it a lot easier to deal with.
>      
>     I think that the biggest thing here is actually finding
> the appropriate CMap data byte stream (either from embedded
> data in the PDF, or from the file system) - right now,
> locating the CMap information is a weak point in the content parser.
>
>     If the cmap data is included in a jar on the classpath,
> then the CMap could absolutely be read from the jar.
>      
>     Can the OP please send a PDF that demonstrates the 
> issue?  I'll take a look at the font information and see how 
> tough it would be to add this type of lookup if TOUNICODE 
> isn't available.
>      
>     - K
>      
>     ----------------------- Original Message -----------------------
>       
>     From: "Paulo Soares" <[email protected]>
>     To: "Post all your questions about iText here"
> <[email protected]>
>     Cc: 
>     Date: Tue, 16 Dec 2008 09:55:36 -0000
>     Subject: Re: [iText-questions] extracting text from 
> pdfs with japanese data
>       
>     There's code in PdfEncodings to parse the cmaps and
> convert to/from Unicode.
>     The font contains the cmap name.
>     
>     Paulo
>
>     ----- Original Message ----- 
>     From: "1T3XT info" <[email protected]>
>     To: "Post all your questions about iText here"
>     <[email protected]>
>     Sent: Tuesday, December 16, 2008 9:19 AM
>     Subject: Re: [iText-questions] extracting text from 
> pdfs with japanese data
>     
>     
>     Hoppe, Michael wrote:
>     > The CMap-files are included in the 
> iTextAsianCmaps.jar. So couldn't they
>     > be read from that jar in case there is no font 
> information in the pdf?
>     
>     I'm just thinking out loud here, I didn't dive into the
> problem yet,
>     but: do you think it's possible for iText to find which
> CMap-file is to
>     be inspected based on the font information available
> in the PDF?
>     
>     As Kevin already said: this part of iText is pretty 
> new. We're all
>     excited about it, but for the moment it's all highly 
> experimental.
>     -- 
>     This answer is provided by 1T3XT BVBA
>     http://www.1t3xt.com/ - http://www.1t3xt.info


_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.1t3xt.com/docs/book.php

-------------------------------------------------------

Fachinformationszentrum Karlsruhe, Gesellschaft für wissenschaftlich-technische 
Information mbH. 
Registered office: Eggenstein-Leopoldshafen; registered at Amtsgericht Mannheim 
(district court), HRB 101892. 
Managing Director: Sabine Brünger-Weilandt. 
Chairman of the Supervisory Board: MinR Hermann Riehl.
