Re: [iText-questions] extracting text from pdfs with japanese data

Hoppe, Michael Tue, 16 Dec 2008 00:31:21 -0800

The CMap files are included in iTextAsianCmaps.jar. So couldn't they be 
read from that jar when the PDF contains no font information?
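
A minimal sketch of the classpath lookup this question has in mind, using only the JDK. The resource directory com/lowagie/text/pdf/fonts/cmaps/ is an assumption about how iTextAsianCmaps.jar is laid out, and CMapResourceLoader and its methods are hypothetical names, not iText API. Turning the returned bytes into a usable code-to-Unicode mapping is a separate step that is not shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper: loads a named CMap file from the classpath, if present. */
final class CMapResourceLoader {

    // Assumed location of the CMap files inside iTextAsianCmaps.jar.
    static final String ASSUMED_CMAP_DIR = "com/lowagie/text/pdf/fonts/cmaps/";

    private CMapResourceLoader() {}

    /** Returns the raw bytes of the resource, or null if it is not on the classpath. */
    static byte[] loadResource(String resourcePath) {
        ClassLoader loader = CMapResourceLoader.class.getClassLoader();
        InputStream in = (loader != null)
                ? loader.getResourceAsStream(resourcePath)
                : ClassLoader.getSystemResourceAsStream(resourcePath);
        if (in == null) {
            return null; // jar missing from the classpath, or unknown name
        }
        try {
            try {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                return out.toByteArray();
            } finally {
                in.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience wrapper for a predefined CMap name such as "EUC-H". */
    static byte[] loadCMap(String cmapName) {
        return loadResource(ASSUMED_CMAP_DIR + cmapName);
    }
}
```

A null result from loadCMap("EUC-H") would mean either that the jar is not on the classpath or that the assumed directory is wrong.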


 

Greetings

 

Michael

 

Dr. Michael Hoppe
ePublishing & eScience
Development & Applied Research
Phone +49 7247 808-251
Fax +49 7247 808-133
[email protected]


FIZ Karlsruhe
Hermann-von-Helmholtz-Platz 1
76344 Eggenstein-Leopoldshafen, Germany

www.fiz-karlsruhe.de <http://www.fiz-karlsruhe.de/> 

From: Leonard Rosenthol [mailto:[email protected]] 
Sent: Tuesday, 16 December 2008 01:03
To: Post all your questions about iText here
Subject: Re: [iText-questions] extracting text from pdfs with japanese data

 

With no embedded font or CMap, you need external info (either the font itself, 
a separate CMap file, or both).

 

Leonard

 

On Dec 15, 2008, at 11:50 AM, Kevin Day wrote:





I ran these files through com.lowagie.text.pdf.parser.PdfContentReaderTool and 
I actually see the tokeniser fail on the first, then the font read fail on the 
second.

 

Here's the exception from content1.pdf:

 

Exception in thread "main" ExceptionConverter: java.io.IOException: '>' not 
expected at file pointer 39040

 

I suspect the issue with content1.pdf is that the file uses an encoding that 
is not built into standard Java. I'm not entirely sure how this sort of thing 
gets handled, but the PDF file is processed byte-by-byte, so there is no 
character set transformation going on. I'd like to hear other people's 
opinions on this.
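
Whether a particular encoding name is known to the running JVM can be checked directly, independent of how iText reads the file. The sketch below only illustrates that check: EUC-H is the name of a predefined PDF CMap (it appears in the error message quoted in the original post), while EUC-JP and Shift_JIS are Java charset names that full JDKs usually ship but that are not among the charsets every JVM is required to support. The class name EncodingCheck is made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

final class EncodingCheck {
    private EncodingCheck() {}

    /** True if the JVM supports the named charset; false for unknown or malformed names. */
    static boolean isKnownToJvm(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false; // the name contains characters that no charset name may have
        }
    }

    public static void main(String[] args) {
        // Prints one line per name; the result for the Japanese charsets depends on the JVM.
        for (String name : new String[] {"UTF-8", "EUC-JP", "Shift_JIS", "EUC-H"}) {
            System.out.println(name + " -> " + isKnownToJvm(name));
        }
    }
}
```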

 

 

 

Exception from tic_dogu2.pdf:

 

Exception in thread "main" java.lang.NullPointerException

at com.lowagie.text.pdf.PdfReader.getStreamBytes(PdfReader.java:2089)

 

This one is happening because the font resource cannot be recovered from the 
file (the font isn't embedded). This means that font metrics and CMap info 
would have to be recovered from an external file (I have no idea how to do 
this; it may be as simple as reading a CMap from an external source). One 
thing I note is that this file has no ToUnicode entry in any of the font 
references, which definitely implies that reading the CMap from an external 
file would be necessary.

 

I believe that this would involve an adjustment to DocumentFont so that it 
gets the ToUnicode map from an external source if it isn't specified in the 
PDF itself. This may also require adjustment to the CMapAwareDocumentFont 
class, probably the addition of a method to DocumentFont called 
getToUnicodeBytes() that has the additional logic. Of course, if we are doing 
surgery in that area, we should probably adjust fillMetrics so it uses a CMap 
object directly (instead of the toUnicode byte array), in which case the 
method in DocumentFont should be getCMap() (which would be a lot more object 
oriented, don't you think? :-)  ).
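
The fallback order proposed here can be sketched independently of iText. Everything in the block below is hypothetical: ToUnicodeLookup, CMapSource, and the use of a plain Map<Integer, String> as a stand-in for a CMap object illustrate the idea only and are not the actual DocumentFont or CMapAwareDocumentFont code.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for "a place a code-to-Unicode table may come from". */
interface CMapSource {
    /** Returns the table for the given CMap name, or empty if this source has none. */
    Optional<Map<Integer, String>> find(String cmapName);
}

/** Hypothetical getCMap()-style lookup: table from the PDF first, external table second. */
final class ToUnicodeLookup {
    private final CMapSource embedded;
    private final CMapSource external;

    ToUnicodeLookup(CMapSource embedded, CMapSource external) {
        this.embedded = embedded;
        this.external = external;
    }

    /** Prefers the table found in the PDF; falls back to the external source. */
    Optional<Map<Integer, String>> getCMap(String cmapName) {
        Optional<Map<Integer, String>> fromPdf = embedded.find(cmapName);
        if (fromPdf.isPresent()) {
            return fromPdf;
        }
        return external.find(cmapName);
    }

    /** Decodes one character code, or returns null when no table or no entry exists. */
    String decode(String cmapName, int code) {
        return getCMap(cmapName).map(table -> table.get(code)).orElse(null);
    }
}
```

Keeping the two sources behind one interface means the calling code does not need to know whether a table came from the PDF or from an external file.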

 

 

At this stage, I think we need to get input from other folks so we can figure 
out how to proceed.

 

- K

 

 

----------------------- Original Message -----------------------

  

From: "Hoppe, Michael" <[email protected]>

To: <[email protected]>

Cc: 

Date: Mon, 15 Dec 2008 13:45:47 +0100

Subject: [iText-questions] extracting text from pdfs with japanese data

  

Dear all,

My name is Michael Hoppe. I work for the eSciDoc project, which is funded by 
the German Ministry of Education and Research (http://www.escidoc.org/). My 
part in the project is the search and indexing component, where we index 
metadata and full texts in PDF. For the indexing we need to extract the text 
from the PDF, using iText. I now have problems extracting the text from 
Japanese PDFs where the font is not embedded. I either get garbled data or an 
exception that says 'encoding not supported EUC-H'. Does anyone have an idea 
how to get the correct text for Japanese documents whose fonts are not 
embedded? Two PDFs are attached.

Thanks in advance

M.Hoppe

 

Code Snippet:

// Needs: import com.lowagie.text.pdf.PdfReader;
//        import com.lowagie.text.pdf.PRTokeniser;
try {
    PdfReader reader = new PdfReader(inputFile);
    PRTokeniser token;
    StringBuilder builder = new StringBuilder();
    // Page numbers are 1-based.
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        byte[] pageBytes = reader.getPageContent(i);
        if (pageBytes != null) {
            token = new PRTokeniser(pageBytes);
            while (true) {
                try {
                    if (!token.nextToken()) {
                        break; // end of this page's content
                    }
                    // Collect string tokens only; operators and numbers are skipped.
                    if (token.getTokenType() == PRTokeniser.TK_STRING) {
                        builder.append(token.getStringValue());
                    }
                } catch (Exception e) {
                    // Log the error and try the next token.
                    System.out.println(e);
                }
            }
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
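
The snippet above appends the raw token string as it comes from the tokeniser. One possible post-processing step is sketched below, under two assumptions that nothing in this thread confirms: that the tokeniser returns one char per content-stream byte, and that the strings are encoded as EUC-JP (the encoding behind the predefined EUC-H CMap). RawStringDecoder is a hypothetical helper, not part of iText, and it would not help for fonts that use Identity-H or a custom encoding, where a real CMap or ToUnicode table is needed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RawStringDecoder {
    private RawStringDecoder() {}

    /**
     * Treats each char of rawToken as one byte (0-255) and decodes the
     * resulting bytes with the given Java charset, for example "EUC-JP".
     * Chars above 255 cannot be mapped back to a byte and become '?'.
     */
    static String decode(String rawToken, String javaCharsetName) {
        byte[] bytes = rawToken.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, Charset.forName(javaCharsetName));
    }
}
```

In the loop above this would replace the append call with builder.append(RawStringDecoder.decode(token.getStringValue(), "EUC-JP")).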

 


-------------------------------------------------------
 
Fachinformationszentrum Karlsruhe, Gesellschaft für wissenschaftlich-technische 
Information mbH. 
Registered office: Eggenstein-Leopoldshafen, Amtsgericht Mannheim HRB 
101892. 
Managing Director: Sabine Brünger-Weilandt. 
Chairman of the Supervisory Board: MinR Hermann Riehl.
 

------------------------------------------------------------------------------
SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
The future of the web can't happen without you.  Join us at MIX09 to help
pave the way to the Next Web now. Learn more and register at
http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/

_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.1t3xt.com/docs/book.php

