Re: [iText-questions] extracting text from pdfs with japanese data

Leonard Rosenthol Mon, 15 Dec 2008 16:04:47 -0800

No font or cmap - you need external info (either the font itself, a separate cmap file or both).


Leonard


On Dec 15, 2008, at 11:50 AM, Kevin Day wrote:

I ran these files through com.lowagie.text.pdf.parser.PdfContentReaderTool and I actually see the tokeniser fail on the first, then the font read fail on the second.
Here's the exception from content1.pdf:
Exception in thread "main" ExceptionConverter: java.io.IOException: '>' not expected at file pointer 39040
I suspect the issue with content1.pdf is that the encoding on the file itself is not something that is built into standard Java?? I'm not entirely sure how this sort of thing gets handled, but the PDF file is processed byte-by-byte, so there is no character set transformation going on... I'd have to hear other people's opinions on this.
Exception from tic_dogu2.pdf:

Exception in thread "main" java.lang.NullPointerException
at com.lowagie.text.pdf.PdfReader.getStreamBytes(PdfReader.java:2089)
This one is happening because the font resource cannot be recovered from the file (the font isn't embedded). This means that font metrics and CMap info would have to be recovered from an external file (no idea how to do this - it may be as simple as reading a CMap from an external source). One thing that I note is that this file has no ToUnicode entry in any of the font references, which definitely implies that reading CMap from an external file would be necessary.
I believe that this would involve an adjustment to the DocumentFont to have it get the ToUnicode map from an external source if it isn't specified in the PDF itself. This may also require adjustment to the CMapAwareDocumentFont class. Probably addition of a method to DocumentFont called getToUnicodeBytes() that has the additional logic. Of course if we are doing surgery in that area, we should probably make adjustments to fillMetrics so it uses a CMap object directly (instead of the toUnicode byte array) - in which case the method in DocumentFont should be getCMap() (which would be a lot more object oriented, don't you think? :-) ).
At this stage, I think we need to get input from other folks so we can figure out how to proceed.
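[Editor's sketch] The fallback described above (use the ToUnicode data from the PDF when it is there, otherwise consult an external source) might look roughly like the following. This is only an illustration in plain Java: ToUnicodeResolver, registerExternal and the lookup by font name are hypothetical and are not part of iText. Only the method name getToUnicodeBytes() comes from the proposal above, and the external source is modelled as an in-memory map rather than real CMap files.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper (not an iText class): picks the ToUnicode bytes for a font. */
final class ToUnicodeResolver {
    // External CMap data keyed by font name. Keying by font name is an assumption;
    // a real implementation might key by encoding or by CID system info instead.
    private final Map<String, byte[]> externalCMaps = new HashMap<>();

    /** Registers externally supplied ToUnicode data for a font (hypothetical API). */
    void registerExternal(String fontName, byte[] cmapBytes) {
        externalCMaps.put(fontName, cmapBytes.clone());
    }

    /**
     * Returns the ToUnicode bytes embedded in the PDF when present; otherwise
     * falls back to the externally registered data, or empty if there is none.
     */
    Optional<byte[]> getToUnicodeBytes(String fontName, byte[] embeddedToUnicode) {
        if (embeddedToUnicode != null && embeddedToUnicode.length > 0) {
            return Optional.of(embeddedToUnicode);
        }
        byte[] external = externalCMaps.get(fontName);
        if (external == null) {
            return Optional.empty();
        }
        return Optional.of(external.clone());
    }
}
```

A caller would register the external data once, then ask the resolver for each font; an empty result means the text for that font cannot be mapped to Unicode.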
- K


----------------------- Original Message -----------------------

From: "Hoppe, Michael" <[email protected]>
To: <[email protected]>
Cc:
Date: Mon, 15 Dec 2008 13:45:47 +0100
Subject: [iText-questions] extracting text from pdfs with japanese data
Dear all,
My name is Michael Hoppe, I work for the eSciDoc project, which is funded by the German Ministry of Education and Research (http://www.escidoc.org). My part in the project is the search and indexing component, where we index metadata and fulltexts in pdf. For the indexing we need to extract the text out of the pdf, using iText. I now have problems extracting the text from japanese pdfs where the font is not embedded. I either get garbled data or an exception that says ‘encoding not supported EUC-H’. Does anyone have an idea how to get the correct text for a Japanese document with the font not embedded? Two pdfs are attached.
Thanks in advance
M.Hoppe

Code Snippet:
try {
    PdfReader reader = new PdfReader(inputFile);
    PRTokeniser token;
    StringBuilder builder = new StringBuilder();
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        byte[] pageBytes = reader.getPageContent(i);
        if (pageBytes != null) {
            token = new PRTokeniser(pageBytes);
            while (true) {
                try {
                    if (!token.nextToken()) {
                        break;
                    }
                    if (token.getTokenType() == PRTokeniser.TK_STRING) {
                        builder.append(token.getStringValue());
                    }
                } catch (Exception e) {
                    System.out.println(e);
                }
            }
        }
    }
} catch (Exception e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
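
[Editor's sketch] EUC-H is the name of a predefined PDF CMap (JIS X 0208 character set, EUC-JP encoding), not a Java charset name. If the string tokens collected above really are EUC-JP bytes, one possible workaround is to decode them with Java's own EUC-JP charset. The sketch below rests on two assumptions that should be checked against the actual files: that the font's encoding is EUC-H, and that the token string holds exactly one char per raw byte. EucJpTokenDecoder is a hypothetical helper, not an iText class.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not an iText class): re-decodes raw string tokens as EUC-JP. */
final class EucJpTokenDecoder {
    // EUC-JP ships with standard JDK builds; Charset.forName throws if it is missing.
    private static final Charset EUC_JP = Charset.forName("EUC-JP");

    /**
     * Assumes rawTokenValue holds one char (0-255) per original byte.
     * ISO-8859-1 turns those chars back into the same byte values,
     * which are then decoded as EUC-JP text.
     */
    static String decode(String rawTokenValue) {
        byte[] bytes = rawTokenValue.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, EUC_JP);
    }
}
```

In the loop above this would mean appending EucJpTokenDecoder.decode(token.getStringValue()) instead of the raw value. It ignores fonts that use other encodings, so it is not a general solution.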

Dr. Michael Hoppe
ePublishing & eScience
Development & Applied Research
Phone +49 7247 808-251
Fax +49 7247 808-133
[email protected]


FIZ Karlsruhe
Hermann-von-Helmholtz-Platz 1
76344 Eggenstein-Leopoldshafen, Germany

www.fiz-karlsruhe.de
-------------------------------------------------------
Fachinformationszentrum Karlsruhe, Gesellschaft für wissenschaftlich-technische Information mbH. Registered office: Eggenstein-Leopoldshafen, Amtsgericht Mannheim HRB 101892.
Managing Director: Sabine Brünger-Weilandt.
Chairman of the Supervisory Board: MinR Hermann Riehl.

------------------------------------------------------------------------------
SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
The future of the web can't happen without you.  Join us at MIX09 to help
pave the way to the Next Web now. Learn more and register at
http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/

_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.1t3xt.com/docs/book.php

