No font or cmap - you need external info (either the font itself, a
separate cmap file or both).
Leonard
On Dec 15, 2008, at 11:50 AM, Kevin Day wrote:
I ran these files through
com.lowagie.text.pdf.parser.PdfContentReaderTool and I actually see
the tokeniser fail on the first, then the font read fail on the
second.
Here's the exception from content1.pdf:
Exception in thread "main" ExceptionConverter: java.io.IOException:
'>' not expected at file pointer 39040
I suspect the issue with content1.pdf is that the encoding used in
the file is not one that is built into standard Java. I'm not
entirely sure how this sort of thing gets handled, but the PDF file
is processed byte-by-byte, so there is no character set
transformation going on. I'd like to hear other people's opinions
on this.
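To make the byte-by-byte point concrete, here is a minimal sketch in plain Java (no iText involved). It assumes that a string token's bytes are handed back one char per byte, and that the bytes happen to be EUC-JP, which is the encoding the predefined PDF CMap named EUC-H is based on. The class and method names are made up for illustration, and whether the "EUC-JP" charset is available depends on the JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of iText.
final class RawBytesDemo {

    // Maps every byte to one char (ISO-8859-1), i.e. no real decoding.
    // This mimics reading string bytes without any charset transformation.
    static String asRawChars(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Decodes with the named charset, or returns null if this JVM lacks it.
    static String decode(byte[] bytes, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return null;
        }
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // EUC-JP bytes for the hiragana U+3042 and U+3044.
        byte[] eucJp = {(byte) 0xA4, (byte) 0xA2, (byte) 0xA4, (byte) 0xA4};
        System.out.println("raw chars: " + asRawChars(eucJp));
        System.out.println("decoded:   " + decode(eucJp, "EUC-JP"));
    }
}
```

The same four bytes come out as four unrelated Latin-1 characters in the first case and as two hiragana in the second, which would be consistent with the garbled output described in the original question.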
Exception from tic_dogu2.pdf:
Exception in thread "main" java.lang.NullPointerException
at com.lowagie.text.pdf.PdfReader.getStreamBytes(PdfReader.java:2089)
This one is happening because the font resource cannot be recovered
from the file (the font isn't embedded). This means that font
metrics and CMap info would have to be recovered from an external
file (I have no idea how to do this; it may be as simple as reading
a CMap from an external source). One thing I note is that this file
has no ToUnicode entry in any of the font references, which
definitely implies that reading the CMap from an external file would
be necessary.
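As a rough idea of what "reading a CMap from an external source" could look like, here is a minimal sketch in plain Java. It assumes the external file uses the same beginbfchar/endbfchar syntax as an embedded ToUnicode CMap, with 2-byte codes that each map to a single UTF-16 code unit. It ignores bfrange sections, codespace ranges and the code-to-CID step that a predefined CMap such as EUC-H would really need. BfCharSketch and its methods are hypothetical names, not iText API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; handles only simple bfchar entries.
final class BfCharSketch {

    // Everything between "beginbfchar" and "endbfchar".
    private static final Pattern SECTION =
            Pattern.compile("beginbfchar(.*?)endbfchar", Pattern.DOTALL);

    // One "<code> <unicode>" pair, four hex digits each.
    private static final Pattern PAIR =
            Pattern.compile("<([0-9A-Fa-f]{4})>\\s*<([0-9A-Fa-f]{4})>");

    // Builds a code -> Unicode char table from the CMap text.
    static Map<Integer, Character> parse(String cmapText) {
        Map<Integer, Character> map = new HashMap<>();
        Matcher section = SECTION.matcher(cmapText);
        while (section.find()) {
            Matcher pair = PAIR.matcher(section.group(1));
            while (pair.find()) {
                int code = Integer.parseInt(pair.group(1), 16);
                char unicode = (char) Integer.parseInt(pair.group(2), 16);
                map.put(code, unicode);
            }
        }
        return map;
    }

    // Reads the content two bytes at a time and looks each code up.
    // Unmapped codes become U+FFFD; a trailing odd byte is dropped.
    static String decode(byte[] content, Map<Integer, Character> map) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + 1 < content.length; i += 2) {
            int code = ((content[i] & 0xFF) << 8) | (content[i + 1] & 0xFF);
            Character c = map.get(code);
            out.append(c != null ? c : '\uFFFD');
        }
        return out.toString();
    }
}
```

A real implementation would also have to decide where the external CMap files live and how a font's encoding name selects one of them.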
I believe this would involve adjusting DocumentFont so that it gets
the ToUnicode map from an external source if it isn't specified in
the PDF itself. This may also require changes to the
CMapAwareDocumentFont class. The likely approach is to add a method
to DocumentFont, called getToUnicodeBytes(), that holds the
additional logic. Of course, if we are doing surgery in that area,
we should probably adjust fillMetrics so that it uses a CMap object
directly (instead of the toUnicode byte array), in which case the
method in DocumentFont should be getCMap() (which would be a lot
more object-oriented, don't you think? :-) ).
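The shape of that getCMap() idea could be sketched roughly as follows. This is a self-contained illustration, not iText code: CMap, ExternalCMapSource and FontSketch are hypothetical stand-ins, and the real DocumentFont and CMapAwareDocumentFont classes may need a quite different arrangement.

```java
import java.util.Optional;

// Hypothetical sketch of "embedded ToUnicode first, external CMap second".
final class CMapFallbackSketch {

    /** Stand-in for a parsed CMap: character code to Unicode text. */
    interface CMap {
        String lookup(int code);
    }

    /** Stand-in for a store of CMaps kept outside the PDF, keyed by encoding name. */
    interface ExternalCMapSource {
        Optional<CMap> find(String encodingName);
    }

    /** Stand-in for the font class; not iText's DocumentFont. */
    static final class FontSketch {
        private final String encodingName;
        private final CMap embeddedToUnicode; // null when the PDF has no ToUnicode entry
        private final ExternalCMapSource external;

        FontSketch(String encodingName, CMap embeddedToUnicode, ExternalCMapSource external) {
            this.encodingName = encodingName;
            this.embeddedToUnicode = embeddedToUnicode;
            this.external = external;
        }

        // Prefers the map stored in the PDF; otherwise asks the external source.
        // An empty result means text for this font cannot be mapped to Unicode.
        Optional<CMap> getCMap() {
            if (embeddedToUnicode != null) {
                return Optional.of(embeddedToUnicode);
            }
            return external.find(encodingName);
        }
    }
}
```

Returning a CMap object rather than a byte array keeps the lookup logic in one place, so callers such as fillMetrics would not need to know where the map came from.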
At this stage, I think we need to get input from other folks so we
can figure out how to proceed.
- K
----------------------- Original Message -----------------------
From: "Hoppe, Michael" <[email protected]>
To: <[email protected]>
Cc:
Date: Mon, 15 Dec 2008 13:45:47 +0100
Subject: [iText-questions] extracting text from pdfs with japanese data
Dear all,
My name is Michael Hoppe. I work for the eSciDoc project, which is
funded by the German Ministry of Education and Research
(http://www.escidoc.org). My part in the project is the search and
indexing component, where we index metadata and full texts in PDF.
For the indexing we need to extract the text from the PDFs, using
iText. I now have problems extracting the text from Japanese PDFs
where the font is not embedded. I either get garbled data or an
exception that says ‘encoding not supported EUC-H’. Does anyone have
an idea how to get the correct text for Japanese documents whose
fonts are not embedded? Two PDFs are attached.
Thanks in advance
M.Hoppe
Code Snippet:
try {
    PdfReader reader = new PdfReader(inputFile);
    PRTokeniser token;
    StringBuilder builder = new StringBuilder();
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        byte[] pageBytes = reader.getPageContent(i);
        if (pageBytes != null) {
            token = new PRTokeniser(pageBytes);
            while (true) {
                try {
                    if (!token.nextToken()) {
                        break;
                    }
                    if (token.getTokenType() == PRTokeniser.TK_STRING) {
                        builder.append(token.getStringValue());
                    }
                } catch (Exception e) {
                    System.out.println(e);
                }
            }
        }
    }
} catch (Exception e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
Dr. Michael Hoppe
ePublishing & eScience
Development & Applied Research
Phone +49 7247 808-251
Fax +49 7247 808-133
[email protected]
FIZ Karlsruhe
Hermann-von-Helmholtz-Platz 1
76344 Eggenstein-Leopoldshafen, Germany
www.fiz-karlsruhe.de
------------------------------------------------------------------------------
SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas,
Nevada.
The future of the web can't happen without you. Join us at MIX09 to
help
pave the way to the Next Web now. Learn more and register at
http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions
Buy the iText book: http://www.1t3xt.com/docs/book.php