The CMap files are included in iTextAsianCmaps.jar. So couldn't they be
read from that jar when the PDF contains no font information?
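As a rough illustration of the idea, the sketch below reads a named CMap as a classpath resource and returns null when it is missing. The class name, the method name and the resource path are assumptions made for illustration; the actual location of the files inside iTextAsianCmaps.jar should be checked against the jar itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class CMapResourceLoader {
    // ASSUMPTION: directory of the CMap files inside the jar; verify against your copy.
    static final String ASSUMED_CMAP_PATH = "com/lowagie/text/pdf/fonts/cmaps/";

    /** Returns the raw bytes of the named CMap, or null if it is not on the classpath. */
    public static byte[] loadCMapBytes(String cmapName) {
        ClassLoader loader = CMapResourceLoader.class.getClassLoader();
        InputStream in = loader.getResourceAsStream(ASSUMED_CMAP_PATH + cmapName);
        if (in == null) {
            return null; // jar not on the classpath, or no CMap with that name
        }
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            try {
                in.close();
            } catch (IOException ignored) {
                // nothing useful can be done if close fails
            }
        }
    }

    public static void main(String[] args) {
        // "EUC-H" is the encoding name from the error message in this thread.
        byte[] bytes = loadCMapBytes("EUC-H");
        System.out.println(bytes == null
                ? "EUC-H not found on classpath"
                : "EUC-H loaded, " + bytes.length + " bytes");
    }
}
```

This only fetches the bytes. Turning character codes into Unicode text would still need a parser for the CMap and, most likely, a further mapping from CIDs to Unicode.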
Greetings
Michael
Dr. Michael Hoppe
ePublishing & eScience
Development & Applied Research
Phone +49 7247 808-251
Fax +49 7247 808-133
[email protected]
FIZ Karlsruhe
Hermann-von-Helmholtz-Platz 1
76344 Eggenstein-Leopoldshafen, Germany
www.fiz-karlsruhe.de <http://www.fiz-karlsruhe.de/>
From: Leonard Rosenthol [mailto:[email protected]]
Sent: Tuesday, 16 December 2008 01:03
To: Post all your questions about iText here
Subject: Re: [iText-questions] extracting text from pdfs with japanese data
No font or cmap - you need external info (either the font itself, a separate
cmap file or both).
Leonard
On Dec 15, 2008, at 11:50 AM, Kevin Day wrote:
I ran these files through com.lowagie.text.pdf.parser.PdfContentReaderTool and
I actually see the tokeniser fail on the first, then the font read fail on the
second.
Here's the exception from content1.pdf:
Exception in thread "main" ExceptionConverter: java.io.IOException: '>' not
expected at file pointer 39040
I suspect the issue with content1.pdf is that the encoding used in the file
is not one that is built into standard Java. I'm not entirely sure how this
sort of thing gets handled, but the PDF file is processed byte-by-byte, so
there is no character set transformation going on. I'd like to hear other
people's opinions on this.
Exception from tic_dogu2.pdf:
Exception in thread "main" java.lang.NullPointerException
at com.lowagie.text.pdf.PdfReader.getStreamBytes(PdfReader.java:2089)
This one is happening because the font resource cannot be recovered from the
file (the font isn't embedded). This means that font metrics and CMap info
would have to be recovered from an external file (I have no idea how to do
this; it may be as simple as reading a CMap from an external source). I also
note that this file has no ToUnicode entry in any of the font references,
which strongly suggests that reading the CMap from an external file would be
necessary.
I believe this would involve adjusting DocumentFont so that it gets the
ToUnicode map from an external source if one isn't specified in the PDF
itself. This may also require changes to the CMapAwareDocumentFont class. It
would probably mean adding a method to DocumentFont called getToUnicodeBytes()
that holds the additional logic. Of course, if we are doing surgery in that
area, we should probably adjust fillMetrics so it uses a CMap object directly
(instead of the toUnicode byte array), in which case the method in
DocumentFont should be getCMap() (which would be a lot more object oriented,
don't you think? :-) ).
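The fallback described above could look roughly like the following sketch. It is plain Java and does not touch iText: ExternalCMapSource and resolveToUnicodeBytes are hypothetical names standing in for whatever getToUnicodeBytes() would end up doing inside DocumentFont, and the bytes in main are placeholders rather than a real CMap.

```java
class ToUnicodeFallback {
    /** Hypothetical lookup for CMap data that lives outside the PDF. */
    public interface ExternalCMapSource {
        /** Returns the CMap bytes for the given encoding name, or null if unknown. */
        byte[] find(String encodingName);
    }

    /**
     * Prefers the ToUnicode bytes found in the PDF. If there are none, asks the
     * external source using the font's encoding name. Returns null when neither
     * is available, so the caller can report the problem instead of guessing.
     */
    public static byte[] resolveToUnicodeBytes(byte[] embedded,
                                               String encodingName,
                                               ExternalCMapSource external) {
        if (embedded != null && embedded.length > 0) {
            return embedded;
        }
        if (encodingName == null || external == null) {
            return null;
        }
        return external.find(encodingName);
    }

    public static void main(String[] args) {
        // Stand-in source that only knows one encoding name.
        ExternalCMapSource source = new ExternalCMapSource() {
            public byte[] find(String encodingName) {
                return "EUC-H".equals(encodingName) ? new byte[] {1, 2, 3} : null;
            }
        };
        byte[] fromPdf = resolveToUnicodeBytes(new byte[] {9}, "EUC-H", source);
        byte[] fromExternal = resolveToUnicodeBytes(null, "EUC-H", source);
        byte[] missing = resolveToUnicodeBytes(null, "Unknown-H", source);
        System.out.println("embedded wins: " + (fromPdf.length == 1));
        System.out.println("external used: " + (fromExternal != null));
        System.out.println("nothing found: " + (missing == null));
    }
}
```

Returning a CMap object instead of a byte array, as suggested for getCMap(), would keep the same decision order and only change the return type.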
At this stage, I think we need to get input from other folks so we can figure
out how to proceed.
- K
----------------------- Original Message -----------------------
From: "Hoppe, Michael" <[email protected]>
To: <[email protected]>
Cc:
Date: Mon, 15 Dec 2008 13:45:47 +0100
Subject: [iText-questions] extracting text from pdfs with japanese data
Dear all,
My name is Michael Hoppe. I work for the eSciDoc project, which is funded by
the German Ministry of Education and Research (http://www.escidoc.org/). My
part in the project is the search and indexing component, where we index
metadata and full texts in PDF. For the indexing we need to extract the text
from the PDF, using iText. I now have problems extracting the text from
Japanese PDFs where the font is not embedded. I either get garbled data or an
exception that says 'encoding not supported EUC-H'.
Does anyone have an idea how to get the correct text for a Japanese document
whose font is not embedded? Two PDFs are attached.
Thanks in advance
M.Hoppe
Code Snippet:
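import com.lowagie.text.pdf.PRTokeniser;
import com.lowagie.text.pdf.PdfReader;

// Fragment from a method body; inputFile (path of the PDF) is defined elsewhere.
try {
    PdfReader reader = new PdfReader(inputFile);
    StringBuilder builder = new StringBuilder();
    // Page numbers are 1-based.
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        byte[] pageBytes = reader.getPageContent(i);
        if (pageBytes == null) {
            continue;
        }
        PRTokeniser token = new PRTokeniser(pageBytes);
        while (true) {
            try {
                if (!token.nextToken()) {
                    break; // end of this page's content stream
                }
                // Collects the raw string operands only; no font encoding or
                // CMap is applied here, so non-Latin text is not decoded.
                if (token.getTokenType() == PRTokeniser.TK_STRING) {
                    builder.append(token.getStringValue());
                }
            } catch (Exception e) {
                // Log the bad token and keep reading the rest of the page.
                System.out.println(e);
            }
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}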
-------------------------------------------------------
Fachinformationszentrum Karlsruhe, Gesellschaft für wissenschaftlich-technische
Information mbH.
Sitz der Gesellschaft: Eggenstein-Leopoldshafen, Amtsgericht Mannheim HRB
101892.
Geschäftsführerin: Sabine Brünger-Weilandt.
Vorsitzender des Aufsichtsrats: MinR Hermann Riehl.
_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions
Buy the iText book: http://www.1t3xt.com/docs/book.php