Hi Dominik,
This is my Java code, and I have attached a Word document for you to look at.
The document contains three Greek mu symbols. The one in the first line, next to "5",
is not converted to HTML.
It is missing from the output. The other two symbols are fine.
Thank you,
Teresa
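import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.WordToHtmlUtils;
import org.w3c.dom.Document;

public class TestWordtoHtmlConverter {
    public static void main(String[] args) {
        try {
            // args[0] is the path to the .doc file to convert
            HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(
                    new FileInputStream(args[0]));
            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                    DocumentBuilderFactory.newInstance().newDocumentBuilder()
                            .newDocument());
            wordToHtmlConverter.processDocument(wordDocument);
            Document htmlDocument = wordToHtmlConverter.getDocument();

            // Serialize the HTML DOM into an in-memory buffer as UTF-8
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DOMSource domSource = new DOMSource(htmlDocument);
            StreamResult streamResult = new StreamResult(out);
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer serializer = tf.newTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            serializer.setOutputProperty(OutputKeys.INDENT, "yes");
            serializer.setOutputProperty(OutputKeys.METHOD, "html");
            serializer.transform(domSource, streamResult);
            out.close();

            // Decode with the same charset the serializer wrote
            String result = out.toString("UTF-8");
            System.out.println(result);
        } catch (Exception e) {
            // Report failures instead of silently swallowing them
            e.printStackTrace();
        }
    }
}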
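In case it helps with reproducing, here is a small sketch of how I understand the code point from my first mail (U+F06D). It assumes the missing mu is a Symbol-font character that Word stores in the private use range U+F000 to U+F0FF, so that subtracting 0xF000 gives the Symbol font byte 0x6D ("m"), which the Symbol font draws as mu. The class SymbolPuaSketch and its methods are hypothetical names of mine and not part of POI, and the table lists only a few Greek letters for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Apache POI. */
class SymbolPuaSketch {

    // Partial table (assumption: Symbol font encoding), byte -> Unicode code point
    private static final Map<Integer, Integer> SYMBOL_TO_UNICODE =
            new HashMap<Integer, Integer>();
    static {
        SYMBOL_TO_UNICODE.put(0x61, 0x03B1); // 'a' is drawn as alpha
        SYMBOL_TO_UNICODE.put(0x62, 0x03B2); // 'b' is drawn as beta
        SYMBOL_TO_UNICODE.put(0x6D, 0x03BC); // 'm' is drawn as mu
        SYMBOL_TO_UNICODE.put(0x70, 0x03C0); // 'p' is drawn as pi
    }

    /**
     * Maps a code point in U+F000..U+F0FF to a regular Unicode code point
     * if the table knows it; any other code point is returned unchanged.
     */
    static int toUnicode(int codePoint) {
        if (codePoint >= 0xF000 && codePoint <= 0xF0FF) {
            Integer mapped = SYMBOL_TO_UNICODE.get(codePoint - 0xF000);
            if (mapped != null) {
                return mapped;
            }
        }
        return codePoint;
    }

    /** Formats a code point as at least four upper-case hex digits. */
    static String hex(int codePoint) {
        return String.format("%04X", codePoint);
    }

    public static void main(String[] args) {
        int pua = 0xF06D;
        // prints "U+F06D -> U+03BC"
        System.out.println("U+" + hex(pua) + " -> U+" + hex(toUnicode(pua)));
    }
}
```

This only shows the arithmetic on the code point. It does not explain why the converter drops the character or why I see "0028" in the paragraph text.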
----------------------------------------
> Date: Sat, 20 Jun 2015 11:47:13 +0200
> Subject: Re: Missing greek character of mu from doc extraction
> From: [email protected]
> To: [email protected]
>
> Hi,
>
> Can you provide a sample document and the java code that you are using
> so it is easier to try to reproduce this?
>
> Thanks... Dominik.
>
> On Thu, Jun 4, 2015 at 10:19 AM, teressa kim <[email protected]> wrote:
>> Hi
>>
>> I have observed that the third Greek mu character "μ" in a Word .doc file
>> is not extracted when converting to an HTML file using the WordToHtmlConverter
>> class. The mu character is
>> http://www.scarfboy.com/coding/unicode-tool?s=U%2BF06D
>>
>> I also noticed that when I applied the following statement to the mu
>> character, I got "0028", which I think is the code for "(" (left
>> bracket).
>>
>> String hexCode =
>> Integer.toHexString(paragraph.text().codePointAt(index)).toUpperCase();
>>
>> Could you please help me extract this mu character from the .doc
>> document?
>>
>> Thanks
>> T.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>