RE: Missing greek character of mu from doc extraction

teressa kim Thu, 25 Jun 2015 01:42:40 -0700

Hi Dominik

Here is my Java code, and I have attached a Word document for you to look at.
The document contains three Greek mu symbols. The one in the first line, next to
the "5", is missing from the HTML output. The other two symbols are converted
correctly.


Thank you
Teresa.


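import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.WordToHtmlUtils;
import org.w3c.dom.Document;

public class TestWordtoHtmlConverter {

    public static void main(String[] args) {
        // args[0] is the path to the .doc file to convert
        try (FileInputStream in = new FileInputStream(args[0])) {
            HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(in);

            // The converter builds its HTML output into an empty DOM document
            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                    DocumentBuilderFactory.newInstance().newDocumentBuilder()
                            .newDocument());

            wordToHtmlConverter.processDocument(wordDocument);
            Document htmlDocument = wordToHtmlConverter.getDocument();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DOMSource domSource = new DOMSource(htmlDocument);
            StreamResult streamResult = new StreamResult(out);

            // Serialize the DOM as indented UTF-8 HTML
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer serializer = tf.newTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            serializer.setOutputProperty(OutputKeys.INDENT, "yes");
            serializer.setOutputProperty(OutputKeys.METHOD, "html");
            serializer.transform(domSource, streamResult);
            out.close();

            // Decode with the same charset the serializer wrote
            String result = new String(out.toByteArray(), StandardCharsets.UTF_8);
            System.out.println(result);
        } catch (Exception e) {
            // Report failures instead of silently swallowing them
            e.printStackTrace();
        }
    }
}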
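
To narrow the problem down, it may help to look at the code points directly.
The sketch below is only an illustration and does not use POI at all. It
assumes the missing glyph is the private-use code point U+F06D from the link
quoted below, and that such code points are a Symbol-font byte offset by
0xF000. The class name, the helper methods and the small lookup table are
hypothetical and deliberately incomplete.

```java
import java.util.HashMap;
import java.util.Map;

public class SymbolCodePointSketch {

    // Hypothetical, incomplete table: Symbol-font byte -> Unicode code point.
    private static final Map<Integer, Integer> SYMBOL_TO_UNICODE = new HashMap<>();
    static {
        SYMBOL_TO_UNICODE.put(0x61, 0x03B1); // 'a' -> Greek small alpha
        SYMBOL_TO_UNICODE.put(0x62, 0x03B2); // 'b' -> Greek small beta
        SYMBOL_TO_UNICODE.put(0x6D, 0x03BC); // 'm' -> Greek small mu
        SYMBOL_TO_UNICODE.put(0x70, 0x03C0); // 'p' -> Greek small pi
    }

    /** Formats a code point as upper-case hex, like the hexCode statement in the thread. */
    public static String hex(int codePoint) {
        return Integer.toHexString(codePoint).toUpperCase();
    }

    /**
     * Assumption: code points in U+F020..U+F0FF carry a Symbol-font byte
     * in their low 8 bits. Anything unknown is returned unchanged.
     */
    public static int toUnicode(int codePoint) {
        if (codePoint >= 0xF020 && codePoint <= 0xF0FF) {
            Integer mapped = SYMBOL_TO_UNICODE.get(codePoint - 0xF000);
            if (mapped != null) {
                return mapped;
            }
        }
        return codePoint;
    }

    public static void main(String[] args) {
        // Made-up sample text containing the private-use mu and a bracket
        String sample = "5\uF06Dm (";
        for (int i = 0; i < sample.length(); i += Character.charCount(sample.codePointAt(i))) {
            int cp = sample.codePointAt(i);
            String mapped = new String(Character.toChars(toUnicode(cp)));
            System.out.println(hex(cp) + " -> " + mapped);
        }
    }
}
```

Running the same kind of loop over the paragraph text would show whether the
missing symbol arrives as U+F06D, as a plain bracket, or as something else.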


----------------------------------------
> Date: Sat, 20 Jun 2015 11:47:13 +0200
> Subject: Re: Missing greek character of mu from doc extraction
> From: [email protected]
> To: [email protected]
>
> Hi,
>
> Can you provide a sample document and the java code that you are using
> so it is easier to try to reproduce this?
>
> Thanks... Dominik.
>
> On Thu, Jun 4, 2015 at 10:19 AM, teressa kim <[email protected]> wrote:
>> Hi
>>
>> I have observed that the third Greek mu character "μ" in a Word doc file 
>> is not extracted when converting to an HTML file using the 
>> WordToHtmlConverter class. The mu character is 
>> http://www.scarfboy.com/coding/unicode-tool?s=U%2BF06D
>>
>> Further, I also noticed that when I applied the following statement 
>> to the mu character, I got "0028", which I think is the code for the "(" 
>> left bracket.
>>
>> String hexCode = 
>> Integer.toHexString(paragraph.text().codePointAt(index)).toUpperCase();
>>
>> Could you please help me extract this mu character from the Word 
>> document?
>>
>> Thanks
>> T.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>

