Re: Missing greek character of mu from doc extraction

Dominik Stadler Fri, 26 Jun 2015 04:40:45 -0700

Hi,

Sorry, but no attachment came through for me. Could you resend it?
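
In the meantime, it may help to dump the code points of the extracted text, along the lines of the codePointAt check quoted below. Here is a minimal sketch, assuming plain Java 8 with no POI on the classpath. The class name CodePointDump, its two helper methods, and the sample string are made up for illustration and are not part of POI. U+F06D from the linked page falls in the Unicode Private Use Area, so the sketch also flags such characters:

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointDump {

    // Returns the upper-case, zero-padded hex value of every code point in the text.
    static List<String> hexCodePoints(String text) {
        List<String> hex = new ArrayList<>();
        text.codePoints().forEach(cp -> hex.add(String.format("%04X", cp)));
        return hex;
    }

    // True if the code point belongs to a Unicode private-use range.
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        // Hypothetical sample: "5", then U+F06D, then a regular Greek mu (U+03BC).
        String sample = "5\uF06D\u03BC";
        for (String hex : hexCodePoints(sample)) {
            int cp = Integer.parseInt(hex, 16);
            System.out.println("U+" + hex + (isPrivateUse(cp) ? " (private use)" : ""));
        }
    }
}
```

If the character really is stored as U+F06D, it has no standard meaning of its own and presumably only looks like a mu because of the font applied to it in the document.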


Thanks... Dominik.

On Thu, Jun 25, 2015 at 10:42 AM, teressa kim <[email protected]> wrote:
> Hi Dominik
>
> This is my Java code, and I have attached a Word document for you to look at.
> It contains three Greek mu symbols; the one in the first line, next to
> "5", is not converted to HTML.
> It is missing from the output. The other two symbols are fine.
>
> Thank you
> Teresa.
>
>
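> import java.io.ByteArrayOutputStream;
> import java.io.FileInputStream;
> import java.nio.charset.StandardCharsets;
>
> import javax.xml.parsers.DocumentBuilderFactory;
> import javax.xml.transform.OutputKeys;
> import javax.xml.transform.Transformer;
> import javax.xml.transform.TransformerFactory;
> import javax.xml.transform.dom.DOMSource;
> import javax.xml.transform.stream.StreamResult;
>
> import org.apache.poi.hwpf.HWPFDocumentCore;
> import org.apache.poi.hwpf.converter.WordToHtmlConverter;
> import org.apache.poi.hwpf.converter.WordToHtmlUtils;
> import org.w3c.dom.Document;
>
> public class TestWordtoHtmlConverter {
>
>     public static void main(String[] args) {
>         // args[0] is the path of the Word document to convert
>         try (FileInputStream in = new FileInputStream(args[0])) {
>             HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(in);
>
>             WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
>                     DocumentBuilderFactory.newInstance().newDocumentBuilder()
>                             .newDocument());
>
>             // Convert the Word document into an HTML DOM tree
>             wordToHtmlConverter.processDocument(wordDocument);
>             Document htmlDocument = wordToHtmlConverter.getDocument();
>             ByteArrayOutputStream out = new ByteArrayOutputStream();
>             DOMSource domSource = new DOMSource(htmlDocument);
>             StreamResult streamResult = new StreamResult(out);
>
>             // Serialize the DOM tree as UTF-8 encoded HTML
>             TransformerFactory tf = TransformerFactory.newInstance();
>             Transformer serializer = tf.newTransformer();
>             serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
>             serializer.setOutputProperty(OutputKeys.INDENT, "yes");
>             serializer.setOutputProperty(OutputKeys.METHOD, "html");
>             serializer.transform(domSource, streamResult);
>             out.close();
>
>             // Decode with the same charset that was used for serialization
>             String result = new String(out.toByteArray(), StandardCharsets.UTF_8);
>             System.out.println(result);
>         } catch (Exception e) {
>             // Report failures instead of silently swallowing them
>             e.printStackTrace();
>         }
>     }
> }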
>
>
> ----------------------------------------
>> Date: Sat, 20 Jun 2015 11:47:13 +0200
>> Subject: Re: Missing greek character of mu from doc extraction
>> From: [email protected]
>> To: [email protected]
>>
>> Hi,
>>
>> Can you provide a sample document and the java code that you are using
>> so it is easier to try to reproduce this?
>>
>> Thanks... Dominik.
>>
>> On Thu, Jun 4, 2015 at 10:19 AM, teressa kim <[email protected]> 
>> wrote:
>>> Hi
>>>
>>> I have observed that the third Greek mu character "μ" in a Word .doc file
>>> is not extracted when converting to an HTML file using the
>>> WordToHtmlConverter class. The mu character is
>>> http://www.scarfboy.com/coding/unicode-tool?s=U%2BF06D
>>>
>>> Further, I also noticed that when I applied the following statement
>>> to the mu character, I got "0028", which I think is the code for "("
>>> (left bracket).
>>>
>>> String hexCode =
>>>     Integer.toHexString(paragraph.text().codePointAt(index)).toUpperCase();
>>>
>>> Could you please help me extract this mu character from the .doc
>>> document?
>>>
>>> Thanks
>>> T.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>

