Hi, sorry, but there is no attachment for me. Can you resend it?
Thanks... Dominik.

On Thu, Jun 25, 2015 at 10:42 AM, teressa kim <[email protected]> wrote:
> Hi Dominik
>
> This is my Java code, and I enclose a Word document for you to have a look.
> There are three symbols for Greek mu, and the one in the first line next to
> "5" is not converted into HTML.
> It is missing. The other two symbols are fine.
>
> Thank you
> Teresa.
>
>
> import java.io.ByteArrayOutputStream;
> import java.io.FileInputStream;
> import java.nio.charset.StandardCharsets;
>
> import javax.xml.parsers.DocumentBuilderFactory;
> import javax.xml.transform.OutputKeys;
> import javax.xml.transform.Transformer;
> import javax.xml.transform.TransformerFactory;
> import javax.xml.transform.dom.DOMSource;
> import javax.xml.transform.stream.StreamResult;
>
> import org.apache.poi.hwpf.HWPFDocumentCore;
> import org.apache.poi.hwpf.converter.WordToHtmlConverter;
> import org.apache.poi.hwpf.converter.WordToHtmlUtils;
> import org.w3c.dom.Document;
>
> public class TestWordtoHtmlConverter {
>
>     public static void main(String[] args) {
>         try {
>             // Load the .doc file given as the first command-line argument
>             HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(
>                     new FileInputStream(args[0]));
>
>             WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
>                     DocumentBuilderFactory.newInstance().newDocumentBuilder()
>                             .newDocument());
>
>             wordToHtmlConverter.processDocument(wordDocument);
>             Document htmlDocument = wordToHtmlConverter.getDocument();
>             ByteArrayOutputStream out = new ByteArrayOutputStream();
>             DOMSource domSource = new DOMSource(htmlDocument);
>             StreamResult streamResult = new StreamResult(out);
>
>             // Serialize the DOM as indented UTF-8 HTML
>             TransformerFactory tf = TransformerFactory.newInstance();
>             Transformer serializer = tf.newTransformer();
>             serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
>             serializer.setOutputProperty(OutputKeys.INDENT, "yes");
>             serializer.setOutputProperty(OutputKeys.METHOD, "html");
>             serializer.transform(domSource, streamResult);
>             out.close();
>
>             // Decode with the same charset the serializer wrote
>             String result = new String(out.toByteArray(), StandardCharsets.UTF_8);
>             System.out.println(result);
>         } catch (Exception e) {
>             // Report failures instead of swallowing them silently
>             e.printStackTrace();
>         }
>     }
> }
>
>
> ----------------------------------------
>> Date: Sat, 20 Jun 2015 11:47:13 +0200
>> Subject: Re: Missing greek character of mu from doc extraction
>> From: [email protected]
>> To: [email protected]
>>
>> Hi,
>>
>> Can you provide a sample document and the java code that you are using
>> so it is easier to try to reproduce this?
>>
>> Thanks... Dominik.
>>
>> On Thu, Jun 4, 2015 at 10:19 AM, teressa kim <[email protected]>
>> wrote:
>>> Hi
>>>
>>> I have observed that the third Greek character mu "μ" in a Word doc file
>>> is not extracted when converting to an HTML file using the
>>> WordToHtmlConverter class. The mu character is
>>> http://www.scarfboy.com/coding/unicode-tool?s=U%2BF06D
>>>
>>> Further, I also noticed that when I applied the following statement
>>> to the mu character, I got "0028", which I think is the code for the "("
>>> left bracket.
>>>
>>> String hexCode =
>>> Integer.toHexString(paragraph.text().codePointAt(index)).toUpperCase();
>>>
>>> Could you please help me extract this mu character from the doc
>>> document?
>>>
>>> Thanks
>>> T.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
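
For anyone trying to reproduce the report quoted above, it may help to dump every code point of the paragraph text rather than a single index, since U+F06D lies in the Unicode Private Use Area and is not a regular Greek letter. The sketch below is only an illustration: the class and method names are made up, the sample string is hypothetical, and the idea that Word stores a Symbol-font glyph as 0xF000 plus the font's byte code is an assumption, not something confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointDump {

    /** True if the code point lies in the BMP Private Use Area (U+E000..U+F8FF). */
    public static boolean isPrivateUse(int codePoint) {
        return codePoint >= 0xE000 && codePoint <= 0xF8FF;
    }

    /** Lists every code point of the text as upper-case hex, tagging private-use ones. */
    public static List<String> dump(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String hex = String.format("U+%04X", cp);
            out.add(isPrivateUse(cp) ? hex + " (private use)" : hex);
            // Advance by one or two chars so surrogate pairs are not split
            i += Character.charCount(cp);
        }
        return out;
    }

    /**
     * ASSUMPTION: a Symbol-font glyph is stored as 0xF000 + the font's byte code,
     * so U+F06D would correspond to Symbol byte 0x6D.
     * Returns -1 for code points outside U+F020..U+F0FF.
     */
    public static int symbolByte(int codePoint) {
        return (codePoint >= 0xF020 && codePoint <= 0xF0FF) ? codePoint - 0xF000 : -1;
    }

    public static void main(String[] args) {
        // Hypothetical sample: "5", the private-use character from the report, then "m"
        String sample = "5\uF06Dm";
        System.out.println(dump(sample));
        System.out.println(Integer.toHexString(symbolByte(0xF06D)).toUpperCase());
    }
}
```

If the assumption holds, byte 0x6D is the position that, as far as I know, the Symbol font uses for the mu glyph, so a converter would need to translate it to U+03BC (or U+00B5) explicitly. Running the same dump over paragraph.text() should also show whether the index passed to codePointAt really points at the mu or at a neighbouring "(".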
