HWPF image extraction problem

Vasko Gjurovski Wed, 11 Nov 2009 03:57:04 -0800

Hi all,

I am trying to extract the full contents of a .doc document. Text, tables
and bullets work well, but I am stuck on the images. AFAIK the images in an
MS Word file are stored as .emz, which is a gzipped EMF file. This is my
code:



        // Needs java.io.*, java.util.List, java.util.zip.GZIPInputStream
        // and org.apache.poi.hwpf.usermodel.Picture.
        List picList = picTable.getAllPictures();
        Picture picture = (Picture) picList.get(picC);
        String folderPath = PATH;
        String baseName = folderPath + picture.suggestFullFileName();

        // Step 1: dump the picture bytes to disk as a supposed .emz file.
        String emzPath = baseName + ".emz";
        OutputStream image = new FileOutputStream(emzPath);
        picture.writeImageContent(image);
        image.close();

        // Step 2: gunzip the .emz into an .emf file.
        InputStream is = new FileInputStream(new File(emzPath));
        GZIPInputStream gzipis = new GZIPInputStream(is);
        OutputStream emfos = new FileOutputStream(new File(baseName + ".emf"));
        byte[] buf = new byte[1024];
        int len;
        while ((len = gzipis.read(buf)) != -1) {
          emfos.write(buf, 0, len);
        }
        gzipis.close();
        emfos.close();

This should extract the EMF image from the .emz. However, my code fails
because gzipis (the supposed gzip InputStream) is not gzip data at all! It
seems that the extracted image is not an .emz file. I tried another approach,
saving the Word file as HTML (which stores the images in a separate folder),
and got the images as .emz and .gif. The .emz file from that export and the
one from my extraction differ in size, which suggests that my extraction is
wrong? I have been able to open the .emz file from the HTML export with gzip,
but not my extracted file; for that one gzip complains that it is not a valid
gzip file.
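
To narrow this down, the first bytes of each file can be compared: a gzip
stream always starts with the two magic bytes 0x1f 0x8b. The following is
only a minimal sketch of such a check. The class and method names (GzipCheck,
looksLikeGzip, fileLooksLikeGzip) are hypothetical helpers made up for
illustration and are not part of POI.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of POI): tests for the gzip magic bytes.
class GzipCheck {

    // True if the buffer starts with the gzip magic bytes 0x1f 0x8b.
    static boolean looksLikeGzip(byte[] header) {
        return header != null
                && header.length >= 2
                && (header[0] & 0xff) == 0x1f
                && (header[1] & 0xff) == 0x8b;
    }

    // Reads the first two bytes of the file at the given path and tests them.
    static boolean fileLooksLikeGzip(String path) throws IOException {
        try (InputStream in = new FileInputStream(path)) {
            int b0 = in.read(); // -1 if the file is empty
            int b1 = in.read(); // -1 if the file has fewer than two bytes
            return b0 == 0x1f && b1 == 0x8b;
        }
    }
}
```

Calling GzipCheck.fileLooksLikeGzip(emzPath) before constructing the
GZIPInputStream would show whether the file written by writeImageContent is
gzip data at all, without relying on the exception.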

Any help with this?

Best regards,
Vasko
