Extract MsOffice properties
---------------------------

                 Key: TIKA-35
                 URL: https://issues.apache.org/jira/browse/TIKA-35
             Project: Tika
          Issue Type: Improvement
    Affects Versions: 0.1-incubator
            Reporter: Rida Benjelloun
             Fix For: 0.1-incubator
         Attachments: tika35.patch

Hi,
I have developed a patch that allows MsOffice properties extraction. I wasn't 
able to extract the MsOffice properties and full text from a single 
inputstream, I always get this error : java.io.IOException Source code of 
java.io.IOException: Unable to read entire header; -1 bytes read;
expected 512 bytes. 
I don't know how they make it work in Nutch (any ideas ?).
To get it work, I have added "filePath" variable in the parser class, and I 
populate it from ParseUtils class. After that I create an inputStream from 
filePath or Url and I use it to extract properties and I use the default 
inputstream to extract full text.
I didn't commit this modification; I would like to have your opinions before.
Regards.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to