Rida -

Some InputStream implementations support mark and reset (check
markSupported()).  With these, you can set a mark, read ahead, and then
return to the marked position.  We may want to use that where possible if
it looks more economical.  In other cases, we could save the stream's
bytes in a temporary file.
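
A rough sketch of that idea follows, to make the two paths concrete. It is
not Tika code: the TwoPassReader class, the Pass callback, and the markLimit
parameter are hypothetical names made up for illustration. It assumes
neither pass closes the stream and that the first pass reads no more than
markLimit bytes; otherwise reset() may fail with an IOException.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: runs two passes over the same stream content. */
final class TwoPassReader {

    /** Hypothetical callback representing one pass (e.g. properties, then text). */
    interface Pass {
        void read(InputStream in) throws IOException;
    }

    static void readTwice(InputStream in, int markLimit, Pass first, Pass second)
            throws IOException {
        if (in.markSupported()) {
            // Cheap path: remember the current position, rewind after pass one.
            in.mark(markLimit);
            first.read(in);
            in.reset();
            second.read(in);
        } else {
            // Fallback: spool the bytes to a temporary file and open it twice.
            Path tmp = Files.createTempFile("stream", ".tmp");
            try {
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                try (InputStream a = Files.newInputStream(tmp)) {
                    first.read(a);
                }
                try (InputStream b = Files.newInputStream(tmp)) {
                    second.read(b);
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        }
    }
}
```

Wrapping the input in a BufferedInputStream would also give mark support,
but it buffers up to markLimit bytes in memory, so the temporary file is
probably the safer choice for large documents.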

I think Jukka has put a lot of thought into this issue already, such as in
this message:

http://www.nabble.com/Tika-pipelines-%28was%3A-Tika-discussions-in-Amsterdam%29-tf3691029.html#a12882886

- Keith



JIRA [EMAIL PROTECTED] wrote:
> 
> 
>     [
> https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530825
> ] 
> 
> Rida Benjelloun commented on TIKA-35:
> -------------------------------------
> 
> Hi Keith,
> I like the idea of saving the content of the stream during the first pass.
> Thanks
> 
> 
>> Extract MsOffice properties
>> ---------------------------
>>
>>                 Key: TIKA-35
>>                 URL: https://issues.apache.org/jira/browse/TIKA-35
>>             Project: Tika
>>          Issue Type: Improvement
>>    Affects Versions: 0.1-incubator
>>            Reporter: Rida Benjelloun
>>             Fix For: 0.1-incubator
>>
>>         Attachments: tika35.patch, tika35.patch
>>
>>
>> Hi,
>> I have developed a patch that allows MsOffice properties extraction. I
>> wasn't able to extract the MsOffice properties and full text from a
>> single InputStream; I always get this error:
>> java.io.IOException: Unable to read entire header; -1 bytes read;
>> expected 512 bytes.
>> I don't know how they make it work in Nutch (any ideas?).
>> To get it to work, I have added a "filePath" variable to the parser
>> class, and I populate it from the ParseUtils class. After that I create
>> an InputStream from the filePath or URL and use it to extract
>> properties, and I use the default InputStream to extract full text.
>> I didn't commit this modification; I would like to have your opinions
>> first.
>> Regards.
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/-jira--Created%3A-%28TIKA-35%29-Extract-MsOffice-properties-tf4529774.html#a12929882
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
