Re: [jira] Commented: (TIKA-35) Extract MsOffice properties

kbennett Fri, 28 Sep 2007 08:32:35 -0700

Rida & All -

I just did some research on mark and reset and found that, IMO, it will
not help us.  It is true that we could wrap every stream in a
BufferedInputStream (or a BufferedReader for character data), which is
guaranteed to support mark and reset.  However, when mark() is called, it
requires a parameter specifying the read-ahead limit (the number of bytes or
characters to save for a possible reset() call).  Since we are dealing with
documents of arbitrary length, it would not be practical to rely on this.
The saved data is held in memory, so even when we implement chunking, we
would still have a memory limitation regarding document size.
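
To illustrate the read-ahead limit, here is a minimal sketch, assuming a
java.io.BufferedInputStream over an in-memory stand-in document.  The class
name MarkLimitDemo, the method canResetAfter, and the buffer and document
sizes are made up for the example.  It marks the start of the stream, reads
some bytes, and then checks whether reset() still works:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of Tika.
final class MarkLimitDemo {

    /**
     * Marks the start of a stream with the given read-ahead limit, reads
     * bytesToRead bytes, then tries to reset().  Returns true if the reset
     * succeeded and the first byte could be read again.
     */
    static boolean canResetAfter(int readLimit, int bytesToRead) {
        byte[] document = new byte[1024];   // stand-in for a document of arbitrary length
        document[0] = 42;                   // recognizable first byte
        // A deliberately tiny internal buffer (8 bytes), so that the
        // read-ahead limit, not the buffer size, decides the outcome.
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(document), 8)) {
            in.mark(readLimit);
            for (int i = 0; i < bytesToRead; i++) {
                in.read();
            }
            try {
                in.reset();
            } catch (IOException markInvalidated) {
                return false;               // read past the limit, so the mark was dropped
            }
            return in.read() == 42;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("reset after reading within the limit: " + canResetAfter(64, 32));
        System.out.println("reset after reading beyond the limit: " + canResetAfter(64, 512));
    }
}
```

To make reset() safe for a whole document, the limit passed to mark() would
have to be at least the document size, and that much data would then sit in
memory.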


Unless we can reuse the resource identifier (file, URL, etc.) for multiple
passes, I think we'll have to save the stream in a temporary file when we
read it the first time, and then read it from that file on subsequent
passes.  I suppose that's something each parser implementation would decide
for itself.  This, of course, would not remove a size limitation, but it
would change it to be the amount of usable disk space rather than memory.
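
A rough sketch of the temporary-file idea, using java.nio.file.Files.  The
class name TempFileSpool, the method newPass, and the file prefix and suffix
are assumptions for illustration only, not an existing Tika API.  The stream
is copied to disk once, and each later pass opens a fresh stream on that
file:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper; not part of Tika.
final class TempFileSpool implements AutoCloseable {
    private final Path file;

    /** Consumes the source stream once and saves its bytes in a temporary file. */
    TempFileSpool(InputStream source) throws IOException {
        file = Files.createTempFile("spool-", ".tmp");   // made-up prefix and suffix
        Files.copy(source, file, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Opens an independent stream over the saved bytes, one per pass. */
    InputStream newPass() throws IOException {
        return Files.newInputStream(file);
    }

    /** Deletes the temporary file once all passes are done. */
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(file);
    }

    public static void main(String[] args) throws IOException {
        InputStream original =
                new ByteArrayInputStream("example document".getBytes(StandardCharsets.UTF_8));
        try (TempFileSpool spool = new TempFileSpool(original)) {
            // E.g. one pass for properties and one for full text.
            try (InputStream propertiesPass = spool.newPass();
                 InputStream textPass = spool.newPass()) {
                System.out.println(propertiesPass.readAllBytes().length);
                System.out.println(textPass.readAllBytes().length);
            }
        }
    }
}
```

Memory use stays bounded by the copy buffer; the document size is limited
only by free disk space.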

It would be nice if there were some implementation of BufferedReader that
used disk instead of memory if the readaheadLimit exceeded a threshold.  If
not, we may need to write our own.  On the other hand, I'm sure we're not
the first people to encounter this problem; I wonder if there are better
solutions out there already.
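
For what it's worth, here is one way the threshold idea could look.  This is
only a sketch and not a drop-in BufferedReader replacement: it copies the
whole stream up front, keeping it in memory while it is small and moving it
to a temporary file once it grows past a threshold.  The names SpillingCopy,
newPass, and isOnDisk are hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of Tika or the JDK.
final class SpillingCopy implements AutoCloseable {
    private byte[] inMemory;   // set when the content stayed within the threshold
    private Path onDisk;       // set when the content was spilled to a temporary file

    SpillingCopy(InputStream source, int thresholdBytes) throws IOException {
        ByteArrayOutputStream memory = new ByteArrayOutputStream();
        OutputStream sink = memory;
        byte[] chunk = new byte[8192];
        long total = 0;
        try {
            int n;
            while ((n = source.read(chunk)) != -1) {
                total += n;
                if (onDisk == null && total > thresholdBytes) {
                    // Threshold exceeded: move what we have so far to disk
                    // and keep writing there.
                    onDisk = Files.createTempFile("spill-", ".tmp");
                    sink = Files.newOutputStream(onDisk);
                    memory.writeTo(sink);
                }
                sink.write(chunk, 0, n);
            }
        } finally {
            sink.close();
        }
        if (onDisk == null) {
            inMemory = memory.toByteArray();
        }
    }

    boolean isOnDisk() {
        return onDisk != null;
    }

    /** Opens a fresh stream over the saved content, from memory or from disk. */
    InputStream newPass() throws IOException {
        return onDisk != null
                ? Files.newInputStream(onDisk)
                : new ByteArrayInputStream(inMemory);
    }

    @Override
    public void close() throws IOException {
        if (onDisk != null) {
            Files.deleteIfExists(onDisk);
        }
    }
}
```

Small documents would never touch the disk, and large ones would cost at most
the threshold in memory while being copied.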

- Keith







kbennett wrote:
> 
> Rida -
> 
> Some InputStream implementations support mark and reset.  Using this,
> you can set a mark and then go back to it.  We may want to use that where
> possible if it looks like it's more economical to do so.  Then, in other
> cases, we could save the stream's bytes in a temporary file.
> 
> I think Jukka has put a lot of thought into this issue already, such as in
> this message:
> 
> http://www.nabble.com/Tika-pipelines-%28was%3A-Tika-discussions-in-Amsterdam%29-tf3691029.html#a12882886
> 
> - Keith
> 
> 
> 
> JIRA [EMAIL PROTECTED] wrote:
>> 
>> 
>>     [
>> https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530825
>> ] 
>> 
>> Rida Benjelloun commented on TIKA-35:
>> -------------------------------------
>> 
>> Hi Keith,
>> I like the idea to save the content of the stream during the first pass. 
>> Thanks
>> 
>> 
>>> Extract MsOffice properties
>>> ---------------------------
>>>
>>>                 Key: TIKA-35
>>>                 URL: https://issues.apache.org/jira/browse/TIKA-35
>>>             Project: Tika
>>>          Issue Type: Improvement
>>>    Affects Versions: 0.1-incubator
>>>            Reporter: Rida Benjelloun
>>>             Fix For: 0.1-incubator
>>>
>>>         Attachments: tika35.patch, tika35.patch
>>>
>>>
>>> Hi,
>>> I have developed a patch that allows MsOffice properties extraction. I
>>> wasn't able to extract the MsOffice properties and full text from a
>>> single inputstream; I always get this error: java.io.IOException:
>>> Unable to read entire header; -1 bytes read; expected 512 bytes.
>>> I don't know how they make it work in Nutch (any ideas ?).
>>> To get it to work, I have added a "filePath" variable in the parser
>>> class, and I populate it from the ParseUtils class. After that I create
>>> an inputStream from filePath or Url and use it to extract properties,
>>> and I use the default inputstream to extract full text.
>>> I didn't commit this modification; I would like to have your opinions
>>> before.
>>> Regards.
>> 
>> -- 
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>> 
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/-jira--Created%3A-%28TIKA-35%29-Extract-MsOffice-properties-tf4529774.html#a12942832
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
