[ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531901 ]
Keith R. Bennett commented on TIKA-35:
--------------------------------------
Rida -
You're welcome. This class is functional, but not 100% robust or complete. In
particular, if you call rewind() before reaching end of stream on the first
pass, only those bytes already read will be saved to the buffer (memory or
disk). So if the first user of the stream may not read the whole stream, I'd
suggest forcing the initial pass to read the whole stream by doing something
like:
// Instantiate it with your stream and a memory threshold:
RereadableInputStream stream = new RereadableInputStream(aStream, 1024 * 1024);
// Force reading the entire stream to place it in storage for subsequent passes:
while (stream.read() != -1) {
    // empty loop
}
// Rewind the stream so that the next use of the stream will begin at the
// beginning of the stream, and read from the stored copy:
stream.rewind();
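
To make the caveat concrete, here is a minimal, memory-only sketch of the behavior described above. It is not the attached RereadableInputStream: the names SimpleRereadableStream, drainAndRewind(), countRemaining() and RewindDemo are hypothetical, there is no memory threshold or spill to disk, and it simply assumes that after rewind() all reads come from the stored copy.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical, memory-only illustration of the rewind caveat.
 * NOT the RereadableInputStream attached to TIKA-35.
 */
class SimpleRereadableStream extends InputStream {
    private final InputStream source;   // original stream, used on the first pass only
    private final ByteArrayOutputStream store = new ByteArrayOutputStream();
    private InputStream replay;         // non-null once rewind() has been called

    SimpleRereadableStream(InputStream source) {
        this.source = source;
    }

    @Override
    public int read() throws IOException {
        if (replay != null) {
            return replay.read();       // later passes: stored copy only
        }
        int b = source.read();
        if (b != -1) {
            store.write(b);             // first pass: remember each byte read
        }
        return b;
    }

    /** Restart from the stored copy; bytes never read from the source are not in it. */
    void rewind() {
        replay = new ByteArrayInputStream(store.toByteArray());
    }

    /** Read to end of stream, then rewind, so the stored copy is complete. */
    void drainAndRewind() throws IOException {
        while (read() != -1) {
            // empty loop: read() stores each byte as a side effect
        }
        rewind();
    }

    /** Count the bytes left in the current pass. */
    int countRemaining() throws IOException {
        int n = 0;
        while (read() != -1) {
            n++;
        }
        return n;
    }
}

class RewindDemo {
    public static void main(String[] args) throws IOException {
        byte[] data = {10, 20, 30, 40, 50};

        // Rewind after reading only 2 of 5 bytes: the second pass sees 2 bytes.
        SimpleRereadableStream partial =
                new SimpleRereadableStream(new ByteArrayInputStream(data));
        partial.read();
        partial.read();
        partial.rewind();
        System.out.println(partial.countRemaining()); // prints 2

        // Drain before rewinding: the second pass sees all 5 bytes.
        SimpleRereadableStream full =
                new SimpleRereadableStream(new ByteArrayInputStream(data));
        full.read();
        full.drainAndRewind();
        System.out.println(full.countRemaining());    // prints 5
    }
}
```

Wrapping the drain loop and the rewind in one helper, as drainAndRewind() does here, would let the first consumer stop early without truncating what later passes see.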
- Keith
> Extract MsOffice properties
> ---------------------------
>
> Key: TIKA-35
> URL: https://issues.apache.org/jira/browse/TIKA-35
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 0.1-incubator
> Reporter: Rida Benjelloun
> Assignee: Rida Benjelloun
> Fix For: 0.1-incubator
>
> Attachments: RereadableInputStream.java,
> RereadableInputStreamTest.java, tika35.patch, tika35.patch
>
>
> Hi,
> I have developed a patch that allows MsOffice properties extraction. I wasn't
> able to extract the MsOffice properties and full text from a single
> InputStream; I always get this error:
> java.io.IOException: Unable to read entire header; -1 bytes read;
> expected 512 bytes.
> I don't know how they make it work in Nutch (any ideas?).
> To get it to work, I have added a "filePath" variable to the parser class,
> and I populate it from the ParseUtils class. After that I create an
> InputStream from the file path or URL and use it to extract the properties,
> and I use the default InputStream to extract the full text.
> I didn't commit this modification; I would like to have your opinions first.
> Regards.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.