[
https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532000
]
Chris A. Mattmann commented on TIKA-35:
---------------------------------------
Hi Folks:
// Instantiate it with your stream and a memory threshold:
RereadableInputStream stream = new RereadableInputStream(aStream, 1024 * 1024);
// Force reading entire stream to place it in storage for subsequent passes:
while (stream.read() != -1) {
    // empty loop
}
Why not use the approach suggested by Keith above and wrap the rewind method
with a check for whether the stream has reached end of stream? We could have
RereadableInputStream take an optional flag, let's call it
"forceSeekOnRewind". By default it would be false, but there could be a
method to set it to true, e.g., "enableForceSeekOnRewind()". Then the rewind
method would first do something like:
if (forceSeekOnRewind) {
    while (read() != -1) {
        // empty loop
    }
    doRewind(); /* does the actual rewind work */
} else {
    if (EOF()) {
        /* at EOF, so go ahead and rewind */
        doRewind();
    }
    /* else do nothing */
}
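
To make the idea concrete, here is a rough, self-contained sketch of that
rewind logic. It is not the attached RereadableInputStream: the class name
RewindSketchStream is made up, it keeps everything in memory instead of
spilling to a file past a threshold, and rewind() returns a boolean (my own
addition) so the caller can tell whether the rewind actually happened.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch only; not the RereadableInputStream attached to TIKA-35. */
class RewindSketchStream extends InputStream {
    // Bytes seen on the first pass (assumption: small enough to keep in memory).
    private final ByteArrayOutputStream store = new ByteArrayOutputStream();
    private InputStream current;            // stream currently served to the caller
    private boolean firstPass = true;       // true while still reading the original stream
    private boolean atEof = false;          // set once read() has returned -1
    private boolean forceSeekOnRewind = false;

    RewindSketchStream(InputStream source) {
        this.current = source;
    }

    /** Makes rewind() drain the rest of the stream before rewinding. */
    void enableForceSeekOnRewind() {
        forceSeekOnRewind = true;
    }

    @Override
    public int read() throws IOException {
        int b = current.read();
        if (b == -1) {
            atEof = true;
        } else if (firstPass) {
            store.write(b);                 // remember the byte for later passes
        }
        return b;
    }

    /** Returns true if the stream was rewound, false if nothing was done. */
    boolean rewind() throws IOException {
        if (forceSeekOnRewind) {
            while (read() != -1) {
                // empty loop: read to the end so the store is complete
            }
        } else if (!atEof) {
            return false;                   // not at EOF and not forced: do nothing
        }
        doRewind();
        return true;
    }

    /** Does the actual rewind work: replay the stored bytes from the start. */
    private void doRewind() {
        firstPass = false;
        atEof = false;
        current = new ByteArrayInputStream(store.toByteArray());
    }
}
```

With the flag off, a rewind() before end of stream is a no-op; after
enableForceSeekOnRewind(), the same call reads to the end first and the next
pass starts again from the first byte.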
Cheers,
Chris
> Extract MsOffice properties
> ---------------------------
>
> Key: TIKA-35
> URL: https://issues.apache.org/jira/browse/TIKA-35
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 0.1-incubator
> Reporter: Rida Benjelloun
> Assignee: Rida Benjelloun
> Fix For: 0.1-incubator
>
> Attachments: RereadableInputStream.java,
> RereadableInputStreamTest.java, tika35.patch, tika35.patch
>
>
> Hi,
> I have developed a patch that allows MsOffice properties extraction. I wasn't
> able to extract the MsOffice properties and full text from a single
> inputstream, I always get this error:
> java.io.IOException: Unable to read entire header; -1 bytes read;
> expected 512 bytes.
> I don't know how they make it work in Nutch (any ideas?).
> To get it to work, I have added a "filePath" variable in the parser class, and I
> populate it from the ParseUtils class. After that I create an inputStream from
> the filePath or URL and use it to extract properties, and I use the default
> inputstream to extract full text.
> I didn't commit this modification; I would like to have your opinions first.
> Regards.