[ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Keith R. Bennett updated TIKA-35:
---------------------------------
Attachment: RereadableInputStreamTest.java
            RereadableInputStream.java
Attached are a first pass at a rereadable stream class and a basic unit test
showing that it works (basically ;)).
The stream class wraps the document's input stream and saves its content as
the wrapped stream is read.
It supports a memory threshold: if the total size read is no more than this
threshold, the data is stored in a byte[], and subsequent rereads of the
stream are served from a ByteArrayInputStream. If the total size exceeds the
threshold, the data is stored in a File, and subsequent passes read from a
buffered FileInputStream.
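To make the threshold behavior concrete, here is a minimal sketch of the storage decision. It is not the attached RereadableInputStream: the class name SpillingCopy, its methods, and the parameter maxBytesInMemory are hypothetical, and unlike the attached class it copies the whole source eagerly rather than saving bytes as the caller reads them.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical illustration only; names are assumptions, not the attachment's API.
class SpillingCopy {
    private final byte[] memory; // non-null when the data fit under the threshold
    private final File file;     // non-null when the data spilled to a temp file

    private SpillingCopy(byte[] memory, File file) {
        this.memory = memory;
        this.file = file;
    }

    // Copies the whole stream, keeping it in memory unless it exceeds the threshold.
    static SpillingCopy read(InputStream in, int maxBytesInMemory) throws IOException {
        ByteArrayOutputStream heap = new ByteArrayOutputStream();
        OutputStream spill = null;
        File tempFile = null;
        byte[] chunk = new byte[4096];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                if (spill == null && heap.size() + n > maxBytesInMemory) {
                    // Threshold exceeded: move what we have to a temp file.
                    tempFile = File.createTempFile("spill-", ".tmp");
                    tempFile.deleteOnExit();
                    spill = new FileOutputStream(tempFile);
                    heap.writeTo(spill);
                    heap.reset();
                }
                if (spill != null) {
                    spill.write(chunk, 0, n);
                } else {
                    heap.write(chunk, 0, n);
                }
            }
        } finally {
            if (spill != null) {
                spill.close();
            }
        }
        return tempFile == null
                ? new SpillingCopy(heap.toByteArray(), null)
                : new SpillingCopy(null, tempFile);
    }

    boolean inMemory() {
        return memory != null;
    }

    // Each call returns a fresh stream positioned at the start of the saved data.
    InputStream open() throws IOException {
        return inMemory()
                ? new ByteArrayInputStream(memory)
                : new BufferedInputStream(new FileInputStream(file));
    }
}
```

Data whose size is exactly equal to the threshold stays in memory here, matching the "no more than" wording above.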
If you place these files in src/main/java/org/apache/tika/utils and
src/test/java/org/apache/tika/utils, you should be able to compile them and run
the test.
Rereading the stream is accomplished by calling rewind(). Currently rewind()
closes the input stream originally passed, but we may want to change that.
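For discussion, the rewind() behavior could look roughly like the sketch below. It is a memory-only illustration with hypothetical names (RewindableSketch is not the attached class) and makes one assumption the description above does not state: on the first rewind() it drains any unread bytes from the original stream so that the saved copy is complete before the original is closed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; memory-only, no file spill-over.
class RewindableSketch extends InputStream {
    private InputStream current;                 // stream currently being read
    private ByteArrayOutputStream saved = new ByteArrayOutputStream();
    private boolean firstPass = true;            // true while reading the original
    private byte[] data;                         // saved content after first rewind

    RewindableSketch(InputStream original) {
        this.current = original;
    }

    @Override
    public int read() throws IOException {
        int b = current.read();
        if (b != -1 && firstPass) {
            saved.write(b);                      // record bytes on the first pass
        }
        return b;
    }

    // Repositions the stream at the beginning of the content.
    void rewind() throws IOException {
        if (firstPass) {
            // Assumption of this sketch: drain unread bytes so the copy is complete.
            int b;
            while ((b = current.read()) != -1) {
                saved.write(b);
            }
            current.close();                     // closes the originally passed stream
            data = saved.toByteArray();
            saved = null;
            firstPass = false;
        }
        current = new ByteArrayInputStream(data);
    }

    @Override
    public void close() throws IOException {
        current.close();
    }
}
```

If we decide rewind() should not close the original stream, the current.close() call is the only line that would need to change in a design like this.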
> Extract MsOffice properties
> ---------------------------
>
> Key: TIKA-35
> URL: https://issues.apache.org/jira/browse/TIKA-35
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 0.1-incubator
> Reporter: Rida Benjelloun
> Assignee: Rida Benjelloun
> Fix For: 0.1-incubator
>
> Attachments: RereadableInputStream.java,
> RereadableInputStreamTest.java, tika35.patch, tika35.patch
>
>
> Hi,
> I have developed a patch that allows extraction of MsOffice properties. I
> wasn't able to extract both the MsOffice properties and the full text from a
> single InputStream; I always get this error:
> java.io.IOException: Unable to read entire header; -1 bytes read;
> expected 512 bytes.
> I don't know how they make it work in Nutch (any ideas?).
> To get it to work, I added a "filePath" variable to the parser class, and I
> populate it from the ParseUtils class. I then create a second InputStream from
> the filePath or URL and use it to extract the properties, and I use the default
> InputStream to extract the full text.
> I haven't committed this modification; I would like to have your opinions first.
> Regards.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.