Rida & All -

I did some research on mark and reset and, IMO, it will not help us. It is true that we could wrap every stream in a BufferedInputStream (or a BufferedReader for character data), both of which are guaranteed to support mark() and reset(). However, mark() takes a parameter specifying the read-ahead limit: the number of bytes (or characters, for a Reader) that must be retained for a possible reset() call. Since we are dealing with documents of arbitrary length, it would not be practical to rely on this. The retained data is held in memory, so even once we implement chunking, document size would still be limited by available memory.

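To make the limitation concrete, here is a minimal sketch of a two-pass read using mark() and reset() on a BufferedInputStream. The class and method names (MarkResetSketch, peek, readUpTo) are made up for illustration and are not part of Tika. The point is that peek() can only look as far ahead as the limit passed to mark(), and the stream may buffer that many bytes in memory.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only (not Tika code).
class MarkResetSketch {

    // Reads up to n bytes, stopping early only at end of stream.
    static byte[] readUpTo(InputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int off = 0;
        while (off < n) {
            int r = in.read(buf, off, n - off);
            if (r == -1) {
                break;
            }
            off += r;
        }
        return Arrays.copyOf(buf, off);
    }

    // First pass: look at the first n bytes, then rewind to the mark.
    // The read limit passed to mark() must cover everything read before
    // reset(), and the stream may hold that many bytes in memory.
    static byte[] peek(BufferedInputStream in, int n) throws IOException {
        in.mark(n);
        byte[] head = readUpTo(in, n);
        in.reset();
        return head;
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "hello world".getBytes(StandardCharsets.UTF_8);
        BufferedInputStream in =
                new BufferedInputStream(new ByteArrayInputStream(doc));
        byte[] head = peek(in, 5);              // first pass over 5 bytes
        byte[] all = readUpTo(in, doc.length);  // second pass starts at the mark
        System.out.println(new String(head, StandardCharsets.UTF_8));
        System.out.println(new String(all, StandardCharsets.UTF_8));
    }
}
```

This works fine for peeking at a small header, but a full second pass over an arbitrarily large document would need a read limit as large as the document itself.
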
Unless we can reuse the resource identifier (file, URL, etc.) for multiple passes, I think we'll have to save the stream to a temporary file on the first read and then read from that file on subsequent passes. I suppose that's something each parser implementation would decide for itself. This would not remove the size limit, of course, but the limit would become the amount of usable disk space rather than memory.

It would be nice if there were an implementation of BufferedReader that spilled to disk instead of memory once readAheadLimit exceeded a threshold. If there isn't one, we may need to write our own. On the other hand, I'm sure we're not the first people to encounter this problem; I wonder if there are better solutions out there already.

- Keith


kbennett wrote:
>
> Rida -
>
> Some InputStream implementations support mark and reset. Using this,
> you can set a mark and then go back to it. We may want to use that where
> possible if it looks like it's more economical to do so. Then, in other
> cases, we could save the stream's bytes in a temporary file.
>
> I think Jukka has put a lot of thought into this issue already, such as in
> this message:
>
> http://www.nabble.com/Tika-pipelines-%28was%3A-Tika-discussions-in-Amsterdam%29-tf3691029.html#a12882886
>
> - Keith
>
>
>
> JIRA [EMAIL PROTECTED] wrote:
>>
>>
>> [
>> https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530825
>> ]
>>
>> Rida Benjelloun commented on TIKA-35:
>> -------------------------------------
>>
>> Hi Keith,
>> I like the idea to save the content of the stream during the first pass.
>> Thanks
>>
>>
>>> Extract MsOffice properties
>>> ---------------------------
>>>
>>> Key: TIKA-35
>>> URL: https://issues.apache.org/jira/browse/TIKA-35
>>> Project: Tika
>>> Issue Type: Improvement
>>> Affects Versions: 0.1-incubator
>>> Reporter: Rida Benjelloun
>>> Fix For: 0.1-incubator
>>>
>>> Attachments: tika35.patch, tika35.patch
>>>
>>>
>>> Hi,
>>> I have developed a patch that allows MsOffice properties extraction. I
>>> wasn't able to extract the MsOffice properties and full text from a
>>> single inputstream, I always get this error : java.io.IOException:
>>> Unable to read entire header; -1 bytes read; expected 512 bytes.
>>> I don't know how they make it work in Nutch (any ideas ?).
>>> To get it work, I have added "filePath" variable in the parser class,
>>> and I populate it from ParseUtils class. After that I create an
>>> inputStream from filePath or Url and I use it to extract properties and
>>> I use the default inputstream to extract full text.
>>> I didn't commit this modification; I would like to have your opinions
>>> before.
>>> Regards.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>>
>
>

--
View this message in context: http://www.nabble.com/-jira--Created%3A-%28TIKA-35%29-Extract-MsOffice-properties-tf4529774.html#a12942832
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
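
P.S. A rough sketch of the temporary-file approach discussed above, in case it helps. It assumes a parser is handed a plain InputStream it can only read once. The class name SpooledStream and the method newPass() are hypothetical and not existing Tika API. The stream is copied to disk once, and each pass then opens a fresh stream over the file, so the size limit is disk space rather than memory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, for illustration only (not Tika code).
final class SpooledStream implements AutoCloseable {

    private final Path tempFile;

    // Copies the one-shot source stream to a temporary file. Files.copy
    // streams the data in chunks, so the document is never held in memory
    // as a whole.
    SpooledStream(InputStream source) throws IOException {
        tempFile = Files.createTempFile("spooled-", ".tmp");
        try {
            // createTempFile already created the file, so allow overwrite.
            Files.copy(source, tempFile, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            Files.deleteIfExists(tempFile);
            throw e;
        }
    }

    // Opens a fresh stream positioned at the start of the saved content.
    // The caller is responsible for closing it after each pass.
    InputStream newPass() throws IOException {
        return Files.newInputStream(tempFile);
    }

    // Removes the temporary file once all passes are done.
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(tempFile);
    }
}
```

A parser could then call newPass() once to extract properties and once more to extract full text, inside a try-with-resources block so the temporary file is removed afterwards.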
