Are you initializing with a file or a stream?

On Thu, Feb 25, 2021 at 9:00 AM Peter Kronenberg <peter.kronenb...@torch.ai> wrote:
> But how is TikaInputStream allowing me to re-use the stream without me
> doing anything special? Is it automatically spooling to disk as needed?
>
> I wouldn’t say that I can’t afford to spool to disk. I’m just looking for
> the most reasonable solution. I don’t know how big the streams are that
> I’ll be processing. Obviously, if they’re big, then keeping them in memory
> is not reasonable and disk is the only option. But for smaller streams, if
> it can do it all in memory, that’s obviously better. And for my use case,
> I don’t **always** have to re-read the stream.
>
> *From:* Tim Allison <talli...@apache.org>
> *Sent:* Thursday, February 25, 2021 5:48 AM
> *To:* user@tika.apache.org
> *Cc:* lfcnas...@gmail.com
> *Subject:* Re: Re-using a TikaStream
>
> My $0.02 would be to use TikaInputStream because that gets a lot more use
> and is battle-tested. Within the last year or so, we started using
> RereadableInputStream in one of the Microsoft format parsers so it is also
> getting some use now.
>
> If you absolutely can't afford to spool to disk, then give
> RereadableInputStream a try.
>
> The inputstreamfactories, in my mind, are somewhat work-arounds for other
> use cases, e.g. retrying/batch etc.
>
> On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg <peter.kronenb...@torch.ai> wrote:
>
> So this might be moot, because it seems that TikaInputStream is already
> doing some magic and I’m not sure how.
>
> I was able to re-use the stream without doing anything special after a
> call to parse. And in fact, I displayed stream.available() and
> stream.position() before and after the call to parse, and the full stream
> was still available at position 0. What is TikaInputStream doing to make
> this happen?
>
> Just for some additional context, what I’m doing is running the file
> through Tika and then, depending on the file type, I want to do some
> additional non-Tika processing.
> I thought that once the Tika parse was done, the stream would be used up.
>
> What is going on?
>
> *From:* Peter Kronenberg <peter.kronenb...@torch.ai>
> *Sent:* Tuesday, February 23, 2021 10:00 AM
> *To:* user@tika.apache.org; lfcnas...@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
> This email was sent from outside your organisation, yet is displaying the
> name of someone from your organisation. This often happens in phishing
> attempts. Please only interact with this email if you know its source and
> that the content is safe.
>
> I just found the RereadableInputStream. This looks more like what I was
> thinking. Is there any reason not to use it? What are the Tika best
> practices? Pros/Cons of each approach? If RereadableInputStream works as
> it’s supposed to, I’m not sure I see the advantage of InputStreamFactory.
>
> *From:* Peter Kronenberg <peter.kronenb...@torch.ai>
> *Sent:* Monday, February 22, 2021 8:30 PM
> *To:* lfcnas...@gmail.com
> *Cc:* user@tika.apache.org
> *Subject:* RE: Re-using a TikaStream
>
> Oh ok. I didn’t realize I needed to write my own class to implement it. I
> was looking for some sort of existing framework.
>
> What is the purpose of the 2 InputStreamFactory classes:
>
> I was re-reading some emails with Nick Burch back around Dec 22-23 and
> maybe I misunderstood him, but it sounds like he was saying that
> TikaInputStream was smart enough to automatically spool the stream to disk
> to allow re-use.
>
> It seems to me that I need an extra pass through the data in order to save
> to disk. I’m not starting from a File, but from a stream.
> So if I need to read the stream twice, I really have to pass through the
> data 3 times, correct?
>
> Unless there is a way to save to disk during the first pass.
>
> (try/catch removed for simplicity)
>
>     tis = TikaInputStream.get(inputStream);
>     file = tis.getFile();   // <-- extra pass
>     tis = TikaInputStream.get(new MyInputStreamFactory(file));
>     // first real pass
>     InputStream is = tis.getInputStreamFactory().getInputStream();
>     // second real pass
>
> *From:* Luís Filipe Nassif <lfcnas...@gmail.com>
> *Sent:* Monday, February 22, 2021 5:42 PM
> *To:* Peter Kronenberg <peter.kronenb...@torch.ai>
> *Cc:* user@tika.apache.org
> *Subject:* Re: Re-using a TikaStream
>
> Something like:
>
>     class MyInputStreamFactory implements InputStreamFactory {
>
>         private File file;
>
>         public MyInputStreamFactory(File file) {
>             this.file = file;
>         }
>
>         // FileInputStream's constructor throws a checked exception
>         public InputStream getInputStream() throws IOException {
>             return new FileInputStream(file);
>         }
>     }
>
> in your client code:
>
>     Parser parser = new AutoDetectParser();
>     TikaInputStream tis = TikaInputStream.get(new MyInputStreamFactory(file));
>     parser.parse(tis, new ToTextContentHandler(), new Metadata(),
>             new ParseContext());
>
> when you need to reuse the stream (into your parser):
>
>     public void parse(InputStream stream, ContentHandler handler,
>             Metadata metadata, ParseContext context)
>             throws IOException, SAXException, TikaException {
>         //(...)
>         TikaInputStream tis = TikaInputStream.get(stream);
>         if (tis.hasInputStreamFactory()) {
>             try (InputStream is = tis.getInputStreamFactory().getInputStream()) {
>                 //consume the new stream
>             }
>         } else {
>             throw new IOException("not a reusable inputStream");
>         }
>     }
>
> Of course this is useful if you are not processing files, e.g. reading
> files from the cloud or sockets.
>
> Regards,
> Luis
>
> On Mon, Feb 22, 2021 at 7:18 PM Peter Kronenberg <peter.kronenb...@torch.ai> wrote:
>
> I sent this question late on Friday. Sending it again. Can you provide a
> little more information on how to use the InputStreamFactory?
>
> *From:* Peter Kronenberg <peter.kronenb...@torch.ai>
> *Sent:* Friday, February 19, 2021 5:10 PM
> *To:* user@tika.apache.org; lfcnas...@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
> There appear to be 2 InputStreamFactory classes: in tika-server-core and
> tika-io. The one in server.core is the only one with a concrete class.
>
> I’m not quite sure I see how to use this.
>
> Normally, I create a TikaInputStream with
> TikaInputStream.get(InputStream). How do I create it from an
> InputStreamFactory?
>
> TikaInputStream.getInputStreamFactory() only returns a factory if the
> TikaInputStream was created from a factory.
>
> Is there a good example of how this is used?
>
> *From:* Peter Kronenberg <peter.kronenb...@torch.ai>
> *Sent:* Friday, February 19, 2021 4:57 PM
> *To:* user@tika.apache.org; lfcnas...@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
> Thanks. I thought that TikaInputStream already automatically saved to
> disk to allow re-reading.
>
> *From:* Luís Filipe Nassif <lfcnas...@gmail.com>
> *Sent:* Friday, February 19, 2021 3:44 PM
> *To:* user@tika.apache.org
> *Subject:* Re: Re-using a TikaStream
>
> You could call TikaInputStream.getPath() at the beginning of your parser;
> it will spool to file if not file based. After consuming the original
> inputStream, create a new one from the temp file created.
>
> If you are using 2.0.0-ALPHA, there is:
>
> https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/io/InputStreamFactory.java
>
> Use with the new methods from TikaInputStream:
>
>     public static TikaInputStream get(InputStreamFactory factory)
>     public InputStreamFactory getInputStreamFactory()
>
> Hope this helps,
> Luis
>
> On Fri, Feb 19, 2021 at 4:09 PM Peter Kronenberg <peter.kronenb...@torch.ai> wrote:
>
> If I finish parsing a TikaStream, can I re-use the stream (before it is
> closed)? I know you said that there is some magic behind the scenes where
> it spools it to a file. Can I just call reset() to start from the
> beginning?
>
> Peter
>
> *Peter Kronenberg* | *Senior AI Analytic Engineer*
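
The trade-off discussed in this thread (keep small streams in memory, spool larger ones to a temp file, then open a fresh stream for every pass) can be sketched with the JDK alone. This is only an illustration under stated assumptions, not how TikaInputStream or RereadableInputStream are implemented: the class `ReusableSource`, its methods `newStream()` and `isOnDisk()`, and the `maxInMemoryBytes` threshold are hypothetical names invented for the example. It reads the source exactly once, so there is no separate "extra pass" just to save to disk. It uses `InputStream.readAllBytes()`, which needs Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of Tika): reads a stream once, then hands out fresh copies. */
final class ReusableSource implements AutoCloseable {
    private byte[] inMemory; // set when the data stayed under the threshold
    private Path spooled;    // set when the data were written to a temp file

    ReusableSource(InputStream in, int maxInMemoryBytes) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        OutputStream fileOut = null;
        byte[] chunk = new byte[8192];
        int n;
        try {
            while ((n = in.read(chunk)) != -1) {
                if (fileOut == null && buffer.size() + n > maxInMemoryBytes) {
                    // Threshold crossed: move what was buffered to disk and continue there.
                    spooled = Files.createTempFile("reusable-", ".tmp");
                    fileOut = Files.newOutputStream(spooled);
                    buffer.writeTo(fileOut);
                    buffer = null;
                }
                if (fileOut != null) {
                    fileOut.write(chunk, 0, n);
                } else {
                    buffer.write(chunk, 0, n);
                }
            }
        } finally {
            if (fileOut != null) {
                fileOut.close();
            }
        }
        if (fileOut == null) {
            inMemory = buffer.toByteArray();
        }
    }

    /** Returns a new stream positioned at the start of the data; call once per pass. */
    InputStream newStream() throws IOException {
        return inMemory != null ? new ByteArrayInputStream(inMemory) : Files.newInputStream(spooled);
    }

    boolean isOnDisk() {
        return spooled != null;
    }

    /** Deletes the temp file, if one was created. */
    @Override
    public void close() throws IOException {
        if (spooled != null) {
            Files.deleteIfExists(spooled);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello tika".getBytes(StandardCharsets.UTF_8);
        // Same input, two thresholds: one keeps it in memory, one forces a temp file.
        try (ReusableSource small = new ReusableSource(new ByteArrayInputStream(data), 1024);
             ReusableSource large = new ReusableSource(new ByteArrayInputStream(data), 4)) {
            for (ReusableSource src : new ReusableSource[] {small, large}) {
                for (int pass = 1; pass <= 2; pass++) {
                    try (InputStream is = src.newStream()) {
                        String text = new String(is.readAllBytes(), StandardCharsets.UTF_8);
                        System.out.println("onDisk=" + src.isOnDisk() + " pass " + pass + ": " + text);
                    }
                }
            }
        }
    }
}
```

With Tika itself, the first pass would be the call to parse and later passes the non-Tika processing; whether to rely on TikaInputStream's own spooling, RereadableInputStream, or an InputStreamFactory is the question the thread leaves open.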