Oh ok. I didn’t realize I needed to write my own class to implement it. I was looking for some sort of existing framework.
What is the purpose of the 2 InputStreamFactory classes?

I was re-reading some emails with Nick Burch from around Dec 22-23, and maybe I misunderstood him, but it sounded like he was saying that TikaInputStream was smart enough to automatically spool the stream to disk to allow re-use. It seems to me that I need an extra pass through the data in order to save it to disk. I’m not starting from a File, but from a stream. So if I need to read the stream twice, I really have to pass through the data 3 times, correct? Unless there is a way to save to disk during the first pass.

(try/catch removed for simplicity)

    tis = TikaInputStream.get(inputStream);
    file = tis.getFile();   // <== extra pass
    tis = TikaInputStream.get(new MyInputStreamFactory(file));
    // first real pass
    InputStream is = tis.getInputStreamFactory().getInputStream();
    // second real pass

From: Luís Filipe Nassif <lfcnas...@gmail.com>
Sent: Monday, February 22, 2021 5:42 PM
To: Peter Kronenberg <peter.kronenb...@torch.ai>
Cc: user@tika.apache.org
Subject: Re: Re-using a TikaStream

Something like:

    class MyInputStreamFactory implements InputStreamFactory {

        private File file;

        public MyInputStreamFactory(File file) {
            this.file = file;
        }

        // Each call opens a fresh stream over the same file.
        public InputStream getInputStream() throws IOException {
            return new FileInputStream(file);
        }
    }

In your client code:

    Parser parser = new AutoDetectParser();
    TikaInputStream tis = TikaInputStream.get(new MyInputStreamFactory(file));
    parser.parse(tis, new ToTextContentHandler(), new Metadata(), new ParseContext());

When you need to reuse the stream (inside your parser):

    public void parse(InputStream stream, ContentHandler handler, Metadata metadata,
                      ParseContext context) throws IOException, SAXException, TikaException {
        //(...)
        TikaInputStream tis = TikaInputStream.get(stream);
        if (tis.hasInputStreamFactory()) {
            try (InputStream is = tis.getInputStreamFactory().getInputStream()) {
                //consume the new stream
            }
        } else {
            throw new IOException("not a reusable inputStream");
        }
    }

Of course this is useful if you are not processing files, e.g. reading files from the cloud or sockets.

Regards,
Luis

On Mon, Feb 22, 2021 at 19:18, Peter Kronenberg <peter.kronenb...@torch.ai> wrote:

I sent this question late on Friday. Sending it again. Can you provide a little more information on how to use the InputStreamFactory?

From: Peter Kronenberg <peter.kronenb...@torch.ai>
Sent: Friday, February 19, 2021 5:10 PM
To: user@tika.apache.org; lfcnas...@gmail.com
Subject: RE: Re-using a TikaStream

There appear to be 2 InputStreamFactory classes: one in tika-server-core and one in tika-io. The one in server.core is the only one with a concrete class.

I’m not quite sure I see how to use this. Normally, I create a TikaInputStream with TikaInputStream.get(InputStream). How do I create it from an InputStreamFactory? TikaInputStream.getInputStreamFactory() only returns a factory if the TikaInputStream was created from a factory.

Is there a good example of how this is used?

From: Peter Kronenberg <peter.kronenb...@torch.ai>
Sent: Friday, February 19, 2021 4:57 PM
To: user@tika.apache.org; lfcnas...@gmail.com
Subject: RE: Re-using a TikaStream

Thanks. I thought that TikaInputStream already automatically saved to disk to allow re-reading.

From: Luís Filipe Nassif <lfcnas...@gmail.com>
Sent: Friday, February 19, 2021 3:44 PM
To: user@tika.apache.org
Subject: Re: Re-using a TikaStream

You could call TikaInputStream.getPath() at the beginning of your parser; it will spool to a file if the stream is not file based. After consuming the original InputStream, create a new one from the temp file that was created.

If you are using 2.0.0-ALPHA, there is:
https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/io/InputStreamFactory.java

Use it with the new methods from TikaInputStream:

    public static TikaInputStream get(InputStreamFactory factory)
    public InputStreamFactory getInputStreamFactory()

Hope this helps,
Luis

On Fri, Feb 19, 2021 at 16:09, Peter Kronenberg <peter.kronenb...@torch.ai> wrote:

If I finish parsing a TikaStream, can I re-use the stream (before it is closed)? I know you said that there is some magic behind the scenes where it spools it to a file. Can I just call reset() to start from the beginning?

Peter
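
On the pass-count question raised at the top of this thread ("unless there is a way to save to disk during the first pass"), one possible approach is to copy the bytes to a temp file while the first consumer is reading them, then reopen that file for every later pass. The following is only a rough sketch in plain java.io/java.nio, not Tika code: `SpoolOnFirstPass`, `TeeInputStream`, `ReopenableSource`, `consumeWhileSpooling`, and `reopen` are all hypothetical names, and `ReopenableSource` merely imitates the shape of a stream factory. It assumes Java 9+ (for `readAllBytes`) and assumes the first consumer reads the stream to the end.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class SpoolOnFirstPass {

    /** Hypothetical stand-in for a stream factory; NOT Tika's InputStreamFactory. */
    interface ReopenableSource {
        InputStream open() throws IOException;
    }

    /** Writes every byte the consumer reads into a side OutputStream (a "tee"). */
    static final class TeeInputStream extends FilterInputStream {
        private final OutputStream copy;

        TeeInputStream(InputStream in, OutputStream copy) {
            super(in);
            this.copy = copy;
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b != -1) {
                copy.write(b);
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                copy.write(buf, off, n);
            }
            return n;
        }

        @Override
        public boolean markSupported() {
            return false; // mark/reset would make the copy contain duplicated bytes
        }

        @Override
        public void close() throws IOException {
            try {
                super.close();
            } finally {
                copy.close(); // flushes and closes the spool file
            }
        }
    }

    /** Drains a stream into a String (UTF-8 is assumed for this demo only). */
    static String readAll(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }

    /** First pass: consume the stream and spool it to disk at the same time. */
    static String consumeWhileSpooling(InputStream in, Path spool) throws IOException {
        try (InputStream tee = new TeeInputStream(in, Files.newOutputStream(spool))) {
            return readAll(tee);
        }
    }

    /** Later passes: each call to open() returns a fresh stream over the spooled file. */
    static ReopenableSource reopen(Path spool) {
        return () -> Files.newInputStream(spool);
    }

    public static void main(String[] args) throws IOException {
        InputStream original =
                new ByteArrayInputStream("hello tika".getBytes(StandardCharsets.UTF_8));
        Path spool = Files.createTempFile("spool-", ".tmp");
        spool.toFile().deleteOnExit();

        String first = consumeWhileSpooling(original, spool); // only pass over the source
        String second;
        try (InputStream again = reopen(spool).open()) {       // reads the temp file instead
            second = readAll(again);
        }
        System.out.println(first.equals(second));
    }
}
```

With this shape the source stream is read once, and every additional pass goes to the temp file. Two limitations of the sketch: bytes passed over with skip() never reach the copy, and a consumer that stops early leaves a partial file, so spooling up front (as with getPath()) is the simpler choice when the first consumer cannot be trusted to read everything.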