Are you initializing with a file or a stream?

On Thu, Feb 25, 2021 at 9:00 AM Peter Kronenberg <peter.kronenb...@torch.ai>
wrote:

> But how is TikaInputStream allowing me to re-use the stream without me
> doing anything special?   Is it automatically spooling to disk as needed?
>
>
>
> I wouldn’t say that I can’t afford to spool to disk.  I’m just looking for
> the most reasonable solution.  I don’t know how big the streams are that
> I’ll be processing.  Obviously, if they’re big, then keeping them in memory
> is not reasonable and disk is the only option.  But for smaller streams, if
> it can do it all in memory, that’s obviously better.  And for my use case,
> I don’t **always** have to re-read the stream.
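The "memory for small inputs, disk for large ones" trade-off described above can be sketched with the JDK alone. This is only an illustration of the idea, not how TikaInputStream or RereadableInputStream is implemented; the class name ThresholdBuffer and the parameter maxInMemory are made up for this example. Note that it drains the source stream once up front.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper, not a Tika class: small inputs stay in memory, large ones spill to disk. */
final class ThresholdBuffer {
    private final byte[] memory; // set when the content fit under the threshold
    private final Path file;     // set when the content was spilled to a temp file

    private ThresholdBuffer(byte[] memory, Path file) {
        this.memory = memory;
        this.file = file;
    }

    /** Drains the stream once; maxInMemory must be smaller than Integer.MAX_VALUE. */
    static ThresholdBuffer capture(InputStream in, int maxInMemory) throws IOException {
        // Read one byte past the threshold to learn whether the input is "big".
        byte[] head = in.readNBytes(maxInMemory + 1);
        if (head.length <= maxInMemory) {
            return new ThresholdBuffer(head, null);
        }
        Path tmp = Files.createTempFile("spill-", ".tmp");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            out.write(head);    // bytes already consumed from the source
            in.transferTo(out); // the remainder goes straight to disk
        }
        return new ThresholdBuffer(null, tmp);
    }

    boolean inMemory() {
        return memory != null;
    }

    /** Each call returns a fresh stream positioned at byte 0. */
    InputStream newStream() throws IOException {
        return memory != null ? new ByteArrayInputStream(memory) : Files.newInputStream(file);
    }

    /** Deletes the temp file, if one was created. */
    void dispose() throws IOException {
        if (file != null) {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello tika".getBytes(StandardCharsets.UTF_8);
        ThresholdBuffer small = capture(new ByteArrayInputStream(data), 1024);
        ThresholdBuffer large = capture(new ByteArrayInputStream(data), 4);
        System.out.println(small.inMemory()); // true
        System.out.println(large.inMemory()); // false
        try (InputStream again = large.newStream()) {
            System.out.println(new String(again.readAllBytes(), StandardCharsets.UTF_8));
        }
        large.dispose();
    }
}
```

Callers that only sometimes need a second pass can simply skip the second newStream() call; the cost of the temp file is paid only when the input exceeds the threshold.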
>
>
>
> *From:* Tim Allison <talli...@apache.org>
> *Sent:* Thursday, February 25, 2021 5:48 AM
> *To:* user@tika.apache.org
> *Cc:* lfcnas...@gmail.com
> *Subject:* Re: Re-using a TikaStream
>
>
>
> My $0.02 would be to use TikaInputStream because that gets a lot more use
> and is battle-tested.  Within the last year or so, we started using
> RereadableInputStream in one of the Microsoft format parsers so it is also
> getting some use now.
>
>
>
> If you absolutely can't afford to spool to disk, then give
> RereadableInputStream a try.
>
>
>
> The inputstreamfactories, in my mind, are somewhat work-arounds for other
> use cases, e.g. retrying/batch etc.
>
>
>
> On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg <
> peter.kronenb...@torch.ai> wrote:
>
> So this might be moot, because it seems that TikaInputStream is already
> doing some magic and I’m not sure how.
>
> I was able to re-use the stream without doing anything special after a
> call to parse.  And in fact, I displayed stream.available() and
> stream.position() before and after the call to parse, and the full stream
> was still available at position 0.  What is TikaInputStream doing to make
> this happen?
>
>
>
> Just for some additional context, what I’m doing is running the file
> through Tika and then, depending on the file type, I want to do some
> additional non-tika processing.  I thought that once the Tika parse was
> done, the stream would be used up.
>
>
>
> What is going on?
>
>
>
>
>
> *From:* Peter Kronenberg <peter.kronenb...@torch.ai>
> *Sent:* Tuesday, February 23, 2021 10:00 AM
> *To:* user@tika.apache.org; lfcnas...@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
>
>
> This email was sent from outside your organisation, yet is displaying the
> name of someone from your organisation. This often happens in phishing
> attempts. Please only interact with this email if you know its source and
> that the content is safe.
>
>
>
> I just found the RereadableInputStream.  This looks more like what I was
> thinking.  Is there any reason not to use it?  What are the Tika best
> practices?  Pros/Cons of each approach?  If RereadableInputStream works as
> it’s supposed to, I’m not sure I see the advantage of InputStreamFactory
>
>
>
> *From:* Peter Kronenberg <peter.kronenb...@torch.ai>
> *Sent:* Monday, February 22, 2021 8:30 PM
> *To:* lfcnas...@gmail.com
> *Cc:* user@tika.apache.org
> *Subject:* RE: Re-using a TikaStream
>
>
>
> Oh ok.  I didn’t realize I needed to write my own class to implement it.  I
> was looking for some sort of existing framework.
>
>
>
> What is the purpose of the 2 InputStreamFactory classes?
>
>
>
> I was re-reading some emails with Nick Burch back around Dec 22-23 and
> maybe I mis-understood him, but it sounds like he was saying that
> TikaInputStream was smart enough to automatically spool the stream to disk
> to allow re-use.
>
>
>
> It seems to me that I need an extra pass through the data in order to save
> to disk.  I’m not starting from a File, but from a stream.  So if I need to
> read the stream twice, I really have to pass through the data 3 times,
> correct?
>
> Unless there is a way to save to disk during the first pass
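One way to "save to disk during the first pass" with plain JDK classes is a tee stream that writes every byte it hands to the reader into a file. The sketch below is an assumption-laden illustration, not something Tika provides under these names: TeeToFileInputStream and TeeDemo are invented for this example. The copy on disk is complete only if the first consumer reads the stream to the end.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper, not a Tika class: copies every byte read to a sink. */
final class TeeToFileInputStream extends FilterInputStream {
    private final OutputStream sink;

    TeeToFileInputStream(InputStream in, OutputStream sink) {
        super(in);
        this.sink = sink;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            sink.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            sink.write(buf, off, n);
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        // Read instead of skipping so that skipped bytes still reach the sink.
        byte[] buf = new byte[8192];
        long total = 0;
        while (total < n) {
            int r = read(buf, 0, (int) Math.min(buf.length, n - total));
            if (r < 0) {
                break;
            }
            total += r;
        }
        return total;
    }

    @Override
    public boolean markSupported() {
        return false; // mark/reset would write the same bytes to the sink twice
    }

    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            sink.close();
        }
    }
}

final class TeeDemo {
    /** Returns what the first pass (tee) and the second pass (file) each saw. */
    static String[] twoPasses(byte[] data) throws IOException {
        Path tmp = Files.createTempFile("tee-demo-", ".bin");
        try {
            String first;
            try (InputStream in = new TeeToFileInputStream(
                    new ByteArrayInputStream(data), Files.newOutputStream(tmp))) {
                first = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
            String second = new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            return new String[] {first, second};
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        String[] r = twoPasses("hello tika".getBytes(StandardCharsets.UTF_8));
        System.out.println(r[0].equals(r[1])); // true
    }
}
```

With this shape, two reads of the data cost two passes: the first over the original stream, the second over the file.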
>
>
>
> (try/catch removed for simplicity)
>
>
>
> TikaInputStream tis = TikaInputStream.get(inputStream);
>
> File file = tis.getFile();   // <- extra pass: spools the stream to a temp file
>
> tis = TikaInputStream.get(new MyInputStreamFactory(file));
>
> // first real pass
>
> InputStream is = tis.getInputStreamFactory().getInputStream();
>
> // second real pass
>
>
>
>
>
>
>
> *From:* Luís Filipe Nassif <lfcnas...@gmail.com>
> *Sent:* Monday, February 22, 2021 5:42 PM
> *To:* Peter Kronenberg <peter.kronenb...@torch.ai>
> *Cc:* user@tika.apache.org
> *Subject:* Re: Re-using a TikaStream
>
>
>
> Something like:
>
>
>
> class MyInputStreamFactory implements InputStreamFactory{
>
>
>
>     private File file;
>
>
>
>     public  MyInputStreamFactory(File file){
>
>         this.file = file;
>
>     }
>
>
>
>     public InputStream getInputStream() throws IOException {
>
>         return new FileInputStream(file);
>
>     }
>
> }
>
>
>
> in your client code:
>
>
>
> Parser parser = new AutoDetectParser();
>
> TikaInputStream tis =  TikaInputStream.get(new MyInputStreamFactory(file));
>
> parser.parse(tis, new ToTextContentHandler(), new Metadata(), new
> ParseContext());
>
>
>
> when you need to reuse the stream (into your parser):
>
>
>
> public void parse(InputStream stream, ContentHandler handler, Metadata
> metadata, ParseContext context)
>             throws IOException, SAXException, TikaException {
>
>    //(...)
>
>    TikaInputStream tis = TikaInputStream.get(stream);
>
>    if(tis.hasInputStreamFactory()){
>
>         try(InputStream is = tis.getInputStreamFactory().getInputStream()){
>
>               //consume the new stream
>
>         }
>
>    }else
>
>        throw new IOException("not a reusable inputStream");
>
>  }
>
>
>
> Of course this is useful if you are not processing files, e.g. reading
> files from the cloud or sockets.
>
>
>
> Regards,
>
> Luis
>
>
>
>
>
> Em seg., 22 de fev. de 2021 às 19:18, Peter Kronenberg <
> peter.kronenb...@torch.ai> escreveu:
>
> I sent this question late on Friday.  Sending it again.  Can you provide a
> little more information on how to use the InputStreamFactory?
>
>
>
> *From:* Peter Kronenberg <peter.kronenb...@torch.ai>
> *Sent:* Friday, February 19, 2021 5:10 PM
> *To:* user@tika.apache.org; lfcnas...@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
>
>
> There appear to be 2 InputStreamFactory classes: in tika-server-core and
> tika-io.  The one in server.core is the only one with a concrete class.
>
> I’m not quite sure I see how to use this.
>
> Normally, I create a TikaInputStream with
> TikaInputStream.get(InputStream).  How do I create it from an
> InputStreamFactory?
>
> TikaInputStream.getInputStreamFactory() only returns a factory if the
> TikaInputStream was created from a factory.
>
> Is there a good example of how this is used?
>
>
>
> *From:* Peter Kronenberg <peter.kronenb...@torch.ai>
> *Sent:* Friday, February 19, 2021 4:57 PM
> *To:* user@tika.apache.org; lfcnas...@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
>
>
> Thanks.  I thought that TikaInputStream already automatically saved to
> disk to allow re-reading.
>
>
>
> *From:* Luís Filipe Nassif <lfcnas...@gmail.com>
> *Sent:* Friday, February 19, 2021 3:44 PM
> *To:* user@tika.apache.org
> *Subject:* Re: Re-using a TikaStream
>
>
>
> You could call TikaInputStream.getPath() at the beginning of your parser,
> it will spool to file if not file based. After consuming the original
> inputStream, create a new one from the temp file created.
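The spool-then-reopen pattern described above can be illustrated without Tika at all. The following is a minimal JDK-only sketch of the idea (copy the stream to a temp file once, then open as many fresh streams from that file as needed); the class and method names are invented here and say nothing about what getPath() does internally.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper, not a Tika class. */
final class SpoolToTempFile {
    /** Drains the stream into a new temp file and returns its path. */
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spooled-", ".tmp");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Opens a fresh stream on the file and reads it fully. */
    static String readAll(Path p) throws IOException {
        try (InputStream in = Files.newInputStream(p)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello tika".getBytes(StandardCharsets.UTF_8);
        Path tmp = spool(new ByteArrayInputStream(data));
        try {
            System.out.println(readAll(tmp)); // first pass
            System.out.println(readAll(tmp)); // second pass, same content
        } finally {
            Files.deleteIfExists(tmp); // the caller owns temp-file cleanup in this sketch
        }
    }
}
```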
>
>
>
> If you are using 2.0.0-ALPHA, there is:
>
>
>
>
> https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/io/InputStreamFactory.java
>
>
>
> Use with the new methods from TikaInputStream:
>
> public static TikaInputStream get(InputStreamFactory factory)
>
> public InputStreamFactory getInputStreamFactory()
>
>
>
> Hope this helps,
>
> Luis
>
>
>
> Em sex., 19 de fev. de 2021 às 16:09, Peter Kronenberg <
> peter.kronenb...@torch.ai> escreveu:
>
> If I finish parsing a TikaStream, can I re-use the stream (before it is
> closed)?  I know you said that there is some magic behind the scenes where
> it spools it to a file.  Can I just call reset() to start from the
> beginning?
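For background on the reset() question: in the JDK, reset() only rewinds to the last mark(), and only while no more than the mark's read limit has been consumed. The sketch below shows that contract with standard streams; MarkResetDemo is a name invented for this example, and it does not demonstrate TikaInputStream's own behavior after a full parse.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class, not part of Tika. */
final class MarkResetDemo {
    /** Reads the first readLimit bytes twice using mark/reset. */
    static String readTwice(InputStream raw, int readLimit) throws IOException {
        // Wrap streams that cannot mark; BufferedInputStream adds that ability.
        InputStream in = raw.markSupported() ? raw : new BufferedInputStream(raw);
        in.mark(readLimit);                      // remember the current position
        byte[] first = in.readNBytes(readLimit); // stays within the read limit
        in.reset();                              // rewind to the mark
        byte[] second = in.readNBytes(readLimit);
        return new String(first, StandardCharsets.UTF_8)
                + "|" + new String(second, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("abcdef".getBytes(StandardCharsets.UTF_8));
        System.out.println(readTwice(in, 4)); // abcd|abcd
    }
}
```

Because the mark is invalidated once more than the read limit has been consumed, mark/reset suits peeking at a header for type detection better than replaying an arbitrarily large stream.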
>
>
>
> Peter
>
>
>
>
>