RE: Re-using a TikaStream

2021-02-23 Thread Nick Burch

On Tue, 23 Feb 2021, Peter Kronenberg wrote:
I was re-reading some emails with Nick Burch back around Dec 22-23 and 
maybe I misunderstood him, but it sounds like he was saying that 
TikaInputStream was smart enough to automatically spool the stream to 
disk to allow re-use.


If a parser knows it is going to need a File, or knows it will need to 
re-read the stream multiple times, it can tell TikaInputStream, which will 
save the stream to a temp file. If you as the caller know this, you can 
force it with a getFile() / getPath() call.
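
To illustrate the spooling idea in isolation, here is a minimal JDK-only sketch. It is not Tika's implementation: SpoolingSource, getPath() and reopen() are hypothetical names chosen for this example, and it assumes the original stream has not been read yet when the file is first requested.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not a Tika class): spools a one-shot stream to a temp file on demand. */
class SpoolingSource implements AutoCloseable {
    private final InputStream original;
    private Path spooled; // stays null until a caller asks for a file

    SpoolingSource(InputStream original) {
        this.original = original;
    }

    /** Copies the stream to a temp file on the first call; later calls reuse that file. */
    Path getPath() throws IOException {
        if (spooled == null) {
            spooled = Files.createTempFile("spool-", ".tmp");
            // createTempFile already created the file, so the copy must overwrite it
            Files.copy(original, spooled, StandardCopyOption.REPLACE_EXISTING);
        }
        return spooled;
    }

    /** Opens a fresh stream over the spooled copy; every call starts at byte 0. */
    InputStream reopen() throws IOException {
        return Files.newInputStream(getPath());
    }

    @Override
    public void close() throws IOException {
        original.close();
        if (spooled != null) {
            Files.deleteIfExists(spooled);
        }
    }
}
```

The cost is one full copy of the data to local disk, paid once, after which any number of re-reads come from the file.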


If spooling to a local file is expensive, but restarting the stream 
reading is cheap, then the InputStreamFactory can be used instead. 
Typically that's with cloud storage or the like.


Nick


RE: Re-using a TikaStream

2021-02-23 Thread Peter Kronenberg
I just found the RereadableInputStream.  This looks more like what I was 
thinking.  Is there any reason not to use it?  What are the Tika best 
practices?  Pros/cons of each approach?  If RereadableInputStream works as it’s 
supposed to, I’m not sure I see the advantage of InputStreamFactory.

From: Peter Kronenberg 
Sent: Monday, February 22, 2021 8:30 PM
To: lfcnas...@gmail.com
Cc: user@tika.apache.org
Subject: RE: Re-using a TikaStream

Oh, OK.  I didn’t realize I needed to write my own class to implement it.  I was 
looking for some sort of existing framework.

What is the purpose of the 2 InputStreamFactory classes?

I was re-reading some emails with Nick Burch back around Dec 22-23 and maybe I 
misunderstood him, but it sounds like he was saying that TikaInputStream was 
smart enough to automatically spool the stream to disk to allow re-use.

It seems to me that I need an extra pass through the data in order to save to 
disk.  I’m not starting from a File, but from a stream.  So if I need to read 
the stream twice, I really have to pass through the data 3 times, correct?
Unless there is a way to save to disk during the first pass?

(try/catch removed for simplicity)

TikaInputStream tis = TikaInputStream.get(inputStream);
File file = tis.getFile();   // <== extra pass: spools the stream to a temp file
tis = TikaInputStream.get(new MyInputStreamFactory(file));
// first real pass
InputStream is = tis.getInputStreamFactory().getInputStream();
// second real pass
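
One way to picture "saving to disk during the first pass" is a tee-style wrapper that copies every byte to a side output as it is read. The sketch below uses only the JDK; TeeStream is a hypothetical class for illustration, not something Tika provides, and the copy is only complete if the first reader consumes the stream to the end.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical helper (not a Tika class): copies every byte it reads to a side output. */
class TeeStream extends FilterInputStream {
    private final OutputStream copy;

    TeeStream(InputStream in, OutputStream copy) {
        super(in);
        this.copy = copy;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            copy.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            copy.write(buf, off, n);
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        // Read instead of skipping so that skipped bytes still reach the copy.
        byte[] buf = new byte[8192];
        long done = 0;
        while (done < n) {
            int r = read(buf, 0, (int) Math.min(buf.length, n - done));
            if (r == -1) {
                break;
            }
            done += r;
        }
        return done;
    }

    @Override
    public boolean markSupported() {
        return false; // a reset would cause the same bytes to be copied twice
    }

    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            copy.close();
        }
    }
}
```

Wrapping the source stream with a file-backed OutputStream would leave a complete copy on disk after a full first pass. Whether this combines well with a given parser is an open question here, since code that relies on mark/reset will not get it from this wrapper.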



From: Luís Filipe Nassif <lfcnas...@gmail.com>
Sent: Monday, February 22, 2021 5:42 PM
To: Peter Kronenberg <peter.kronenb...@torch.ai>
Cc: user@tika.apache.org
Subject: Re: Re-using a TikaStream

Something like:

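import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.io.InputStreamFactory;

class MyInputStreamFactory implements InputStreamFactory {

    private final File file;

    public MyInputStreamFactory(File file) {
        this.file = file;
    }

    // Each call opens a fresh stream positioned at the start of the file.
    @Override
    public InputStream getInputStream() throws IOException {
        return new FileInputStream(file);
    }
}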

in your client code:

Parser parser = new AutoDetectParser();
TikaInputStream tis = TikaInputStream.get(new MyInputStreamFactory(file));
parser.parse(tis, new ToTextContentHandler(), new Metadata(), new ParseContext());

when you need to reuse the stream (inside your parser):

public void parse(InputStream stream, ContentHandler handler,
                  Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException {
    // (...)
    TikaInputStream tis = TikaInputStream.get(stream);
    if (tis.hasInputStreamFactory()) {
        // Ask the factory for a fresh stream instead of rewinding the old one
        try (InputStream is = tis.getInputStreamFactory().getInputStream()) {
            // consume the new stream
        }
    } else {
        throw new IOException("not a reusable inputStream");
    }
}

Of course, this is useful when you are not processing local files, e.g. when 
reading from the cloud or from sockets.

Regards,
Luis


On Mon, Feb 22, 2021 at 19:18, Peter Kronenberg 
<peter.kronenb...@torch.ai> wrote:
I sent this question late on Friday.  Sending it again.  Can you provide a 
little more information on how to use the InputStreamFactory?

From: Peter Kronenberg <peter.kronenb...@torch.ai>
Sent: Friday, February 19, 2021 5:10 PM
To: user@tika.apache.org; lfcnas...@gmail.com
Subject: RE: Re-using a TikaStream

There appear to be 2 InputStreamFactory classes: one in tika-server-core and 
one in tika-io.  The one in tika-server-core is the only one with a concrete 
class.  I’m not quite sure I see how to use this.
Normally, I create a TikaInputStream with TikaInputStream.get(InputStream).  
How do I create it from an InputStreamFactory?
TikaInputStream.getInputStreamFactory() only returns a factory if the 
TikaInputStream was created from a factory.
Is there a good example of how this is used?

From: Peter Kronenberg <peter.kronenb...@torch.ai>
Sent: Friday, February 19, 2021 4:57 PM
To: user@tika.apache.org; lfcnas...@gmail.com
Subject: RE: Re-using a TikaStream

Thanks.  I thought that TikaInputStream already automatically saved to disk to 
allow re-reading.

From: Luís Filipe Nassif <lfcnas...@gmail.com>
Sent: Friday, February 19, 2021 3:44 PM
To: user@tika.apache.org
Su

RE: Re-using a TikaStream

2021-02-23 Thread Peter Kronenberg
So this might be moot, because it seems that TikaInputStream is already doing 
some magic and I’m not sure how.
I was able to re-use the stream without doing anything special after a call to 
parse.  And in fact, I displayed stream.available() and stream.position() 
before and after the call to parse, and the full stream was still available at 
position 0.  What is TikaInputStream doing to make this happen?

Just for some additional context, what I’m doing is running the file through 
Tika and then, depending on the file type, I want to do some additional 
non-Tika processing.  I thought that once the Tika parse was done, the stream 
would be used up.

What is going on?

