RE: Re-using a TikaStream

2021-02-25 Thread Peter Kronenberg
Or reading from the cloud, either Google or AWS, in which case I also get a
stream.  I know what the file name is, but can’t really use it.

From: Peter Kronenberg 
Sent: Thursday, February 25, 2021 11:19 AM
To: talli...@apache.org
Cc: lfcnas...@gmail.com; user@tika.apache.org
Subject: RE: Re-using a TikaStream

With a stream.  I am reading arbitrary streams and one of the goals is to 
figure out what it is. So there is no file backing it.

From: Tim Allison <talli...@apache.org>
Sent: Thursday, February 25, 2021 11:11 AM
To: Peter Kronenberg <peter.kronenb...@torch.ai>
Cc: lfcnas...@gmail.com; user@tika.apache.org
Subject: Re: Re-using a TikaStream

Are you initializing with a file or a stream?

On Thu, Feb 25, 2021 at 9:00 AM Peter Kronenberg <peter.kronenb...@torch.ai> wrote:
But how is TikaInputStream allowing me to re-use the stream without me doing 
anything special?   Is it automatically spooling to disk as needed?

I wouldn’t say that I can’t afford to spool to disk.  I’m just looking for the 
most reasonable solution.  I don’t know how big the streams are that I’ll be 
processing.  Obviously, if they’re big, then keeping them in memory is not
reasonable and disk is the only option.  But for smaller streams, if it can do 
it all in memory, that’s obviously better.  And for my use case, I don’t 
*always* have to re-read the stream.

From: Tim Allison <talli...@apache.org>
Sent: Thursday, February 25, 2021 5:48 AM
To: user@tika.apache.org
Cc: lfcnas...@gmail.com
Subject: Re: Re-using a TikaStream

My $0.02 would be to use TikaInputStream because that gets a lot more use and 
is battle-tested.  Within the last year or so, we started using 
RereadableInputStream in one of the Microsoft format parsers so it is also 
getting some use now.

If you absolutely can't afford to spool to disk, then give 
RereadableInputStream a try.

The inputstreamfactories, in my mind, are somewhat work-arounds for other use 
cases, e.g. retrying/batch etc.

On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg <peter.kronenb...@torch.ai> wrote:
So this might be moot, because it seems that TikaInputStream is already doing 
some magic and I’m not sure how.
I was able to re-use the stream without doing anything special after a call to 
parse.  And in fact, I displayed stream.available() and stream.position() 
before and after the call to parse, and the full stream was still available at 
position 0.  What is TikaInputStream doing to make this happen?

Just for some additional context, what I’m doing is running the file through 
Tika and then, depending on the file type, I want to do some additional 
non-tika processing.  I thought that once the Tika parse was done, the stream 
would be used up.

What is going on?


From: Peter Kronenberg <peter.kronenb...@torch.ai>
Sent: Tuesday, February 23, 2021 10:00 AM
To: user@tika.apache.org; lfcnas...@gmail.com
Subject: RE: Re-using a TikaStream

I just found the RereadableInputStream.  This looks more like what I was 
thinking.  Is there any reason not to use it?  What are the Tika best 
practices?  Pros/Cons of each approach?  If RereadableInputStream works as it’s 
supposed to, I’m not sure I see the advantage of InputStreamFactory

From: Peter Kronenberg <peter.kronenb...@torch.ai>
Sent: Monday, February 22, 2021 8:30 PM
To: lfcnas...@gmail.com
Cc: user@tika.apache.org
Subject: RE: Re-using a TikaStream

Oh ok.  I didn’t realize I needed to write my own class to implement it. I was
looking for some sort of existing framework.

What is the purpose of the 2 InputStreamFactory classes?

I was re-reading some emails with Nick Burch back around Dec 22-23 and maybe I
misunderstood him, but it sounds like he was saying that TikaInputStream was
smart enough to automatically spool the stream to disk to allow re-use.

It seems to me that I need an extra pass through the data in order to save to 
disk.  I’m not starting from a File, but from a stream.  So if I need to read 
the strea

RE: Re-using a TikaStream

2021-02-25 Thread Peter Kronenberg
With a stream.  I am reading arbitrary streams and one of the goals is to 
figure out what it is. So there is no file backing it.

From: Tim Allison 
Sent: Thursday, February 25, 2021 11:11 AM
To: Peter Kronenberg 
Cc: lfcnas...@gmail.com; user@tika.apache.org
Subject: Re: Re-using a TikaStream

Are you initializing with a file or a stream?

On Thu, Feb 25, 2021 at 9:00 AM Peter Kronenberg <peter.kronenb...@torch.ai> wrote:
But how is TikaInputStream allowing me to re-use the stream without me doing 
anything special?   Is it automatically spooling to disk as needed?

I wouldn’t say that I can’t afford to spool to disk.  I’m just looking for the 
most reasonable solution.  I don’t know how big the streams are that I’ll be 
processing.  Obviously, if they’re big, then keeping them in memory is not
reasonable and disk is the only option.  But for smaller streams, if it can do 
it all in memory, that’s obviously better.  And for my use case, I don’t 
*always* have to re-read the stream.

From: Tim Allison <talli...@apache.org>
Sent: Thursday, February 25, 2021 5:48 AM
To: user@tika.apache.org
Cc: lfcnas...@gmail.com
Subject: Re: Re-using a TikaStream

My $0.02 would be to use TikaInputStream because that gets a lot more use and 
is battle-tested.  Within the last year or so, we started using 
RereadableInputStream in one of the Microsoft format parsers so it is also 
getting some use now.

If you absolutely can't afford to spool to disk, then give 
RereadableInputStream a try.

The inputstreamfactories, in my mind, are somewhat work-arounds for other use 
cases, e.g. retrying/batch etc.

On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg <peter.kronenb...@torch.ai> wrote:
So this might be moot, because it seems that TikaInputStream is already doing 
some magic and I’m not sure how.
I was able to re-use the stream without doing anything special after a call to 
parse.  And in fact, I displayed stream.available() and stream.position() 
before and after the call to parse, and the full stream was still available at 
position 0.  What is TikaInputStream doing to make this happen?

Just for some additional context, what I’m doing is running the file through 
Tika and then, depending on the file type, I want to do some additional 
non-tika processing.  I thought that once the Tika parse was done, the stream 
would be used up.

What is going on?


From: Peter Kronenberg <peter.kronenb...@torch.ai>
Sent: Tuesday, February 23, 2021 10:00 AM
To: user@tika.apache.org; lfcnas...@gmail.com
Subject: RE: Re-using a TikaStream

I just found the RereadableInputStream.  This looks more like what I was 
thinking.  Is there any reason not to use it?  What are the Tika best 
practices?  Pros/Cons of each approach?  If RereadableInputStream works as it’s 
supposed to, I’m not sure I see the advantage of InputStreamFactory

From: Peter Kronenberg <peter.kronenb...@torch.ai>
Sent: Monday, February 22, 2021 8:30 PM
To: lfcnas...@gmail.com
Cc: user@tika.apache.org
Subject: RE: Re-using a TikaStream

Oh ok.  I didn’t realize I needed to write my own class to implement it. I was
looking for some sort of existing framework.

What is the purpose of the 2 InputStreamFactory classes?

I was re-reading some emails with Nick Burch back around Dec 22-23 and maybe I
misunderstood him, but it sounds like he was saying that TikaInputStream was
smart enough to automatically spool the stream to disk to allow re-use.

It seems to me that I need an extra pass through the data in order to save to 
disk.  I’m not starting from a File, but from a stream.  So if I need to read 
the stream twice, I really have to pass through the data 3 times, correct?
Unless there is a way to save to disk during the first pass

(try/catch removed for simplicity)

tis = TikaInputStream.get(inputStream);
file = tis.getFile();   // <== extra pass
tis = TikaInputStream.get(new MyInputStreamFactory(file));
// first real pass
InputStream is = tis.getInputStreamFactory().getInputStream();
// second real pass
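
On the question of saving to disk during the first pass: a minimal sketch, using only java.io and java.nio, of a stream wrapper that copies bytes to a file as they are read. SpoolingTeeInputStream is a hypothetical name, not a Tika class, and nothing here reflects how TikaInputStream works internally. The spool file is only complete if the first pass reads the stream to the end.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper (not a Tika class): copies every byte it reads to a spool file. */
final class SpoolingTeeInputStream extends FilterInputStream {
    private final OutputStream spool;

    SpoolingTeeInputStream(InputStream in, Path spoolFile) throws IOException {
        super(in);
        this.spool = Files.newOutputStream(spoolFile);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            spool.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            spool.write(buf, off, n);
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        // Read rather than skip, so skipped bytes still reach the spool file.
        byte[] scratch = new byte[8192];
        long done = 0;
        while (done < n) {
            int r = read(scratch, 0, (int) Math.min(scratch.length, n - done));
            if (r < 0) {
                break;
            }
            done += r;
        }
        return done;
    }

    // Rewinding would write the same bytes to the spool twice, so mark/reset is disabled.
    @Override
    public boolean markSupported() {
        return false;
    }

    @Override
    public void mark(int readLimit) {
        // intentionally a no-op
    }

    @Override
    public void reset() throws IOException {
        throw new IOException("mark/reset not supported");
    }

    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            spool.close();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello tika".getBytes(StandardCharsets.UTF_8);
        Path tmp = Files.createTempFile("spool", ".bin");
        try (InputStream first = new SpoolingTeeInputStream(new ByteArrayInputStream(data), tmp)) {
            first.readAllBytes(); // first pass: consume the stream to the end
        }
        byte[] second = Files.readAllBytes(tmp); // second pass comes from disk
        System.out.println(Arrays.equals(data, second));
        Files.delete(tmp);
    }
}
```

With a wrapper like this, two reads of the data cost one pass over the original stream plus one pass over the file, rather than three passes.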



From: Luís Filipe Nassif <lfcnas...@gmail.com>
Sent: Monday, February 22, 2021 5:42 PM
To: Peter Kronenberg <peter.kronenb...@torch.ai>
Cc: user@tika.apache.org
Subject: Re: Re-using a TikaStream

Something like:

class MyInputStreamF

Re: Re-using a TikaStream

2021-02-25 Thread Tim Allison
Are you initializing with a file or a stream?

On Thu, Feb 25, 2021 at 9:00 AM Peter Kronenberg 
wrote:

> But how is TikaInputStream allowing me to re-use the stream without me
> doing anything special?   Is it automatically spooling to disk as needed?
>
>
>
> I wouldn’t say that I can’t afford to spool to disk.  I’m just looking for
> the most reasonable solution.  I don’t know how big the streams are that
> I’ll be processing.  Obviously, if they’re big, then keeping them in memory
> is not reasonable and disk is the only option.  But for smaller streams, if
> it can do it all in memory, that’s obviously better.  And for my use case,
> I don’t *always* have to re-read the stream.
>
>
>
> *From:* Tim Allison 
> *Sent:* Thursday, February 25, 2021 5:48 AM
> *To:* user@tika.apache.org
> *Cc:* lfcnas...@gmail.com
> *Subject:* Re: Re-using a TikaStream
>
>
>
> My $0.02 would be to use TikaInputStream because that gets a lot more use
> and is battle-tested.  Within the last year or so, we started using
> RereadableInputStream in one of the Microsoft format parsers so it is also
> getting some use now.
>
>
>
> If you absolutely can't afford to spool to disk, then give
> RereadableInputStream a try.
>
>
>
> The inputstreamfactories, in my mind, are somewhat work-arounds for other
> use cases, e.g. retrying/batch etc.
>
>
>
> On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg <
> peter.kronenb...@torch.ai> wrote:
>
> So this might be moot, because it seems that TikaInputStream is already
> doing some magic and I’m not sure how.
>
> I was able to re-use the stream without doing anything special after a
> call to parse.  And in fact, I displayed stream.available() and
> stream.position() before and after the call to parse, and the full stream
> was still available at position 0.  What is TikaInputStream doing to make
> this happen?
>
>
>
> Just for some additional context, what I’m doing is running the file
> through Tika and then, depending on the file type, I want to do some
> additional non-tika processing.  I thought that once the Tika parse was
> done, the stream would be used up.
>
>
>
> What is going on?
>
>
>
>
>
> *From:* Peter Kronenberg 
> *Sent:* Tuesday, February 23, 2021 10:00 AM
> *To:* user@tika.apache.org; lfcnas...@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
>
>
> I just found the RereadableInputStream.  This looks more like what I was
> thinking.  Is there any reason not to use it?  What are the Tika best
> practices?  Pros/Cons of each approach?  If RereadableInputStream works as
> it’s supposed to, I’m not sure I see the advantage of InputStreamFactory
>
>
>
> *From:* Peter Kronenberg 
> *Sent:* Monday, February 22, 2021 8:30 PM
> *To:* lfcnas...@gmail.com
> *Cc:* user@tika.apache.org
> *Subject:* RE: Re-using a TikaStream
>
>
>
> Oh ok.  I didn’t realize I needed to write my own class to implement it. I
> was looking for some sort of existing framework.
>
>
>
> What is the purpose of the 2 InputStreamFactory classes?
>
>
>
> I was re-reading some emails with Nick Burch back around Dec 22-23 and
> maybe I misunderstood him, but it sounds like he was saying that
> TikaInputStream was smart enough to automatically spool the stream to disk
> to allow re-use.
>
>
>
> It seems to me that I need an extra pass through the data in order to save
> to disk.  I’m not starting from a File, but from a stream.  So if I need to
> read the stream twice, I really have to pass through the data 3 times,
> correct?
>
> Unless there is a way to save to disk during the first pass
>
>
>
> (try/catch removed for simplicity)
>
>
>
> tis = TikaInputStream.get(inputStream);
> file = tis.getFile();   // <== extra pass
> tis = TikaInputStream.get(new MyInputStreamFactory(file));
> // first real pass
> InputStream is = tis.getInputStreamFactory().getInputStream();
> // second real pass
>
>
>
>
>
>
>
> *From:* Luís Filipe Nassif 
> *Sent:* Monday, February 22, 2021 5:42 PM
> *To:* Peter Kronenberg 
> *Cc:* user@tika.apache.org
> *Subject:* Re: Re-using a TikaStream
>
>
>
> Something like:
>
>
>
> class MyInputStreamFactory implements InputStreamFactory {
>
>     private final File file;
>
>     public MyInputStreamFactory(File file) {
>         this.file = file;
>     }
>
>     // FileInputStream's constructor can throw, so the method declares IOException
>     @Override
>     public InputStream getInputStream() throws IOException {
>         return new FileInputStream(file);
>     }
> }
>
>
>
> in your client code:
>
>
>
> Parser parser = new AutoDetectParser();
>
> TikaInputStream tis =  TikaInputStrea

RE: Re-using a TikaStream

2021-02-25 Thread Peter Kronenberg
But how is TikaInputStream allowing me to re-use the stream without me doing 
anything special?   Is it automatically spooling to disk as needed?

I wouldn’t say that I can’t afford to spool to disk.  I’m just looking for the 
most reasonable solution.  I don’t know how big the streams are that I’ll be 
processing.  Obviously, if they’re big, then keeping them in memory is not
reasonable and disk is the only option.  But for smaller streams, if it can do 
it all in memory, that’s obviously better.  And for my use case, I don’t 
*always* have to re-read the stream.
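
For context on how a stream can look untouched after something has read from it: java.io streams that support mark/reset can be rewound after a bounded read, which is a common way to sniff a file type from the leading bytes. The sketch below shows only that generic mechanism. MarkResetDemo and peek are hypothetical names, and this thread does not establish that mark/reset is what explains the behaviour seen after a full parse.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class: peek at the head of a stream, then rewind it. */
final class MarkResetDemo {

    /** Reads up to limit bytes and then resets the stream to where it started. */
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit); // remember the current position for up to limit bytes
        try {
            return in.readNBytes(limit);
        } finally {
            in.reset(); // rewind so the next reader starts from the same position
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.7 body".getBytes(StandardCharsets.UTF_8);
        // BufferedInputStream adds mark/reset support to streams that lack it.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        byte[] head = peek(in, 4);
        System.out.println(new String(head, StandardCharsets.UTF_8));
        System.out.println(in.readAllBytes().length == data.length);
    }
}
```

Mark/reset only covers as many bytes as the read limit passed to mark, so it suits type detection better than re-reading a whole stream of unknown size.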

From: Tim Allison 
Sent: Thursday, February 25, 2021 5:48 AM
To: user@tika.apache.org
Cc: lfcnas...@gmail.com
Subject: Re: Re-using a TikaStream

My $0.02 would be to use TikaInputStream because that gets a lot more use and 
is battle-tested.  Within the last year or so, we started using 
RereadableInputStream in one of the Microsoft format parsers so it is also 
getting some use now.

If you absolutely can't afford to spool to disk, then give 
RereadableInputStream a try.

The inputstreamfactories, in my mind, are somewhat work-arounds for other use 
cases, e.g. retrying/batch etc.

On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg <peter.kronenb...@torch.ai> wrote:
So this might be moot, because it seems that TikaInputStream is already doing 
some magic and I’m not sure how.
I was able to re-use the stream without doing anything special after a call to 
parse.  And in fact, I displayed stream.available() and stream.position() 
before and after the call to parse, and the full stream was still available at 
position 0.  What is TikaInputStream doing to make this happen?

Just for some additional context, what I’m doing is running the file through 
Tika and then, depending on the file type, I want to do some additional 
non-tika processing.  I thought that once the Tika parse was done, the stream 
would be used up.

What is going on?


From: Peter Kronenberg <peter.kronenb...@torch.ai>
Sent: Tuesday, February 23, 2021 10:00 AM
To: user@tika.apache.org; lfcnas...@gmail.com
Subject: RE: Re-using a TikaStream

I just found the RereadableInputStream.  This looks more like what I was 
thinking.  Is there any reason not to use it?  What are the Tika best 
practices?  Pros/Cons of each approach?  If RereadableInputStream works as it’s 
supposed to, I’m not sure I see the advantage of InputStreamFactory

From: Peter Kronenberg <peter.kronenb...@torch.ai>
Sent: Monday, February 22, 2021 8:30 PM
To: lfcnas...@gmail.com
Cc: user@tika.apache.org
Subject: RE: Re-using a TikaStream

Oh ok.  I didn’t realize I needed to write my own class to implement it. I was
looking for some sort of existing framework.

What is the purpose of the 2 InputStreamFactory classes?

I was re-reading some emails with Nick Burch back around Dec 22-23 and maybe I
misunderstood him, but it sounds like he was saying that TikaInputStream was
smart enough to automatically spool the stream to disk to allow re-use.

It seems to me that I need an extra pass through the data in order to save to 
disk.  I’m not starting from a File, but from a stream.  So if I need to read 
the stream twice, I really have to pass through the data 3 times, correct?
Unless there is a way to save to disk during the first pass

(try/catch removed for simplicity)

tis = TikaInputStream.get(inputStream);
file = tis.getFile();   // <== extra pass
tis = TikaInputStream.get(new MyInputStreamFactory(file));
// first real pass
InputStream is = tis.getInputStreamFactory().getInputStream();
// second real pass



From: Luís Filipe Nassif <lfcnas...@gmail.com>
Sent: Monday, February 22, 2021 5:42 PM
To: Peter Kronenberg <peter.kronenb...@torch.ai>
Cc: user@tika.apache.org
Subject: Re: Re-using a TikaStream

Something like:

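class MyInputStreamFactory implements InputStreamFactory {

    private final File file;

    public MyInputStreamFactory(File file) {
        this.file = file;
    }

    // FileInputStream's constructor can throw, so the method declares IOException
    @Override
    public InputStream getInputStream() throws IOException {
        return new FileInputStream(file);
    }
}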

in your client code:

Parser parser = new AutoDetectParser();
TikaInputStream tis = TikaInputStream.get(new MyInputStreamFactory(file));
parser.parse(tis, new ToTextContentHandler(), new Metadata(), new ParseContext());
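
The factory idea in the snippet above, where each pass asks for a brand-new stream instead of rewinding an old one, can be tried without Tika on the classpath. This is a minimal sketch under that assumption: StreamSource and FactoryDemo are hypothetical names standing in for the pattern, not Tika's InputStreamFactory interface.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for the factory idea: each call opens a fresh stream. */
interface StreamSource {
    InputStream open() throws IOException;
}

final class FactoryDemo {

    /** One full pass over the data; the stream is opened and closed here. */
    static long countBytes(StreamSource source) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (InputStream in = source.open()) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("factory-demo", ".txt");
        Files.write(file, "hello".getBytes(StandardCharsets.UTF_8));
        StreamSource source = () -> Files.newInputStream(file);
        System.out.println(countBytes(source)); // first pass
        System.out.println(countBytes(source)); // second pass, independent stream
        Files.delete(file);
    }
}
```

The pattern only helps when the underlying data can be reopened cheaply, such as a local file or a re-fetchable object; a one-shot socket stream would still need to be spooled somewhere first.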

when you need to reuse the stream (into your parser):

public 

Re: Re-using a TikaStream

2021-02-25 Thread Tim Allison
My $0.02 would be to use TikaInputStream because that gets a lot more use
and is battle-tested.  Within the last year or so, we started using
RereadableInputStream in one of the Microsoft format parsers so it is also
getting some use now.

If you absolutely can't afford to spool to disk, then give
RereadableInputStream a try.

The inputstreamfactories, in my mind, are somewhat work-arounds for other
use cases, e.g. retrying/batch etc.
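
The memory-versus-disk trade-off discussed here can be illustrated with plain java.io: keep small inputs in memory and spill larger ones to a temporary file. The sketch below is not Tika code and does not reproduce TikaInputStream or RereadableInputStream. ThresholdSpool, capture, and maxInMemory are hypothetical names, and unlike a lazy wrapper it reads the whole input up front.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: holds a stream's bytes in memory, or on disk if too large. */
final class ThresholdSpool implements Closeable {
    private final byte[] memory; // non-null when the content fit in memory
    private final Path file;     // non-null when the content was spilled to disk

    private ThresholdSpool(byte[] memory, Path file) {
        this.memory = memory;
        this.file = file;
    }

    /** Reads the input once; spills to a temp file once roughly maxInMemory bytes are exceeded. */
    static ThresholdSpool capture(InputStream in, int maxInMemory) throws IOException {
        ByteArrayOutputStream head = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            head.write(buf, 0, n);
            if (head.size() > maxInMemory) {
                // Too big: move what was buffered to disk and stream the rest there.
                Path tmp = Files.createTempFile("spool", ".bin");
                try (OutputStream out = Files.newOutputStream(tmp)) {
                    head.writeTo(out);
                    in.transferTo(out);
                }
                return new ThresholdSpool(null, tmp);
            }
        }
        return new ThresholdSpool(head.toByteArray(), null);
    }

    boolean onDisk() {
        return file != null;
    }

    /** Opens a fresh stream over the captured bytes; may be called repeatedly. */
    InputStream open() throws IOException {
        return file != null ? Files.newInputStream(file) : new ByteArrayInputStream(memory);
    }

    @Override
    public void close() throws IOException {
        if (file != null) {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "0123456789".getBytes(StandardCharsets.UTF_8);
        try (ThresholdSpool spool = capture(new ByteArrayInputStream(data), 4)) {
            System.out.println(spool.onDisk());
            try (InputStream again = spool.open()) {
                System.out.println(again.readAllBytes().length);
            }
        }
    }
}
```

A threshold like this avoids disk I/O for small inputs while bounding memory use for large ones; the right value depends on the workload and is not something this thread specifies.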

On Tue, Feb 23, 2021 at 11:41 AM Peter Kronenberg 
wrote:

> So this might be moot, because it seems that TikaInputStream is already
> doing some magic and I’m not sure how.
>
> I was able to re-use the stream without doing anything special after a
> call to parse.  And in fact, I displayed stream.available() and
> stream.position() before and after the call to parse, and the full stream
> was still available at position 0.  What is TikaInputStream doing to make
> this happen?
>
>
>
> Just for some additional context, what I’m doing is running the file
> through Tika and then, depending on the file type, I want to do some
> additional non-tika processing.  I thought that once the Tika parse was
> done, the stream would be used up.
>
>
>
> What is going on?
>
>
>
>
>
> *From:* Peter Kronenberg 
> *Sent:* Tuesday, February 23, 2021 10:00 AM
> *To:* user@tika.apache.org; lfcnas...@gmail.com
> *Subject:* RE: Re-using a TikaStream
>
>
>
> I just found the RereadableInputStream.  This looks more like what I was
> thinking.  Is there any reason not to use it?  What are the Tika best
> practices?  Pros/Cons of each approach?  If RereadableInputStream works as
> it’s supposed to, I’m not sure I see the advantage of InputStreamFactory
>
>
>
> *From:* Peter Kronenberg 
> *Sent:* Monday, February 22, 2021 8:30 PM
> *To:* lfcnas...@gmail.com
> *Cc:* user@tika.apache.org
> *Subject:* RE: Re-using a TikaStream
>
>
>
> Oh ok.  I didn’t realize I needed to write my own class to implement it. I
> was looking for some sort of existing framework.
>
>
>
> What is the purpose of the 2 InputStreamFactory classes?
>
>
>
> I was re-reading some emails with Nick Burch back around Dec 22-23 and
> maybe I misunderstood him, but it sounds like he was saying that
> TikaInputStream was smart enough to automatically spool the stream to disk
> to allow re-use.
>
>
>
> It seems to me that I need an extra pass through the data in order to save
> to disk.  I’m not starting from a File, but from a stream.  So if I need to
> read the stream twice, I really have to pass through the data 3 times,
> correct?
>
> Unless there is a way to save to disk during the first pass
>
>
>
> (try/catch removed for simplicity)
>
>
>
> tis = TikaInputStream.get(inputStream);
> file = tis.getFile();   // <== extra pass
> tis = TikaInputStream.get(new MyInputStreamFactory(file));
> // first real pass
> InputStream is = tis.getInputStreamFactory().getInputStream();
> // second real pass
>
>
>
>
>
>
>
> *From:* Luís Filipe Nassif 
> *Sent:* Monday, February 22, 2021 5:42 PM
> *To:* Peter Kronenberg 
> *Cc:* user@tika.apache.org
> *Subject:* Re: Re-using a TikaStream
>
>
>
> Something like:
>
>
>
> class MyInputStreamFactory implements InputStreamFactory {
>
>     private final File file;
>
>     public MyInputStreamFactory(File file) {
>         this.file = file;
>     }
>
>     // FileInputStream's constructor can throw, so the method declares IOException
>     @Override
>     public InputStream getInputStream() throws IOException {
>         return new FileInputStream(file);
>     }
> }
>
>
>
> in your client code:
>
>
>
> Parser parser = new AutoDetectParser();
> TikaInputStream tis = TikaInputStream.get(new MyInputStreamFactory(file));
> parser.parse(tis, new ToTextContentHandler(), new Metadata(), new ParseContext());
>
>
>
> when you need to reuse the stream (into your parser):
>
>
>
> public void parse(InputStream stream, ContentHandler handler,
>         Metadata metadata, ParseContext context)
>         throws IOException, SAXException, TikaException {
>     //(...)
>     TikaInputStream tis = TikaInputStream.get(stream);
>     if (tis.hasInputStreamFactory()) {
>         try (InputStream is = tis.getInputStreamFactory().getInputStream()) {
>             // consume the new stream
>         }
>     } else {
>         throw new IOException("not a reusable inputStream");
>     }
> }
>
>
>
> Of course this is useful if you are not processing files, e.g. reading
> files from the cloud or sockets.
>
>
>
> Regards,
>
> Luis
>
>
>
>
>
> On Mon, Feb 22, 2021 at 19:18, Peter Kronenberg <
> peter.kronenb...@to