RE: Is creating new AutoDetectParsers expensive?

2016-09-30 Thread Allison, Timothy B.
In an earlier version of tika-batch, we had a single AutoDetectParser per 
thread, and we had no problems.  I experimented with a single AutoDetectParser 
across the threads, and we didn’t have problems.

Because of configuration issues, tika-batch is now creating a new parser for 
each file.

In our unit test suite, the last time I experimented with this, the first 
initialization did take a while, but after that there was no measurable extra 
cost to instantiating a new parser. In short, we didn’t save anything by using 
a static AutoDetectParser instead of just instantiating a new one for each 
unit test.
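
The observation above (a slow first initialization, cheap ones afterwards) can be checked with a small timing harness. This is only a sketch: `InstantiationTimer` and `timeInstantiations` are hypothetical names, and the `StringBuilder` factory is a stand-in so the code runs without Tika on the classpath. With Tika available, the factory would be `() -> new AutoDetectParser()`.

```java
import java.util.function.Supplier;

class InstantiationTimer {

    // Calls the factory n times and returns the elapsed nanoseconds of each call,
    // so the first (warm-up) instantiation can be compared with later ones.
    static long[] timeInstantiations(Supplier<?> factory, int n) {
        long[] nanos = new long[n];
        for (int i = 0; i < n; i++) {
            long start = System.nanoTime();
            Object instance = factory.get();
            nanos[i] = System.nanoTime() - start;
            if (instance == null) {
                throw new IllegalStateException("factory returned null");
            }
        }
        return nanos;
    }

    public static void main(String[] args) {
        // Stand-in factory (assumption). With Tika on the classpath, use:
        //   timeInstantiations(() -> new AutoDetectParser(), 5)
        long[] nanos = timeInstantiations(StringBuilder::new, 5);
        for (int i = 0; i < nanos.length; i++) {
            System.out.println("instantiation " + (i + 1) + ": " + nanos[i] + " ns");
        }
    }
}
```

If the later timings are negligible next to the first, creating a parser per file costs little beyond the one-time warm-up.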

If you are going from file system to file system, you might want to consider 
tika-batch.

java -jar tika-app.jar -i  -o 

If you have a whole lot of files (millions), try to isolate Tika in its own JVM 
or server or data center; bad things can happen.  See slide 17: 
http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf

And: 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
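
One way to read the isolation advice is to run Tika as a child process with a timeout, so that a hang or crash cannot take down the calling application. The sketch below uses only the JDK. `IsolatedTikaRunner`, `buildCommand`, `runWithTimeout`, the `-Xmx1g` heap cap, and the paths are assumptions for illustration; the `-i` and `-o` flags are the ones from the command above.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

class IsolatedTikaRunner {

    // Builds the child-JVM command line. -i/-o are the tika-app batch flags
    // quoted in the message; -Xmx1g is an assumed heap cap for the child.
    static List<String> buildCommand(String tikaAppJar, String inputDir, String outputDir) {
        return Arrays.asList("java", "-Xmx1g", "-jar", tikaAppJar, "-i", inputDir, "-o", outputDir);
    }

    // Runs the command in a separate process and kills it if it exceeds the
    // timeout. Returns the exit code, or -1 if the process had to be killed.
    static int runWithTimeout(List<String> command, long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        if (!process.waitFor(timeout, unit)) {
            process.destroyForcibly();
            process.waitFor();
            return -1;
        }
        return process.exitValue();
    }

    public static void main(String[] args) {
        // Hypothetical paths; replace with real ones before running the child.
        List<String> command = buildCommand("tika-app.jar", "/data/in", "/data/out");
        System.out.println(String.join(" ", command));
        // To actually launch it: runWithTimeout(command, 6, TimeUnit.HOURS);
    }
}
```

The parent process only sees an exit code, so an out-of-memory error or endless loop in a parser stays contained in the child JVM.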



Re: Is creating new AutoDetectParsers expensive?

2016-09-30 Thread Haris Osmanagic
I read the first sentence and thought: "Yes! I can save us a bunch
of memory!"

Then I read the second: "Oh, oh, do I dare try it out?" : )

Thank you very much for the super-speedy response!



Re: Code parser?

2016-09-30 Thread Mark Kerzner
Nick, Marcus,

thank you for your help. It works great, and one of the problems that I saw
was indeed with my code, not Tika.

Mark



On Thu, Sep 29, 2016 at 5:21 PM, Nick Burch wrote:

> On Wed, 28 Sep 2016, Mark Kerzner wrote:
>
>> probably yes, but how do I tell it which parser to use? Today, I just do
>> this:
>>
>> String text = tika.parseToString(inputStream, metadata);
>>
>> and it knows the parser.
>>
>
> That might be your issue. It's quite hard to identify the language of a
> piece of source code from just the first few hundred bytes of text. If you
> tell Tika the filename, including the extension, it'll have much more luck
> spotting that the file is code and using the appropriate parser!
>
> (Binary files often have common magic at/near the start that helps Tika
> identify the file type; source code is text-based and lacks that.)
>
> Nick
>
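
Nick's point can be illustrated with a toy detector that checks magic bytes first and falls back to the file extension. This is not Tika's detection algorithm: `ToyTypeDetector`, its two small tables, and the fallback to `text/plain` are assumptions made for the sketch. In Tika 1.x the usual way to pass the name is `metadata.set(Metadata.RESOURCE_NAME_KEY, fileName)` before calling `parseToString`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class ToyTypeDetector {

    // Illustrative tables only; a real detector knows far more types.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    private static final Map<String, String> EXTENSIONS = new LinkedHashMap<>();

    static {
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
        EXTENSIONS.put(".java", "text/x-java-source");
        EXTENSIONS.put(".py", "text/x-python");
    }

    // Checks the leading bytes against known magic numbers; if none match,
    // falls back to the file name's extension (fileName may be null).
    static String detect(byte[] head, String fileName) {
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            byte[] magic = entry.getValue();
            if (head.length >= magic.length
                    && Arrays.equals(Arrays.copyOf(head, magic.length), magic)) {
                return entry.getKey();
            }
        }
        if (fileName != null) {
            String lower = fileName.toLowerCase();
            for (Map.Entry<String, String> entry : EXTENSIONS.entrySet()) {
                if (lower.endsWith(entry.getKey())) {
                    return entry.getValue();
                }
            }
        }
        // Toy assumption: anything without magic or a known extension is plain text.
        return "text/plain";
    }

    public static void main(String[] args) {
        byte[] source = "public class A {}".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(source, null));      // no name hint
        System.out.println(detect(source, "A.java"));  // name hint supplied
    }
}
```

Without the name, the source file is indistinguishable from any other text; with it, the extension settles the type.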


RE: Is creating new AutoDetectParsers expensive?

2016-09-30 Thread Allison, Timothy B.
You can reuse AutoDetectParser in a multithreaded environment.  You shouldn’t 
have problems with performance or thread safety.

If you find otherwise, please let us know! ☺

From: Haris Osmanagic [mailto:haris.osmana...@gmail.com]
Sent: Friday, September 30, 2016 10:36 AM
To: user@tika.apache.org
Subject: Is creating new AutoDetectParsers expensive?

Hi all!
Let's assume there are a great many files to be parsed, and the operation is 
repeated a relatively large number of times each day.
Is it, in that case, too expensive to create a new AutoDetectParser for every 
file? Or, in other words, if I were to reuse an AutoDetectParser for a large 
number of files, would I:
* Have problems with thread-safety?
* Have problems with performance?
Thank you very much!
Haris Osmanagić