Hi Tim,

Following up on our mail thread: could you share more documentation on how to 
use headers to configure the PDFParser on the fly?

Thanks,
Julien

-----Original Message-----
From: Julien Massiera <julien.massi...@francelabs.com>
Sent: Friday, February 3, 2023 13:08
To: dev@tika.apache.org
Subject: RE: Adding arguments to configure tika from the rest calls

Hi Tim,

Configuring NER parsing via headers, as is done for PDFParserConfig, sounds 
like an interesting approach. I only discovered that feature thanks to your 
reply, and when I looked for documentation on it, the only thing I found was a 
TBD note on this page:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066

Could you tell us more about how to use it, so that we can test it and get a 
better idea of how it works and how useful it would be for NER?

Thanks,
Julien 

-----Original Message-----
From: Tim Allison <talli...@apache.org>
Sent: Tuesday, January 31, 2023 13:19
To: dev@tika.apache.org
Subject: Re: Adding arguments to configure tika from the rest calls

Configuring specific parsers that don't have their own parser config objects is 
a pain.  For example, we currently have an option to set PDFParserConfig and 
TesseractParserConfig options via headers to tika-server...and we have a way to 
extend this functionality to other parsers.  This option is "not pretty"(TM), 
but it has the benefit of correctly differentiating creation-time settings 
(applies to all files) from runtime settings (applies to a specific file), and 
this process reuses a single static parser so there's no overhead in rebuilding 
the parser object for every file.
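
For illustration, here is a minimal Python sketch of the header-based approach described above. It assumes that tika-server maps headers starting with "X-Tika-PDF" onto PDFParserConfig properties, and that the server listens on localhost:9998; the two example header names (X-Tika-PDFextractInlineImages, X-Tika-PDFOcrStrategy) should be checked against the tika-server version in use. The helper names pdf_config_headers and parse_pdf are hypothetical and are not part of Tika.

```python
import http.client

# Assumption: tika-server maps headers with this prefix onto PDFParserConfig
# properties for the current request only. Verify against your Tika version.
PDF_HEADER_PREFIX = "X-Tika-PDF"


def pdf_config_headers(options):
    """Convert per-request PDF options into tika-server style headers.

    Booleans are rendered as "true"/"false"; everything else via str().
    """
    headers = {}
    for name, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        headers[PDF_HEADER_PREFIX + name] = str(value)
    return headers


def parse_pdf(document_bytes, options, host="localhost", port=9998):
    """PUT a document to /tika with per-request PDF settings (hypothetical helper).

    Returns the extracted text. Requires a running tika-server.
    """
    headers = {"Accept": "text/plain"}
    headers.update(pdf_config_headers(options))
    conn = http.client.HTTPConnection(host, port)
    try:
        conn.request("PUT", "/tika", body=document_bytes, headers=headers)
        return conn.getresponse().read().decode("utf-8", errors="replace")
    finally:
        conn.close()


# Headers for one request; a second request could use different options
# without restarting or reconfiguring the server.
example_headers = pdf_config_headers(
    {"extractInlineImages": True, "OcrStrategy": "ocr_only"}
)
```

With a server running, parse_pdf(open("some.pdf", "rb").read(), {"OcrStrategy": "no_ocr"}) would apply that setting to this one file only, which is the runtime-versus-creation-time distinction mentioned above.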

So, we could add an NER parse config along the lines of the PDFParserConfig, 
or...

...I regret I can't tell if this is what you're proposing, but we could specify 
a tika-config.xml file via url parameters?  This would add overhead of loading 
the full parser for each parse where you specify your own custom parser.  Or, I 
guess, we could load x many default parsers and name them?

On Tue, Jan 31, 2023 at 5:34 AM Cedric Ulmer <cedric.ul...@francelabs.com> 
wrote:
>
> Hi all,
>
> We are playing with the regex-based detection capabilities of Tika combined 
> with ManifoldCF, and an idea came to our mind. First, the problem: for now, a 
> tika server has only one configuration. Therefore, if we set a regex based 
> entity extraction, it will be applied to all of the documents (for given mime 
> types). So if in ManifoldCF we call the Tika server during a crawling phase, 
> we cannot have different regex rules per crawling job: any job that calls the 
> tika server will be processed the same way.
>
> So here is the idea: wouldn't it be possible to make the call to a 
> tika server configurable via a REST parameter/arguments, where we 
> could set which config we want to use for the current call? Something
> like: ?enableNER=true&NERConfig=regex1
>
> Regards,
>
> Cédric
> CEO
> France Labs - Your knowledge, now
> Datafari Enterprise Search
>

