Re: RE: French Language Detection with Tika

2017-05-12 Thread Luís Filipe Nassif
Hi Claude, no, Tika does not store file system permissions as metadata. That
info is not always present, for example when files are collected from the web,
databases, etc.

PS: the optimaize language-detector currently detects 71 languages; I am not
sure whether the version in Tika is up to date. For paragraph-aware language
detection, they point to https://code.google.com/p/cld2/, but that is not Java.

Luis

On May 12, 2017 3:34 PM, "Claude Garceau" <
claude.garceau.vi...@gmail.com> wrote:

> Thanks a lot Timothy, it leads me to another question... does Tika treat
> the permissions set on the NTFS Security tab of the directory tree or of
> the file as metadata?
>


Re: RE: French Language Detection with Tika

2017-05-12 Thread Claude Garceau
Thanks a lot Timothy, it leads me to another question... does Tika treat the 
permissions set on the NTFS Security tab of the directory tree or of the file 
as metadata?


Re: French Language Detection with Tika

2017-05-12 Thread Ken Krugler
Hi Claude,

> On May 12, 2017, at 10:59am, Claude Garceau wrote:
> 
> Thank you Ken, really useful reply... I guess that a high false negative rate 
> (silence) does much more harm than a high false positive rate (noise).
> 
> I would say that more than 90% of the targeted documents are in French. 
> Some may contain a few paragraphs in English, but none are split half-half 
> French-English within the same document. Most of them are longer than 2 
> pages, so I guess (tell me if not) that they have enough characters for the 
> detector to operate with fair precision?

Yes, that’s more than enough. An example of short text would be tweets (e.g. < 
100 bytes).

> Another question... can we apply Tika's French detector and English 
> detector at the same time to the same document, so that it is parsed with 
> both detectors on?

There’s only one detector, and it returns the “best” language. We currently 
don’t support paragraph-by-paragraph detection, though that would be very cool. 
The main problem is that we’d have to buffer up text before emitting it, so that 
we could send out the element with its “lang” attribute before emitting the 
text.

If that’s important, though, it wouldn’t be hard to create your own version of 
the BodyHandler that does this.
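
The buffering idea can be sketched as follows. This is not Tika code: the
class, its method names, the <p> output element and the toy stop-word detector
are all hypothetical stand-ins, chosen only to show why a paragraph has to be
held back until its language is known. A real handler would plug in an actual
language detector instead of toyDetector.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch: holds one paragraph in memory so its language is known before output. */
final class BufferingParagraphTagger {
    // Tiny illustrative stop-word lists; a real detector would replace these.
    private static final Set<String> FR =
            new HashSet<>(Arrays.asList("le", "la", "les", "et", "est", "des", "une"));
    private static final Set<String> EN =
            new HashSet<>(Arrays.asList("the", "and", "is", "of", "to", "on"));

    private final Function<String, String> detector;
    private final StringBuilder buffer = new StringBuilder();
    private final StringBuilder out = new StringBuilder();

    BufferingParagraphTagger(Function<String, String> detector) {
        this.detector = detector;
    }

    /** Called when a paragraph begins: nothing is emitted yet. */
    void startParagraph() {
        buffer.setLength(0);
    }

    /** Text may arrive in several chunks; all of them are buffered. */
    void characters(String chunk) {
        buffer.append(chunk);
    }

    /** Only now is the language known, so the opening tag can carry the lang attribute. */
    void endParagraph() {
        String text = buffer.toString();
        String lang = detector.apply(text);
        out.append("<p lang=\"").append(lang).append("\">")
           .append(escape(text)).append("</p>\n");
        buffer.setLength(0);
    }

    String result() {
        return out.toString();
    }

    /** Toy stand-in detector: counts stop-word hits, returns "fr", "en" or "und" on a tie. */
    static String toyDetector(String text) {
        int fr = 0;
        int en = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (FR.contains(token)) fr++;
            if (EN.contains(token)) en++;
        }
        if (fr > en) return "fr";
        if (en > fr) return "en";
        return "und";
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        BufferingParagraphTagger tagger =
                new BufferingParagraphTagger(BufferingParagraphTagger::toyDetector);
        tagger.startParagraph();
        tagger.characters("Le chat est ");
        tagger.characters("sur la table.");
        tagger.endParagraph();
        tagger.startParagraph();
        tagger.characters("The cat is on the table.");
        tagger.endParagraph();
        System.out.print(tagger.result());
    }
}
```

The cost of this approach is memory and latency proportional to the longest
paragraph, which is the buffering trade-off described above.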

— Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





Re: French Language Detection with Tika

2017-05-12 Thread Claude Garceau

Thank you Ken, really useful reply... I guess that a high false negative rate 
(silence) does much more harm than a high false positive rate (noise).

I would say that more than 90% of the targeted documents are in French. Some 
may contain a few paragraphs in English, but none are split half-half 
French-English within the same document. Most of them are longer than 2 pages, 
so I guess (tell me if not) that they have enough characters for the detector 
to operate with fair precision?

Another question... can we apply Tika's French detector and English detector 
at the same time to the same document, so that it is parsed with both 
detectors on?


RE: French Language Detection with Tika

2017-05-11 Thread Allison, Timothy B.
> 1) Is Tika able to extract and parse the security of the document collected? 
> Can it extract authorization on the file it parses? I guess Nutch can 
> collect these but I have not seen evidence of that. We need this because we 
> have to apply security at the document level (not just at the index or 
> repository level): this is content that should not be seen unless someone is 
> authorized to do so.

Is this stored within each document, or is it stored external to the 
document? If internal, depending on the file format, we may already be 
extracting it. If we aren’t extracting it, please open a ticket with an 
example (dummy) file, and we’ll add that.
If external to the document, no, this is not part of what Tika can do, but you 
could add it to the Metadata object before parsing the document... if that would 
be of any convenience in your workflow.
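
A minimal sketch of the "collect it yourself, then attach it" approach, using
only java.nio. Assumptions: a plain Map stands in for the Tika Metadata object
(each entry would then be copied onto it before parsing), and the "fs:..." key
names are invented for illustration. The ACL view is typically available on
NTFS, the POSIX view on Unix-like file systems; whichever is unsupported is
simply skipped.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.AclEntry;
import java.nio.file.attribute.AclFileAttributeView;
import java.nio.file.attribute.PosixFileAttributeView;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper: gathers file-system security info outside of Tika. */
final class FilePermissionMetadata {

    /** Returns name/value pairs describing owner and permissions of the given file. */
    static Map<String, String> collect(Path file) throws IOException {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("fs:path", file.toAbsolutePath().toString());

        try {
            metadata.put("fs:owner", Files.getOwner(file).getName());
        } catch (UnsupportedOperationException e) {
            // The file system has no notion of an owner: leave the key out.
        }

        // ACL entries (e.g. NTFS); the view is null where ACLs are not supported.
        AclFileAttributeView acl =
                Files.getFileAttributeView(file, AclFileAttributeView.class);
        if (acl != null) {
            metadata.put("fs:acl", acl.getAcl().stream()
                    .map(AclEntry::toString)
                    .collect(Collectors.joining("; ")));
        }

        // POSIX permission bits (e.g. rw-r--r--); null on non-POSIX file systems.
        PosixFileAttributeView posix =
                Files.getFileAttributeView(file, PosixFileAttributeView.class);
        if (posix != null) {
            metadata.put("fs:posix-permissions",
                    PosixFilePermissions.toString(posix.readAttributes().permissions()));
        }
        return metadata;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("permission-demo", ".txt");
        try {
            collect(tmp).forEach((key, value) -> System.out.println(key + " = " + value));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

This only covers files read from a local or mounted file system; content
fetched from a web server or a document management system would need its
permissions from that system's own API.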

From: Claude Garceau [mailto:claude.garceau.vi...@gmail.com]
Sent: Wednesday, May 10, 2017 3:38 PM
To: user@tika.apache.org
Subject: French Language Detection with Tika


Let me tell you about my concept, the big picture of where I want to
 

We want to collect content from a document management system (Nuxeo), an 
intranet (Drupal) and files from a file system (shared drives) in order to make 
it retrievable by means of a search engine. All of these sources hold internal 
information for an internal audience; this is unstructured content (documents 
and web pages). We want to use Nutch as the crawler on these sources. Then Tika 
would extract and format the data and commit it to Elasticsearch (or Solr). We 
then index all of the content in Elasticsearch and make it available through a 
web application, probably an SPA built on Angular 2. We would make API calls 
from this SPA to Elasticsearch to formulate queries and get results.
 

Now I have 2 questions:  

1) Is Tika able to extract and parse the security of the document collected? 
Can it extract authorization on the file it parses? I guess Nutch can collect 
these but I have not seen evidence of that. We need this because we have to 
apply security at the document level (not just at the index or repository 
level): this is content that should not be seen unless someone is authorized 
to do so.

2) Does Tika efficiently handle French language detection and extraction? 
This is a critical capability for my project.

I am currently performing a market survey based on functional and technical 
criteria to select the best tools to fit my concept. So far ES, Tika and Nutch 
are well positioned! I am not sure I will have time to test the French ability 
of Tika. If you are able to refer me to someone or to a reference in that 
respect, I'll have a better degree of confidence in my recommendation.

Best Regards


Re: French Language Detection with Tika

2017-05-10 Thread Ken Krugler
Hi Claude,

I can’t speak to your first question, but I’ve been involved in language 
detection.

By “efficiency” I assume you mean false positive and false negative rates for 
French documents, yes?

Tika is using the “language-detector” project, which has a false negative rate 
of about 0.2%, and a false positive rate of about 0.001%.

But this is on a clean EU dataset for 17 languages. If the document text is 
short, or contains multiple languages, or has markup left in it, then the 
results will be worse.
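
For reference, the two rates can be worked out as below. The counts are
invented purely so that the arithmetic lands on the figures quoted above; they
are not taken from the language-detector benchmark, and the class name is
hypothetical.

```java
import java.util.Locale;

/** Illustrative arithmetic only: how false negative and false positive rates are defined. */
final class DetectionRates {

    /** Share of truly French documents that the detector failed to label French. */
    static double falseNegativeRate(long truePositives, long falseNegatives) {
        return (double) falseNegatives / (truePositives + falseNegatives);
    }

    /** Share of non-French documents that the detector wrongly labelled French. */
    static double falsePositiveRate(long falsePositives, long trueNegatives) {
        return (double) falsePositives / (falsePositives + trueNegatives);
    }

    public static void main(String[] args) {
        // Made-up counts: 1,000 French documents, 2 of them missed.
        double fnr = falseNegativeRate(998, 2);
        // Made-up counts: 100,000 non-French documents, 1 labelled French.
        double fpr = falsePositiveRate(1, 99_999);
        System.out.println(String.format(Locale.ROOT,
                "false negative rate: %.3f%%", fnr * 100));
        System.out.println(String.format(Locale.ROOT,
                "false positive rate: %.3f%%", fpr * 100));
    }
}
```

In search terms, a false negative means a French document is not tagged as
French (silence), while a false positive means a non-French document is tagged
as French (noise).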

— Ken
