Re: RE: French Language Detection with Tika
Hi Claude,

No, Tika does not store file system permissions as metadata. That information is not always present, for example when files are collected from the web, databases, etc.

PS: The optimaize language-detector currently detects 71 languages; I am not sure whether the version used in Tika is up to date. For paragraph-aware language detection, the project points to https://code.google.com/p/cld2/ but that is not Java.

Luis

On May 12, 2017 at 3:34 PM, "Claude Garceau" <claude.garceau.vi...@gmail.com> wrote:
> Thanks a lot Timothy. That leads me to another question: does Tika treat
> as metadata the permissions that are set on the NTFS Security tab of the
> directory tree or of the file?
Re: RE: French Language Detection with Tika
Thanks a lot Timothy. That leads me to another question: does Tika treat as metadata the permissions that are set on the NTFS Security tab of the directory tree or of the file?
Re: French Language Detection with Tika
Hi Claude,

> On May 12, 2017, at 10:59am, Claude Garceau wrote:
>
> Thank you Ken, really useful reply... I guess that a high false negative rate
> (silence) does much more harm than a high false positive rate (noise).
>
> I would say that more than 90% of the targeted documents are in French,
> although they might have some paragraphs in English, but they are not
> half-and-half French-English within the same document. Most of them have more
> than 2 pages, so I guess (you can tell me if not) there are enough characters
> for the detector to operate with fair enough precision?

Yes, that's more than enough. An example of short text would be tweets (e.g. < 100 bytes).

> Another question... can we assign, at the same time, Tika's French
> detector and the English detector to the same document being parsed, so that
> it is parsed with both detectors on?

There's only one detector, and it returns the "best" language.

We currently don't support paragraph-by-paragraph detection, though that would be very cool. The main problem is that we'd have to buffer up text before emitting it, so that we could send out the element with the "lang" attribute before emitting the text.

If that's important, though, it wouldn't be hard to create your own version of the BodyHandler that does this.

-- Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
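The buffering Ken describes can be sketched as follows. This is only an illustration under stated assumptions: ParagraphLangTagger, characters(), endParagraph() and toyDetect() are hypothetical names, not part of Tika, and the stop-word counter merely stands in for a real detector such as the one Tika uses. The point it shows is that the opening element cannot be written until the whole paragraph has been seen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch, not a Tika class: buffers one paragraph at a time so the
// language is known before the opening <p lang="..."> element is emitted.
class ParagraphLangTagger {
    // Tiny stop-word lists used only by the toy detector below (assumption).
    private static final Set<String> FR =
            Set.of("le", "la", "les", "et", "est", "des", "une", "un", "dans", "que");
    private static final Set<String> EN =
            Set.of("the", "and", "is", "of", "in", "to", "that", "a", "it", "for");

    private final Function<String, String> detector;
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> emitted = new ArrayList<>();

    ParagraphLangTagger(Function<String, String> detector) {
        this.detector = detector;
    }

    // Text may arrive in arbitrary chunks; nothing is emitted yet.
    void characters(String chunk) {
        buffer.append(chunk);
    }

    // Only at the end of the paragraph can the language be detected and the
    // element written. No XML escaping is done in this sketch.
    void endParagraph() {
        String text = buffer.toString().trim();
        buffer.setLength(0);
        if (text.isEmpty()) {
            return;
        }
        emitted.add("<p lang=\"" + detector.apply(text) + "\">" + text + "</p>");
    }

    List<String> output() {
        return emitted;
    }

    // Toy stand-in for a real language detector: counts known stop words.
    static String toyDetect(String text) {
        int fr = 0;
        int en = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (FR.contains(word)) fr++;
            if (EN.contains(word)) en++;
        }
        if (fr > en) return "fr";
        if (en > fr) return "en";
        return "und"; // undetermined
    }
}
```

In a real handler the detector function would wrap an actual language detector, and the buffered paragraph would be forwarded to the downstream handler rather than collected in a list. The cost is memory proportional to the longest paragraph, which is the trade-off Ken mentions.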
Re: French Language Detection with Tika
Thank you Ken, really useful reply... I guess that a high false negative rate (silence) does much more harm than a high false positive rate (noise).

I would say that more than 90% of the targeted documents are in French, although they might have some paragraphs in English, but they are not half-and-half French-English within the same document. Most of them have more than 2 pages, so I guess (you can tell me if not) there are enough characters for the detector to operate with fair enough precision?

Another question... can we assign, at the same time, Tika's French detector and the English detector to the same document being parsed, so that it is parsed with both detectors on?
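To put the silence-versus-noise comparison into numbers, here is a rough back-of-the-envelope sketch. It assumes the rates Ken quoted for the language-detector project (about 0.2% false negatives and about 0.001% false positives, measured on a clean dataset), takes Claude's 90% French share as exact, and invents a corpus size of 100,000 documents purely for illustration. It also assumes the false positive rate applies to the non-French documents. Real-world figures would likely be worse, as Ken notes for short, mixed-language or markup-laden text.

```java
// Hypothetical helper for estimating expected detection errors; the class and
// method names are illustrative only.
class DetectionErrorEstimate {
    // French documents expected to be missed (false negatives, "silence").
    static double expectedMissed(long totalDocs, double frenchShare, double falseNegativeRate) {
        return totalDocs * frenchShare * falseNegativeRate;
    }

    // Non-French documents expected to be labelled French (false positives, "noise").
    static double expectedMislabelled(long totalDocs, double frenchShare, double falsePositiveRate) {
        return totalDocs * (1.0 - frenchShare) * falsePositiveRate;
    }

    public static void main(String[] args) {
        long totalDocs = 100_000;   // assumed corpus size, not from the thread
        double frenchShare = 0.90;  // "more than 90%" in the message above
        double fnr = 0.002;         // about 0.2%
        double fpr = 0.00001;       // about 0.001%
        System.out.printf("missed French docs: %.1f%n", expectedMissed(totalDocs, frenchShare, fnr));
        System.out.printf("mislabelled docs:   %.1f%n", expectedMislabelled(totalDocs, frenchShare, fpr));
    }
}
```

Under these assumptions the estimate is about 180 missed French documents against roughly 0.1 mislabelled ones, so silence would dominate noise by a wide margin.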
RE: French Language Detection with Tika
> 1) Is Tika able to extract and parse the security of the document collected? Can it extract authorization on the file it parses? I guess Nutch can collect these but I have not seen evidence of that. We need this because we have to apply the security at the document level (not just at the index or repository level) because this is about content that should not be seen unless someone is authorized to do so.

Is this stored within each document, or is it stored external to the document? If internal, depending on the file format, we may already be extracting it. If we aren't extracting it, please open a ticket with an example (dummy) file, and we'll add that. If external to the document, no, this is not part of what Tika can do, but you could add it to the Metadata object before parsing the document, if that would be of any convenience in your workflow.

From: Claude Garceau [mailto:claude.garceau.vi...@gmail.com]
Sent: Wednesday, May 10, 2017 3:38 PM
To: user@tika.apache.org
Subject: French Language Detection with Tika

Let me tell you about my concept, the big picture of where I want to go.

We want to collect content from a document management system (Nuxeo), an intranet (Drupal) and files from the file system (shared drives) so that it is retrievable by means of a search engine. All of these sources are internal information for an internal audience; this is about unstructured content (documents and web pages). We want to use Nutch as the crawler on these sources. Then Tika would extract and format the data and commit it to Elasticsearch (or Solr). We then index all of the content in Elasticsearch and make it available through a web application, probably an SPA built on Angular 2. We would make API calls from this SPA to Elasticsearch to formulate queries and get results.

Now I have 2 questions:

1) Is Tika able to extract and parse the security of the document collected? Can it extract authorization on the file it parses? I guess Nutch can collect these but I have not seen evidence of that. We need this because we have to apply the security at the document level (not just at the index or repository level) because this is about content that should not be seen unless someone is authorized to do so.

2) Does Tika efficiently handle French language detection and extraction? This is a critical capability for my project.

I am currently performing a market survey based on functional and technical criteria to select the best tools that would fit my concept. So far ES, Tika and Nutch are well positioned! I am not sure if I will have time to test the French ability in Tika. If you are able to refer me to someone or to a reference in that respect, I'll have a better degree of confidence in my recommendation.

Best Regards
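For permissions stored outside the document, the suggestion above of adding them to the Metadata object before parsing might start from something like the following. It is a minimal sketch using only the JDK's java.nio.file API to read the owner and either the ACL entries (for example on NTFS) or the POSIX permission string. The class name ExternalPermissions and the keys "acl:owner" and "acl:entries" are made up for this example and are not Tika conventions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.AclEntry;
import java.nio.file.attribute.AclFileAttributeView;
import java.nio.file.attribute.PosixFileAttributeView;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects file system permissions as plain key/value
// pairs that a caller could later copy into a metadata container.
class ExternalPermissions {
    // Illustrative key names (assumption), not defined by Tika.
    static final String OWNER_KEY = "acl:owner";
    static final String ENTRIES_KEY = "acl:entries";

    static Map<String, String> read(Path file) {
        Map<String, String> out = new LinkedHashMap<>();
        try {
            out.put(OWNER_KEY, Files.getOwner(file).getName());

            // ACL view: available on file systems such as NTFS; null elsewhere.
            AclFileAttributeView acl =
                    Files.getFileAttributeView(file, AclFileAttributeView.class);
            if (acl != null) {
                StringBuilder sb = new StringBuilder();
                for (AclEntry entry : acl.getAcl()) {
                    if (sb.length() > 0) sb.append("; ");
                    sb.append(entry.type()).append(' ')
                      .append(entry.principal().getName()).append(' ')
                      .append(entry.permissions());
                }
                out.put(ENTRIES_KEY, sb.toString());
                return out;
            }

            // Fallback for POSIX file systems: a string such as "rw-r--r--".
            PosixFileAttributeView posix =
                    Files.getFileAttributeView(file, PosixFileAttributeView.class);
            if (posix != null) {
                out.put(ENTRIES_KEY,
                        PosixFilePermissions.toString(posix.readAttributes().permissions()));
            }
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each entry of the returned map could then be copied into Tika's Metadata with metadata.set(key, value) before the parser is called, so that the permissions travel with the extracted text to the indexer. How ACL principals map onto the search application's users is a separate design question that this sketch does not address.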
Re: French Language Detection with Tika
Hi Claude,

I can't speak to your first question, but I've been involved in language detection.

By "efficiency" I assume you mean false positive and false negative rates for French documents, yes?

Tika is using the "language-detector" project, which has a false negative rate of about 0.2%, and a false positive rate of about 0.001%.

But this is on a clean EU dataset for 17 languages. If the document text is short, or contains multiple languages, or has markup left in it, then the results will be worse.

-- Ken

> On May 10, 2017, at 12:38pm, Claude Garceau wrote:
>
> 1) Is Tika able to extract and parse the security of the document collected?
> Can it extract authorization on the file it parses?
>
> 2) Does Tika efficiently handle French language detection and extraction?
> This is a critical capability for my project.