Yes.
On 08-11-2013 17:41, Ralf R. Kotowski wrote:
We are talking about this plug-in, correct?
http://wiki.apache.org/nutch/LanguageIdentifierPlugin
-----Original Message-----
From: ilhami Kalkan [mailto:[email protected]]
Sent: Thursday, November 07, 2013 10:29 AM
To: [email protected]
Subject: Re: Language identification
Hi Ralf,
Short answer is no.
This plugin runs after the language-identifier plugin, because the
language-identifier plugin stores the detected language in the page
metadata and this plugin reads that value to filter or accept pages by
language during the parse phase.
The language-identifier plugin takes the lang value from the header or
decides it from the n-grams of the page content.
The language-filter plugin reads the "language.filter.languages" entries,
which must be ISO-639 language codes, and matches them against the
metadata lang. Page languages like en-us were rejected. Thanks for the
heads-up. I added the necessary check to the patch to prevent this case.
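As a rough illustration of the matching described above (this is not the code in the NUTCH-1663 patch), the check can be sketched in Python. The names primary_subtag and is_accepted are hypothetical, and reducing a tag such as en-us to its primary subtag before comparing is an assumption about what the added check does.

```python
def primary_subtag(lang):
    """Reduce a tag such as 'en-us' or 'en_US' to its primary subtag 'en'."""
    return lang.strip().lower().replace("_", "-").split("-")[0]


def is_accepted(page_lang, allowed_codes):
    """Return True if the page language matches one of the allowed ISO-639 codes."""
    if not page_lang:
        return False  # assumption: pages with no detected language are rejected
    allowed = {code.strip().lower() for code in allowed_codes}
    return primary_subtag(page_lang) in allowed


# Hypothetical value for "language.filter.languages"
allowed = ["en", "ja"]
print(is_accepted("en-us", allowed))  # True
print(is_accepted("de", allowed))     # False
```

Without the normalisation step, "en-us" would be compared directly against "en" and rejected, which is the case reported earlier in the thread.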
On 06-11-2013 23:52, Ralf R. Kotowski wrote:
Hi,
I have run several passes. I no longer get the bulk of the foreign-language
sites I used to, but I also don't get some others that I am supposed to.
Does this plug-in work through the HTML header? I ask because I got one of
the pages that is not supposed to be there, with this header:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-us">
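For reference, the language declared in a header like this one can be read as in the Python sketch below. It is not the plug-in's code, only an illustration using the standard library; the names LangAttributeParser and declared_language are hypothetical.

```python
from html.parser import HTMLParser


class LangAttributeParser(HTMLParser):
    """Records the lang (or xml:lang) attribute of the first <html> tag."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased from HTMLParser.
        if tag == "html" and self.lang is None:
            attributes = dict(attrs)
            self.lang = attributes.get("lang") or attributes.get("xml:lang")


def declared_language(html_text):
    """Return the language declared on the <html> tag, or None if absent."""
    parser = LangAttributeParser()
    parser.feed(html_text)
    return parser.lang


header = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en-us">'
)
print(declared_language(header))  # en-us
```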
-----Original Message-----
From: ilhami Kalkan [mailto:[email protected]]
Sent: Wednesday, November 06, 2013 9:08 AM
To: [email protected]
Subject: Re: Language identification
Hi Ralf,
language-identifier-agmlab is my test plugin name. I fixed the patch.
NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
On 06-11-2013 00:50, Ralf R. Kotowski wrote:
I get the following error in the logs:
WARN plugin.PluginRepository - Missing dependency
language-identifier-agmlab for plugin language-filter
-----Original Message-----
From: ilhami Kalkan [mailto:[email protected]]
Sent: Tuesday, November 05, 2013 10:36 AM
To: [email protected]
Subject: Re: Language identification
Hi Ralf,
I patched the language-filter plugin to filter or accept pages in the
specified languages during the parse phase.
NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
On 02-11-2013 22:05, Julien Nioche wrote:
Ralf,
The parameter http.accept.language tells the servers you are hitting that
they should provide you the content in the languages you specified but that
does not give you any guarantees nor allows you to filter the content. Look
at the languageidentifier plugin as a starting point, then you could add a
custom mapreduce job to remove the pages which are not in the languages of
interest.
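To make the point about the header concrete: the value is only a ranked list of language ranges sent to the server, and a "*" entry still accepts any language. The Python sketch below parses such a value; parse_accept_language is a hypothetical helper, not part of Nutch.

```python
def parse_accept_language(header_value):
    """Split an Accept-Language value into (language range, q) pairs, highest q first."""
    preferences = []
    for item in header_value.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue  # skip empty items such as a trailing comma
        quality = 1.0  # q defaults to 1 when omitted
        for param in parts[1:]:
            if param.lower().startswith("q="):
                quality = float(param[2:])
        preferences.append((parts[0].lower(), quality))
    # sorted() is stable, so ranges with equal q keep their original order
    return sorted(preferences, key=lambda pair: pair[1], reverse=True)


value = "ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3"
print(parse_accept_language(value))
# [('ja-jp', 1.0), ('en-us', 1.0), ('en-gb', 1.0), ('en', 0.7), ('*', 0.3)]
```

Nothing in this list is enforced on the response, so any filtering has to happen on the crawler side.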
HTH
Julien
On 2 November 2013 17:15, Ralf R. Kotowski <[email protected]> wrote:
Hi,
What is the correct process to only store documents in a desired
language?
I'm currently doing this:
<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field.
  This allows selecting non-English language as default one to retrieve.
  It is a useful setting for search engines build for certain national group.
  </description>
</property>
I'm using a seed.txt with URLs I know are in the language I want, but as
the crawl grows I seem to be getting more and more docs in other languages.
Thanks in advance