Re: Language identification

ilhami Kalkan Fri, 08 Nov 2013 07:49:41 -0800

Yes.


On 08-11-2013 17:41, Ralf R. Kotowski wrote:

We are talking about this plug-in, correct?


http://wiki.apache.org/nutch/LanguageIdentifierPlugin



-----Original Message-----
From: ilhami Kalkan [mailto:[email protected]]
Sent: Thursday, November 07, 2013 10:29 AM
To: [email protected]
Subject: Re: Language identification

Hi Ralf,

Short answer is no.
This plugin runs after the language-identifier plugin, because the
language-identifier plugin sets the language in the metadata and this
plugin reads that value to filter or accept pages during the parse phase.
The language-identifier plugin takes the lang value from the header or
determines it from the n-grams of the page content.
The language-filter plugin reads the "language.filter.languages" entries,
which must be ISO-639 language codes, and matches them against the metadata
lang. Page languages like en-us were rejected. Thanks for the heads-up. I
added the necessary check to the patch to prevent this case.
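
To illustrate the en-us case above, here is a minimal sketch of reducing a
page language tag to its primary ISO-639 subtag before comparing it with the
configured codes. It is not the code from the NUTCH-1663 patch: the class
name, the method names and the comma-separated format of the
"language.filter.languages" value are all assumptions made for this example.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch or of the NUTCH-1663 patch.
public class LanguageTagFilter {

    /** Reduces a tag such as "en-us" or "en_US" to its primary subtag "en". */
    public static String primarySubtag(String tag) {
        if (tag == null) {
            return "";
        }
        String t = tag.trim().toLowerCase(Locale.ROOT);
        int cut = t.length();
        for (int i = 0; i < t.length(); i++) {
            char c = t.charAt(i);
            if (c == '-' || c == '_') {
                cut = i;
                break;
            }
        }
        return t.substring(0, cut);
    }

    /** Parses the configured codes; a comma-separated list is assumed here. */
    public static Set<String> parseAllowed(String configValue) {
        Set<String> allowed = new HashSet<>();
        for (String code : configValue.split(",")) {
            String c = primarySubtag(code);
            if (!c.isEmpty()) {
                allowed.add(c);
            }
        }
        return allowed;
    }

    /** Accepts a page when its primary subtag is one of the allowed codes. */
    public static boolean accepts(String pageLang, Set<String> allowed) {
        return allowed.contains(primarySubtag(pageLang));
    }

    public static void main(String[] args) {
        Set<String> allowed = parseAllowed("en, ja");
        System.out.println(accepts("en-us", allowed)); // true
        System.out.println(accepts("de", allowed));    // false
    }
}
```

With this kind of normalization, a page marked lang="en-us" matches a
configured "en" entry instead of being rejected for not being a bare
ISO-639 code.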


On 06-11-2013 23:52, Ralf R. Kotowski wrote:

Hi,

I have run several passes. I no longer get the bulk of foreign-language
sites I used to, but I also don't get some others that I am supposed to get.

Does this plug-in work through the HTML header? Because I got one of the
ones that are not supposed to be there with this header:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-us">

-----Original Message-----
From: ilhami Kalkan [mailto:[email protected]]
Sent: Wednesday, November 06, 2013 9:08 AM
To: [email protected]
Subject: Re: Language identification

Hi Ralf,

language-identifier-agmlab is my test plugin name. I fixed the patch.

NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>

On 06-11-2013 00:50, Ralf R. Kotowski wrote:

I get following error in the logs:

WARN  plugin.PluginRepository - Missing dependency
language-identifier-agmlab for plugin language-filter

-----Original Message-----
From: ilhami Kalkan [mailto:[email protected]]
Sent: Tuesday, November 05, 2013 10:36 AM
To: [email protected]
Subject: Re: Language identification

Hi Ralf,

I patched the language-filter plugin to filter or accept pages in the
specified languages during the parse phase.

NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>


On 02-11-2013 22:05, Julien Nioche wrote:

Ralf,

The parameter http.accept.language tells the servers you are hitting that
they should provide you the content in the languages you specified, but that
does not give you any guarantees, nor does it allow you to filter the
content. Look at the languageidentifier plugin as a starting point, then you
could add a custom mapreduce job to remove the pages which are not in the
languages of interest.

HTH

Julien
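
As an illustration of why the header is only a preference, here is a minimal
sketch that parses an Accept-Language value into language ranges and quality
values. The class and method names are made up for this example and are not
part of Nutch. It shows that the value used in this thread ends with
"*;q=0.3", which explicitly tells servers that any other language is still
acceptable, only less preferred.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not a Nutch class.
public class AcceptLanguageRanges {

    /** One language range from the header together with its q-value. */
    public static final class Range {
        public final String tag;
        public final double q;

        Range(String tag, double q) {
            this.tag = tag;
            this.q = q;
        }
    }

    /** Splits the header on commas; a missing or malformed q stays at 1.0. */
    public static List<Range> parse(String header) {
        List<Range> ranges = new ArrayList<>();
        for (String part : header.split(",")) {
            String[] pieces = part.trim().split(";");
            String tag = pieces[0].trim().toLowerCase(Locale.ROOT);
            if (tag.isEmpty()) {
                continue;
            }
            double q = 1.0;
            for (int i = 1; i < pieces.length; i++) {
                String p = pieces[i].trim();
                if (p.startsWith("q=")) {
                    try {
                        q = Double.parseDouble(p.substring(2));
                    } catch (NumberFormatException e) {
                        // Malformed weight: keep the default of 1.0.
                    }
                }
            }
            ranges.add(new Range(tag, q));
        }
        return ranges;
    }

    /** True when a "*" range with a non-zero q-value is present. */
    public static boolean acceptsAnyLanguage(String header) {
        for (Range r : parse(header)) {
            if (r.tag.equals("*") && r.q > 0.0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String header = "ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3";
        for (Range r : parse(header)) {
            System.out.println(r.tag + " q=" + r.q);
        }
        System.out.println(acceptsAnyLanguage(header)); // true
    }
}
```

Even without the wildcard, a server is free to ignore the header, so the
filtering still has to happen on the crawler side after language
identification.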



On 2 November 2013 17:15, Ralf R. Kotowski <[email protected]> wrote:

Hi,



What is the correct process to only store documents in a desired language?

I'm currently doing this:



<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>



Using a seed.txt with URLs I know are in the language I want, but as the
crawl grows it seems I'm starting to get more and more docs in other
languages.





Thanks in advance
