Re: Language identification

Julien Nioche Sat, 02 Nov 2013 13:06:25 -0700

Ralf,

The http.accept.language parameter only sets the Accept-Language request
header, which tells the servers you are hitting which languages you would
prefer. It gives you no guarantee about what they return, and it does not
filter the content. Look at the languageidentifier plugin as a starting
point; you could then add a custom MapReduce job to remove the pages that
are not in the languages of interest.
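
As a rough sketch of the check such a filtering job might apply to each page,
assuming the detected language is available as a plain string code (for
example "ja" or "en-US"): the class and method names below are hypothetical
and not part of Nutch, and the Hadoop mapper wiring is left out on purpose.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration only; not a Nutch class. */
public class LanguageFilter {

    // Primary language subtags to keep, e.g. "ja" and "en" (assumed lower case).
    private final Set<String> allowed;

    public LanguageFilter(Set<String> allowedPrimaryTags) {
        this.allowed = allowedPrimaryTags;
    }

    /** Reduces "ja-JP" or "en_us" to the primary subtag "ja" / "en". */
    static String primaryTag(String lang) {
        if (lang == null) {
            return "";
        }
        String s = lang.trim().toLowerCase(Locale.ROOT);
        int cut = s.length();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '-' || c == '_') {
                cut = i;
                break;
            }
        }
        return s.substring(0, cut);
    }

    /** Returns true when the page's detected language is one we want to keep. */
    public boolean keep(String detectedLang) {
        return allowed.contains(primaryTag(detectedLang));
    }

    public static void main(String[] args) {
        LanguageFilter filter = new LanguageFilter(Set.of("ja", "en"));
        for (String lang : new String[] {"ja-JP", "en", "de", null}) {
            System.out.println(lang + " -> " + (filter.keep(lang) ? "keep" : "drop"));
        }
    }
}
```

A mapper would call keep() with whatever language value the identifier stored
for the page and simply not emit the record when it returns false. Pages with
no detected language are dropped here; whether that is the right default
depends on how much you trust the detection.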


HTH

Julien



On 2 November 2013 17:15, Ralf R. Kotowski <[email protected]> wrote:

> Hi,
>
>
>
> What is the correct process to only store documents in a desired language?
>
>
>
> I'm currently doing this:
>
>
>
> <property>
> <name>http.accept.language</name>
> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
> <description>Value of the "Accept-Language" request header field.
> This allows selecting non-English language as default one to retrieve.
> It is a useful setting for search engines build for certain national group.
> </description>
> </property>
>
>
>
> I'm using a seed.txt with URLs I know are in the language I want, but as
> the crawl grows I seem to be getting more and more docs in other
> languages.
>
>
>
>
>
> Thanks in advance
>
>


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
