Ralf,

The http.accept.language parameter only sets the Accept-Language request header. It tells the servers you are hitting which languages you would prefer, but they are free to ignore it, so it gives you no guarantees and does not filter the content you fetch. Look at the languageidentifier plugin as a starting point; you could then add a custom MapReduce job to remove the pages that are not in the languages of interest.
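To make the filtering step concrete, here is a rough sketch of the kind of logic such a job's map step could apply. It is plain Python, not Nutch or Hadoop code, and it assumes each record is a (url, lang) pair where lang is whatever code the language identification step detected. The record layout and the names WANTED_LANGUAGES, primary_subtag and filter_by_language are made up for illustration.

```python
from typing import Iterable, Iterator, Optional, Set, Tuple

# Assumption: the languages to keep, chosen here to mirror the
# Accept-Language value in the original question.
WANTED_LANGUAGES: Set[str] = {"ja", "en"}

Record = Tuple[str, Optional[str]]  # hypothetical layout: (url, detected language)


def primary_subtag(lang: Optional[str]) -> Optional[str]:
    """Reduce a code such as 'ja-JP' or 'en_US' to its primary subtag ('ja', 'en')."""
    if not lang:
        return None
    return lang.replace("_", "-").split("-")[0].strip().lower() or None


def filter_by_language(
    records: Iterable[Record], wanted: Set[str] = WANTED_LANGUAGES
) -> Iterator[Record]:
    """Yield only the records whose detected language is in the wanted set.

    Records with no detected language are dropped as well.
    """
    for url, lang in records:
        if primary_subtag(lang) in wanted:
            yield url, lang
```

Whether pages with no detected language should be dropped or kept is a policy choice; this sketch drops them.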
HTH

Julien

On 2 November 2013 17:15, Ralf R. Kotowski <[email protected]> wrote:

> Hi,
>
> What is the correct process to only store documents in a desired language?
>
> I'm currently doing this:
>
> <property>
>   <name>http.accept.language</name>
>   <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
>   <description>Value of the "Accept-Language" request header field.
>   This allows selecting non-English language as default one to retrieve.
>   It is a useful setting for search engines build for certain national group.
>   </description>
> </property>
>
> Using a seed.txt with URL's I know are in the language I want, but as the
> crawl grows it seems I'm starting to get more and more docs in other
> languages.
>
> Thnx in advance

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

