Thank you very much. I'm testing it right now. So far, with only http://www.todalaprensa.com/ as a seed, Nutch retrieves just that page and nothing else. With a larger seed list it seems to work. I'm currently on the third pass and it is still running, so I'll let you know how it goes.
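
For reference, the filtering step Julien describes further down the thread (identify each page's language, then drop the pages that are not wanted) comes down to an allow-list check on a language code. Here is a minimal sketch of that check in plain Java. LanguageAllowList and its methods are hypothetical names made up for illustration; they are not part of Nutch or of the NUTCH-1663 patch. The sketch also assumes that an earlier step has already produced a language code for each page, and that pages with no detected language should be dropped.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not a Nutch class): keeps only pages in wanted languages. */
public class LanguageAllowList {
    // Primary language subtags that are allowed, e.g. "ja", "en".
    private final Set<String> allowed = new HashSet<>();

    public LanguageAllowList(String... codes) {
        for (String code : codes) {
            allowed.add(primaryTag(code));
        }
    }

    /** Reduces a code such as "ja-JP" or "en_us" to its primary subtag ("ja", "en"). */
    public static String primaryTag(String code) {
        String c = code.trim().toLowerCase(Locale.ROOT);
        for (int i = 0; i < c.length(); i++) {
            char ch = c.charAt(i);
            if (ch == '-' || ch == '_') {
                return c.substring(0, i);
            }
        }
        return c;
    }

    /**
     * Returns true when the page should be kept.
     * Assumption: a missing or empty language code means "drop".
     */
    public boolean accepts(String detectedLanguage) {
        if (detectedLanguage == null || detectedLanguage.trim().isEmpty()) {
            return false;
        }
        return allowed.contains(primaryTag(detectedLanguage));
    }

    public static void main(String[] args) {
        LanguageAllowList filter = new LanguageAllowList("ja", "en");
        for (String lang : new String[] {"ja-JP", "en", "es", ""}) {
            String verdict = filter.accepts(lang) ? "keep" : "drop";
            System.out.println("'" + lang + "' -> " + verdict);
        }
    }
}
```

Only the primary subtag is compared, so "en-us" and "en-gb" both count as "en". Whether unidentified pages should be kept or dropped is a policy choice; the sketch drops them.
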
-----Original Message-----
From: ilhami Kalkan [mailto:[email protected]]
Sent: Wednesday, November 06, 2013 9:08 AM
To: [email protected]
Subject: Re: Language identification

Hi Ralf,

language-identifier-agmlab is my test plugin name. I fixed the patch.

NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>

On 06-11-2013 00:50, Ralf R. Kotowski wrote:
> I get following error in the logs:
>
> WARN plugin.PluginRepository - Missing dependency
> language-identifier-agmlab for plugin language-filter
>
> -----Original Message-----
> From: ilhami Kalkan [mailto:[email protected]]
> Sent: Tuesday, November 05, 2013 10:36 AM
> To: [email protected]
> Subject: Re: Language identification
>
> Hi Ralf,
>
> I patched language-filter plugin for filter or accept pages which
> specified languages while parse phase.
>
> NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
>
>
> On 02-11-2013 22:05, Julien Nioche wrote:
>> Ralf,
>>
>> The parameter http.accept.language tells the servers you are hitting that
>> they should provide you the content in the languages you specified but that
>> does not give you any guarantees nor allows you to filter the content. Look
>> at the languageidentifier plugin as a starting point, then you could add a
>> custom mapreduce job to remove the pages which are not in the languages of
>> interest.
>>
>> HTH
>>
>> Julien
>>
>>
>> On 2 November 2013 17:15, Ralf R. Kotowski <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> What is the correct process to only store documents in a desired language?
>>>
>>> I'm currently doing this:
>>>
>>> <property>
>>>   <name>http.accept.language</name>
>>>   <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
>>>   <description>Value of the "Accept-Language" request header field.
>>>   This allows selecting non-English language as default one to retrieve.
>>>   It is a useful setting for search engines build for certain national group.
>>>   </description>
>>> </property>
>>>
>>> Using a seed.txt with URL's I know are in the language I want, but as the
>>> crawl grows it seems I'm starting to get more and more docs in other
>>> languages.
>>>
>>> Thnx in advance
>>>
>

