Ah, that patch is for the 2.x branch and won't work on trunk. It can be
ported with relative ease, but it'll take some time.
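
As background to the -p1 versus -p0 point in the quoted message below, here is
a rough sketch of what the strip level does to the file names inside a patch.
It is a simplified illustration, not the actual logic of the patch tool: the
function name strip_components is made up for this example, and it ignores
edge cases such as absolute paths or repeated slashes.

```python
def strip_components(path: str, level: int) -> str:
    """Drop `level` leading path components, roughly like `patch -pN`.

    Simplified illustration only; the real tool handles more edge cases.
    """
    parts = path.split("/")
    if level < 0 or level >= len(parts):
        raise ValueError(f"cannot strip {level} components from {path!r}")
    return "/".join(parts[level:])


# Git-style patches prefix paths with a/ and b/, so one component is stripped.
print(strip_components("a/conf/nutch-default.xml", 1))  # conf/nutch-default.xml

# SVN-style patches usually carry paths relative to the root already.
print(strip_components("conf/nutch-default.xml", 0))    # conf/nutch-default.xml
```

Either way, the resulting path has to exist relative to the directory where
patch is run, which is why it is run from the Nutch root.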
 
-----Original message-----
> From:Ralf R. Kotowski <[email protected]>
> Sent: Tuesday 5th November 2013 14:26
> To: [email protected]
> Subject: RE: Language identification
> 
> OK, when I do this on the SVN trunk I get:
> 
> blackcie@blackcie-VirtualBox:~/nutch-eclipse/2.x$ patch -p1 < language-filter.patch
> patching file conf/nutch-default.xml
> Hunk #1 succeeded at 941 (offset 19 lines).
> patching file ivy/ivy.xml
> Hunk #1 FAILED at 111.
> 1 out of 1 hunk FAILED -- saving rejects to file ivy/ivy.xml.rej
> patching file src/plugin/build.xml
> Hunk #1 succeeded at 30 with fuzz 1.
> Hunk #2 succeeded at 79 with fuzz 1.
> Hunk #3 succeeded at 112 with fuzz 1 (offset 2 lines).
> patching file src/plugin/language-filter/build.xml
> patching file src/plugin/language-filter/ivy.xml
> patching file src/plugin/language-filter/plugin.xml
> patching file src/plugin/language-filter/src/java/org/apache/nutch/filter/lang/LanguageFilter.java
> patching file src/plugin/language-filter/src/test/org/apache/nutch/filter/lang/TestLanguageFilter.java
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]] 
> Sent: Tuesday, November 05, 2013 1:17 PM
> To: [email protected]
> Subject: RE: Language identification
> 
> These are git patches and work differently than what we are used to at the ASF
> (they carry a/ and b/ path prefixes).
> In Nutch's root, use patch -p1 < patchfile for these, or -p0 for the usual
> SVN-based patches.
> 
>  
>  
> -----Original message-----
> > From:Ralf R. Kotowski <[email protected]>
> > Sent: Tuesday 5th November 2013 13:12
> > To: [email protected]
> > Subject: RE: Language identification
> > 
> > Thank you,
> > 
> > I'm still learning how to patch Nutch... not much luck so far...
> > 
> > -----Original Message-----
> > From: ilhami Kalkan [mailto:[email protected]] 
> > Sent: Tuesday, November 05, 2013 10:36 AM
> > To: [email protected]
> > Subject: Re: Language identification
> > 
> > Hi Ralf,
> > 
> > I wrote a patch that adds a language-filter plugin, which filters out or
> > accepts pages in specified languages during the parse phase.
> > 
> > NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
> > 
> > 
> > On 02-11-2013 22:05, Julien Nioche wrote:
> > > Ralf,
> > >
> > > The parameter http.accept.language tells the servers you are hitting that
> > > they should provide you the content in the languages you specified, but
> > > that does not give you any guarantees, nor does it allow you to filter the
> > > content. Look at the languageidentifier plugin as a starting point, then
> > > you could add a custom mapreduce job to remove the pages which are not in
> > > the languages of interest.
> > >
> > > HTH
> > >
> > > Julien
> > >
> > >
> > >
> > > On 2 November 2013 17:15, Ralf R. Kotowski <[email protected]> wrote:
> > >
> > >> Hi,
> > >>
> > >>
> > >>
> > >> What is the correct process to only store documents in a desired language?
> > >>
> > >>
> > >>
> > >> I'm currently doing this:
> > >>
> > >>
> > >>
> > >> <property>
> > >> <name>http.accept.language</name>
> > >> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
> > >> <description>Value of the "Accept-Language" request header field.
> > >> This allows selecting non-English language as default one to retrieve.
> > >> It is a useful setting for search engines build for certain national group.
> > >> </description>
> > >> </property>
> > >>
> > >>
> > >>
> > >> Using a seed.txt with URLs I know are in the language I want, but as the
> > >> crawl grows it seems I'm starting to get more and more docs in other
> > >> languages.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> Thanks in advance
> > >>
> > >>
> > >
> > 
> > 
> > 
> 
> 
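
To make the filtering idea from Julien's message a bit more concrete, here is
a minimal sketch of the "drop pages not in the languages of interest" step. It
is plain Python rather than a Nutch or MapReduce job, and everything in it is
an assumption for illustration: the Page record, its lang field (taken to be a
language code assigned by an earlier identification step), and the
keep_allowed function are hypothetical names, not part of Nutch.

```python
from typing import Iterable, Iterator, NamedTuple, Set


class Page(NamedTuple):
    url: str
    lang: str  # hypothetical: language code set by an earlier identifier step


def keep_allowed(pages: Iterable[Page], allowed: Set[str]) -> Iterator[Page]:
    """Yield only pages whose primary language subtag is in `allowed`."""
    for page in pages:
        # "ja-JP" and "ja" should both match an allowed code of "ja".
        primary = page.lang.lower().split("-")[0]
        if primary in allowed:
            yield page


pages = [
    Page("http://example.jp/", "ja-JP"),
    Page("http://example.de/", "de"),
    Page("http://example.com/", "en"),
]

for page in keep_allowed(pages, {"ja", "en"}):
    print(page.url, page.lang)
```

The point of the sketch is that the decision is made on the detected language
of the fetched content, not on the Accept-Language header that was sent with
the request.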
