Hi - if you need inlinks as input, you need to change how Nutch works. By 
default, inlinks are only used when indexing. So, depending on the scoring 
filter you implement, you also need to process inlinks at that stage (generator 
or updater). This is going to be a costly process because the linkdb can grow 
quickly and become slow to process.
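
To make the rule from the original question concrete, here is a minimal 
plain-Java sketch. It is not Nutch code and uses no Nutch classes: the class 
and method names (SortValueSketch, sortValue, mentionsAnyTerm) are 
hypothetical, and it assumes the detected language and the inlink anchors/URLs 
have already been obtained somehow (for example by reading the linkdb at the 
generator or updater stage). It also assumes that an unknown language leaves 
the score unchanged. The baseline is the default quoted in this thread, 
datum.getScore() * initSort.

```java
import java.util.List;
import java.util.Locale;

/**
 * Plain-Java sketch of the sort-value rule discussed in this thread.
 * No Nutch classes are used; every name here is hypothetical.
 */
final class SortValueSketch {

    private SortValueSketch() {}

    /** True if any inlink text (anchor or URL) contains one of the terms, ignoring case. */
    static boolean mentionsAnyTerm(List<String> inlinkTexts, List<String> terms) {
        for (String text : inlinkTexts) {
            String lower = text.toLowerCase(Locale.ROOT);
            for (String term : terms) {
                // Skip empty terms, which would otherwise match every text.
                if (!term.isEmpty() && lower.contains(term.toLowerCase(Locale.ROOT))) {
                    return true;
                }
            }
        }
        return false;
    }

    /**
     * Baseline is score * initSort (the default quoted in the thread).
     * Halved for a known non-"fr" language, multiplied by ten when an
     * inlink anchor or URL mentions one of the terms.
     * Assumption: a null (unknown) language leaves the value unchanged.
     */
    static float sortValue(float score, float initSort, String lang,
                           List<String> inlinkTexts, List<String> terms) {
        float value = score * initSort;
        if (lang != null && !"fr".equals(lang)) {
            value /= 2.0f;
        }
        if (mentionsAnyTerm(inlinkTexts, terms)) {
            value *= 10.0f;
        }
        return value;
    }
}
```

In a real scoring filter the same arithmetic would sit wherever both the 
language and the inlinks are available; the expensive part is not this 
calculation but getting the inlinks to that point.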

Markus

-----Original message-----
> From: Benjamin Derei <stygm...@gmail.com>
> Sent: Saturday 13th September 2014 14:19
> To: user@nutch.apache.org
> Subject: Re: generatorsortvalue
> 
> Hi,
> 
> But where can I get the inlinks containing the URL and anchors?
> 
> Ben.
> 
> Sent from my iPad
> 
> > On 10 Sept 2014, at 16:02, Jorge Luis Betancourt Gonzalez 
> > <jlbetanco...@uci.cu> wrote:
> > 
> > Hi, 
> > 
> > Actually, the generatorSortValue() method does not have access to the 
> > ParseData object (which holds all the info extracted by the parsers from 
> > the raw content of the web page), as you pointed out. Essentially, this 
> > method is used in the Generator class at a very early stage of the crawling 
> > process, well before the URLs have been fetched or parsed (which is where 
> > the outlinks, i.e. the new links, come from). 
> > 
> > The best approach is to use generatorSortValue(), which assigns the 
> > initial sort score and (as you figured out) will get you where you 
> > want. 
> > 
> > How do you put your ismarked key into the CrawlDatum? Do you put it in the 
> > metadata? Perhaps you could alter the score in the CrawlDatum directly, since 
> > the default implementation of the scoring plugins for this method is: 
> > datum.getScore() * initSort;
> > 
> > Taking into account what you’re trying to do, I think you could use the 
> > passScoreAfterParsing() method of the ScoringFilter interface. This method 
> > gets called by the Fetcher after the parse process is done, so you’ll have 
> > access to the ParseMetadata and you can alter this value. I’m not sure 
> > this will work, but it is at least worth checking out. One open question 
> > with this approach is whether the CrawlDatum score is synchronized with the 
> > Parse/Content score.
> > 
> > Regards,
> > 
> >> On Sep 10, 2014, at 3:24 AM, Benjamin Derei <stygm...@gmail.com> 
> >> wrote:
> >> 
> >> Hello,
> >> 
> >> I'm using Nutch 1.9.
> >> I want to alter the score used for sorting the topN pages for the next 
> >> parsing.
> >> I got it working by modifying the return value of generatorSortValue() in 
> >> a scoring filter plugin.
> >> But this function doesn't have the anchor text among its inputs...
> >> I wrote some inelegant and inefficient code that puts an "ismarked" key in 
> >> the CrawlDatum to record whether the anchor text or URL contains certain 
> >> words... From which function should I do this?
> >> Is there a complete diagram of the data path through each plugin type's 
> >> functions?
> >> 
> >> Benjamin.
> >> 
> >> Sent from my iPad
> >> 
> >>> On 10 Sept 2014, at 04:02, Jorge Luis Betancourt Gonzalez 
> >>> <jlbetanco...@uci.cu> wrote:
> >>> 
> >>> You’ll need to write a couple of plugins to accomplish this. Which 
> >>> version of Nutch are you using? In the first case, is the score you want 
> >>> to alter the score that’s indexed into Solr (i.e. your backend)? 
> >>> 
> >>> Regards,
> >>> 
> >>>> On Sep 9, 2014, at 2:38 PM, Benjamin Derei <stygm...@gmail.com> 
> >>>> wrote:
> >>>> 
> >>>> Hi,
> >>>> 
> >>>> I'm a beginner in Java and Nutch.
> >>>> 
> >>>> I want to steer the crawl with two rules:
> >>>> - if the language identifier plugin detects that a page is not "fr", the
> >>>> score used for sorting should be divided by two.
> >>>> - if an anchor text or link pointing to this page contains certain terms,
> >>>> the score used for sorting should be multiplied by ten.
> >>>> 
> >>>> Any help ?
> >>>> 
> >>>> Benjamin.
> >>> 
> >>> Contest "Mi selfie por los 5". Details at 
> >>> http://justiciaparaloscinco.wordpress.com 
> > 
> 
