Hi all,
I've been working recently on a custom scoring plugin, and I found
some issues with the scoring API that severely limit the way we can
calculate static page scores. I'd like to restart the discussion about
this API, and propose some changes. Any comments or suggestions are welcome!
Currently we use the ScoringFilters API in several places in the code,
in order to adjust page scores during different operations (such as
generating, parsing, updating the db, and indexing). The API can be
divided into two distinct parts - the part that prioritizes pages to be
re-fetched, and the part that calculates static page scores that convey
the idea of query-independent page "importance" or "quality". Of these
two, the second one is more complex and requires better support in the code.
One general comment: all of the issues below could be handled by
special-purpose tools that execute several map-reduce jobs to read the
data, build intermediate databases, and update crawldb / linkdb as
necessary. However, it would be good to investigate if we can come up
with an abstraction that is flexible enough to do this for different
scoring algorithms, and still doesn't require running many additional jobs.
1. Partial link graph issue
---------------------------
Historically speaking, this API was abstracted from the existing Nutch
code, which implemented the OPIC-like scoring. As such, it works on the
premise that it's possible to correctly calculate a static page score
given only the following information:
* page history recorded in CrawlDb,
* new page information from the current segment,
* partial inlink information available in the segments currently being
processed (in updatedb).
This works well enough for OPIC (if even there - the jury's still out ;) ),
but not for many other scoring algorithms, including the most popular
ones like PageRank or HITS, or even the inlink-degree algorithm from
Nutch 0.7. Those algorithms require processing of the complete web graph
information (i.e. the inlink / outlink info) that's been collected so far.
I propose the following change: let's update the LinkDb in the same step
as CrawlDb, so that both are up-to-date with the most current
information after finishing the updatedb step. In terms of the input /
output data involved, this change does increase the amount of data to be
processed, by the size of LinkDb. However, the benefit of this change is
that for each page to be updated we have access to its complete inlink
information, which enables us to use other scoring algorithms that
require this data. Also, we don't have to run invertlinks anymore.
So, the updatedb process would look like this:
INPUT:
- CrawlDb: <Text url, CrawlDatum dbDatum>
- LinkDb: <Text url, Inlinks inlinks>
- segment/crawl_fetch:
<Text url, CrawlDatum fetchDatum>
- segment/crawl_parse:
<Text url, CrawlDatum inlink>
MAP: simply collects the input data in records that are able to
accommodate all this information:
<Text url, <Inlinks oldInlinks, List newInlinks,
CrawlDatum dbDatum, CrawlDatum fetchDatum> >
REDUCE: uses a modified version of CrawlDbReducer, which first collapses
all incoming records to a single record in the above format, i.e. it
collects all incoming records and fills in the slots for dbDatum,
fetchDatum, oldInlinks and newInlinks. Then we pretty much reuse the rest
of the existing logic in CrawlDbReducer - but at the end of the reduce()
we can pass the full inlink information to ScoringFilters.updateDbScore.
Finally, we aggregate all inlink information and output the following
record:
<Text url, <Inlinks newInlinks, CrawlDatum newDbDatum> >
OUTPUT: we use a special OutputFormat that splits output records into
<url, newInlinks> and <url, newDbDatum> and creates new versions of both
CrawlDb and LinkDb.
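To make the shape of the reduce-side collapse concrete, here is a toy
sketch in plain Java - no Hadoop or Nutch types, Strings standing in for
the Inlinks / CrawlDatum writables, and the class and field names are my
own invention, not an actual patch:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the merged updatedb record: one slot per input source,
// mirroring the <oldInlinks, newInlinks, dbDatum, fetchDatum> layout
// proposed above. Strings stand in for Nutch's Inlinks / CrawlDatum.
class MergedRecord {
    String oldInlinks;                            // from LinkDb
    List<String> newInlinks = new ArrayList<>();  // from segment/crawl_parse
    String dbDatum;                               // from CrawlDb
    String fetchDatum;                            // from segment/crawl_fetch

    // First step of the modified CrawlDbReducer: collapse all records
    // seen for one URL into a single record, filling whichever slot
    // each input carries and aggregating the new inlinks.
    static MergedRecord collapse(List<MergedRecord> inputs) {
        MergedRecord out = new MergedRecord();
        for (MergedRecord r : inputs) {
            if (r.oldInlinks != null) out.oldInlinks = r.oldInlinks;
            if (r.dbDatum != null) out.dbDatum = r.dbDatum;
            if (r.fetchDatum != null) out.fetchDatum = r.fetchDatum;
            out.newInlinks.addAll(r.newInlinks);
        }
        return out;
    }
}
```

After the collapse, the real reducer would hand the full inlink picture
to ScoringFilters.updateDbScore and emit the combined record for the
special OutputFormat to split back into CrawlDb and LinkDb parts.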
2. Lack of global properties
----------------------------
Neither CrawlDb, nor LinkDb, nor segments keep around the most basic
global statistics about them. Currently, in order to tell how many pages
we have in the db it's necessary to run a mapred job (readdb -stats) -
even though this information is static and could've been calculated in
advance for a given generation of db-s or for each segment. This
complicates even simple tasks such as readseg -list, and makes it
difficult to keep around global score-related statistics for db-s and
segments.
So, I propose to add a metadata file located in a well-known location
inside the db or segment directory. The file would be based on a single
MapWritable that contains arbitrary key/value pairs, including predefined
ones such as the number of records, last update time etc. We would need
to maintain it for each db and each segment. Each operation that changes
a db or a segment would update this information.
In practical terms, I propose to add static methods to CrawlDbReader,
LinkDbReader and SegmentReader, which can retrieve and / or update this
information.
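A minimal sketch of what such a metadata file could hold - here with a
plain HashMap standing in for the MapWritable, and key names
(recordCount, lastUpdate) that are only placeholders, not a committed
format:

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for the proposed db/segment metadata: a single map of
// arbitrary key/value pairs with a few predefined keys. In Nutch this
// would be one MapWritable serialized to a well-known file inside the
// db or segment directory. Key names are placeholders for illustration.
class DbMetadata {
    static final String RECORD_COUNT = "recordCount";
    static final String LAST_UPDATE  = "lastUpdate";

    final Map<String, String> props = new HashMap<>();

    // Every operation that changes the db or segment would refresh
    // these values, so readers never need a map-reduce job to get them.
    void recordUpdate(long numRecords, long timestamp) {
        props.put(RECORD_COUNT, Long.toString(numRecords));
        props.put(LAST_UPDATE, Long.toString(timestamp));
    }

    long recordCount() {
        return Long.parseLong(props.getOrDefault(RECORD_COUNT, "0"));
    }
}
```

With this in place, readdb -stats (or readseg -list) could answer the
page-count question by reading one small file instead of scanning the db.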
3. Initialization of scoring plugins with global information
------------------------------------------------------------
The current scoring API works only with local properties of a page (I'm
not taking into account plugins that use external information sources -
that's outside the scope of the API). It has no built-in facilities to
collect and calculate global properties useful for PR or HITS
calculation, such as the number of dangling nodes (i.e. pages without
outlinks), their total score, the number of inlinks, etc. Nor does it
have any facility to output such collected global information at the
end of a job, or to initialize scoring plugins with this information
when it already exists.
I propose to add the following methods to scoring plugins, so that they
can modify the job configuration right before the job is started; later
on, the plugins can retrieve this information when scoring filters are
initialized in each task. E.g.:
  public void prepareInjectorConfig(Path crawlDb, Path urls,
      Configuration config);
  public void prepareGeneratorConfig(Path crawlDb, Configuration config);
  public void prepareIndexerConfig(Path crawlDb, Path linkDb,
      Path[] segments, Configuration config);
  public void prepareUpdateConfig(Path crawlDb, Path[] segments,
      Configuration config);
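To illustrate how a plugin might use such a hook, here is a toy version
with a plain Map standing in for Hadoop's Configuration; the property
name "opic.dangling.score" is invented for this example:

```java
import java.util.Map;

// Toy illustration of the proposed prepare*Config hook: before the
// updatedb job starts, the plugin stashes a global statistic in the job
// configuration, where every task's scoring filter can later pick it
// up. A Map stands in for Hadoop's Configuration; the property name
// below is made up for this sketch.
class OpicConfigHook {
    static final String DANGLING_SCORE = "opic.dangling.score";

    // In a real plugin this value would come from CrawlDb metadata;
    // here the caller supplies it directly.
    void prepareUpdateConfig(double totalDanglingScore, Map<String, String> config) {
        config.put(DANGLING_SCORE, Double.toString(totalDanglingScore));
    }

    // Called when the scoring filter is initialized inside a task.
    double initFromConfig(Map<String, String> config) {
        return Double.parseDouble(config.getOrDefault(DANGLING_SCORE, "0"));
    }
}
```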
Example: to properly implement the OPIC scoring, it's necessary to
collect the total number of dangling nodes, and the total score from
these nodes. Then, in the next step it's necessary to spread this total
score evenly among all other nodes in the crawldb. Currently this is not
possible unless we run additional jobs, and create additional files to
keep this data around between the steps. It would be more convenient to
keep this data in CrawlDb metadata (see above) and make relevant values
available in the job context (Configuration).
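The redistribution step itself is simple arithmetic once the global
numbers are available; a sketch (the method name and the sample values
are made up for illustration):

```java
// Sketch of the dangling-score redistribution described above: the
// total score accumulated by dangling nodes is spread evenly among the
// remaining (non-dangling) pages in the crawldb.
class DanglingSpread {
    // Score share each non-dangling page receives.
    static double shareFor(double totalDanglingScore, long totalPages,
                           long danglingPages) {
        long receivers = totalPages - danglingPages;
        if (receivers <= 0) return 0.0;  // nothing to spread to
        return totalDanglingScore / receivers;
    }
}
```

Both inputs (the dangling count and their total score) are exactly the
kind of values that would live in the CrawlDb metadata and be exposed
through the job Configuration.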
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com