Hi all,
I've been working recently on a custom scoring plugin, and I found
some issues with the scoring API that severely limit the way we can
calculate static page scores. I'd like to restart the discussion about
this API, and propose some changes. Any comments or suggestions are welcome!
Currently we use the ScoringFilters API in several places in the code,
in order to adjust page scores during different operations (such as
generating, parsing, updating the db, and indexing). The API can be
divided into two distinct parts - the part that prioritizes pages to be
re-fetched, and the part that calculates static page scores that convey
the idea of query-independent page "importance" or "quality". Of these
two, the second one is more complex and requires better support in the code.
One general comment: all of the issues below could be handled by
special-purpose tools that execute several map-reduce jobs to read the
data, build intermediate databases, and update crawldb / linkdb as
necessary. However, it would be good to investigate if we can come up
with an abstraction that is flexible enough to do this for different
scoring algorithms, and still doesn't require running many additional jobs.
1. Partial link graph issue
---------------------------
Historically speaking, this API was abstracted from the existing Nutch
code, which implemented the OPIC-like scoring. As such, it works on the
premise that it's possible to correctly calculate a static page score
given only the following information:
* page history recorded in CrawlDb,
* new page information from the current segment,
* partial inlink information available in the segments currently being
processed (in updatedb).
This works well enough for OPIC (if even there - the jury's still out ;) ),
but not for many other scoring algorithms, including the most popular
ones like PageRank or HITS, or even the inlink-degree algorithm from
Nutch 0.7. Those algorithms require processing of the complete web graph
information (i.e. the inlink / outlink info) that's been collected so far.
I propose the following change: let's update the LinkDb in the same step
as CrawlDb, so that both are up-to-date with the most current
information after finishing the updatedb step. In terms of the input /
output data involved, this change does increase the amount of data to be
processed, by the size of LinkDb. However, the benefit of this change is
that for each page to be updated we have access to its complete inlink
information, which enables us to use other scoring algorithms that
require this data. Also, we don't have to run invertlinks anymore.
So, the updatedb process would look like this:
INPUT:
- CrawlDb: <Text url, CrawlDatum dbDatum>
- LinkDb: <Text url, Inlinks inlinks>
- segment/crawl_fetch:
<Text url, CrawlDatum fetchDatum>
- segment/crawl_parse:
<Text url, CrawlDatum inlink>
MAP: simply collects the input data in records that are able to
accommodate all this information:
<Text url, <Inlinks oldInlinks, List newInlinks,
CrawlDatum dbDatum, CrawlDatum fetchDatum> >
REDUCE: uses a modified version of CrawlDbReducer, which first collapses
all incoming records to a single record in the above format, i.e. it
collects all incoming records and fills in the slots for dbDatum,
fetchDatum, oldInlinks and newInlinks. Then we pretty much reuse the rest
of the existing logic in CrawlDbReducer - but at the end of the reduce()
we can pass the full inlink information to ScoringFilters.updateDbScore.
Finally, we aggregate all inlink information and output the following
record:
<Text url, <Inlinks newInlinks, CrawlDatum newDbDatum> >
OUTPUT: we use a special OutputFormat that splits output records into
<url, newInlinks> and <url, newDbDatum> and creates new versions of both
CrawlDb and LinkDb.
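To make the shape of the reduce-side collapse concrete, here is a toy
sketch in plain Java - no Hadoop or Nutch types, Strings standing in for
the Inlinks / CrawlDatum writables, and the class and field names are my
own invention, not an actual patch:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the merged updatedb record: one slot per input source,
// mirroring the <oldInlinks, newInlinks, dbDatum, fetchDatum> layout
// proposed above. Strings stand in for Nutch's Inlinks / CrawlDatum.
class MergedRecord {
    String oldInlinks;                            // from LinkDb
    List<String> newInlinks = new ArrayList<>();  // from segment/crawl_parse
    String dbDatum;                               // from CrawlDb
    String fetchDatum;                            // from segment/crawl_fetch

    // First step of the modified CrawlDbReducer: collapse all records
    // seen for one URL into a single record, filling whichever slot
    // each input carries and aggregating the new inlinks.
    static MergedRecord collapse(List<MergedRecord> inputs) {
        MergedRecord out = new MergedRecord();
        for (MergedRecord r : inputs) {
            if (r.oldInlinks != null) out.oldInlinks = r.oldInlinks;
            if (r.dbDatum != null) out.dbDatum = r.dbDatum;
            if (r.fetchDatum != null) out.fetchDatum = r.fetchDatum;
            out.newInlinks.addAll(r.newInlinks);
        }
        return out;
    }
}
```

After the collapse, the real reducer would hand the full inlink picture
to ScoringFilters.updateDbScore and emit the combined record for the
special OutputFormat to split back into CrawlDb and LinkDb parts.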
2. Lack of global properties
----------------------------
Neither CrawlDb, nor LinkDb, nor segments keep around the most basic
global statistics about them. Currently, in order to tell how many pages
we have in the db it's necessary to run a mapred job (readdb -stats) -
even though this information is static and could've been calculated in
advance for a given generation of db-s or for each segment. This
complicates even simple tasks such as readseg -list, and makes it
difficult to keep around global score-related statistics for db-s and
segments.
So, I propose to add a metadata file located in a well-known location
inside the db or segment directory. The file would be based on a single
MapWritable that contains arbitrary key/value pairs, including predefined
ones such as the number of records, last update time etc. We would need
to maintain it for each db and each segment. Each operation that changes
a db or a segment would update this information.
In practical terms, I propose to add static methods to CrawlDbReader,
LinkDbReader and SegmentReader, which can retrieve and / or update this
information.
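A minimal sketch of what such a metadata file could hold - here with a
plain HashMap standing in for the MapWritable, and key names
(recordCount, lastUpdate) that are only placeholders, not a committed
format:

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for the proposed db/segment metadata: a single map of
// arbitrary key/value pairs with a few predefined keys. In Nutch this
// would be one MapWritable serialized to a well-known file inside the
// db or segment directory. Key names are placeholders for illustration.
class DbMetadata {
    static final String RECORD_COUNT = "recordCount";
    static final String LAST_UPDATE  = "lastUpdate";

    final Map<String, String> props = new HashMap<>();

    // Every operation that changes the db or segment would refresh
    // these values, so readers never need a map-reduce job to get them.
    void recordUpdate(long numRecords, long timestamp) {
        props.put(RECORD_COUNT, Long.toString(numRecords));
        props.put(LAST_UPDATE, Long.toString(timestamp));
    }

    long recordCount() {
        return Long.parseLong(props.getOrDefault(RECORD_COUNT, "0"));
    }
}
```

With this in place, readdb -stats (or readseg -list) could answer the
page-count question by reading one small file instead of scanning the db.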
3. Initialization of scoring plugins with global information
------------------------------------------------------------
The current scoring API works only with local properties of a page (I'm
not taking into account plugins that use external information sources -
that's outside the scope of the API). It has no built-in facilities to
collect and calculate global properties useful for PR or HITS
calculation, such as the number of dangling nodes (i.e. pages without
outlinks), their total score, the number of inlinks, etc. Nor does it
have any facility to output such collected global information at the
end of a job, or to initialize scoring plugins with this information
when it already exists.
I propose to add the following methods to scoring plugins, so that they
can modify the job configuration right before the job is started; later
on, the plugins can retrieve this information when scoring filters are
initialized in each task. E.g.:
  public void prepareInjectorConfig(Path crawlDb, Path urls,
      Configuration config);
  public void prepareGeneratorConfig(Path crawlDb, Configuration config);
  public void prepareIndexerConfig(Path crawlDb, Path linkDb,
      Path[] segments, Configuration config);
  public void prepareUpdateConfig(Path crawlDb, Path[] segments,
      Configuration config);
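To illustrate how a plugin might use such a hook, here is a toy version
with a plain Map standing in for Hadoop's Configuration; the property
name "opic.dangling.score" is invented for this example:

```java
import java.util.Map;

// Toy illustration of the proposed prepare*Config hook: before the
// updatedb job starts, the plugin stashes a global statistic in the job
// configuration, where every task's scoring filter can later pick it
// up. A Map stands in for Hadoop's Configuration; the property name
// below is made up for this sketch.
class OpicConfigHook {
    static final String DANGLING_SCORE = "opic.dangling.score";

    // In a real plugin this value would come from CrawlDb metadata;
    // here the caller supplies it directly.
    void prepareUpdateConfig(double totalDanglingScore, Map<String, String> config) {
        config.put(DANGLING_SCORE, Double.toString(totalDanglingScore));
    }

    // Called when the scoring filter is initialized inside a task.
    double initFromConfig(Map<String, String> config) {
        return Double.parseDouble(config.getOrDefault(DANGLING_SCORE, "0"));
    }
}
```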
Example: to properly implement the OPIC scoring, it's necessary to
collect the total number of dangling nodes, and the total score from
these nodes. Then, in the next step it's necessary to spread this total
score evenly among all other nodes in the crawldb. Currently this is not
possible unless we run additional jobs, and create additional files to
keep this data around between the steps. It would be more convenient to
keep this data in CrawlDb metadata (see above) and make relevant values
available in the job context (Configuration).
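The redistribution step itself is simple arithmetic once the global
numbers are available; a sketch (the method name and the sample values
are made up for illustration):

```java
// Sketch of the dangling-score redistribution described above: the
// total score accumulated by dangling nodes is spread evenly among the
// remaining (non-dangling) pages in the crawldb.
class DanglingSpread {
    // Score share each non-dangling page receives.
    static double shareFor(double totalDanglingScore, long totalPages,
                           long danglingPages) {
        long receivers = totalPages - danglingPages;
        if (receivers <= 0) return 0.0;  // nothing to spread to
        return totalDanglingScore / receivers;
    }
}
```

Both inputs (the dangling count and their total score) are exactly the
kind of values that would live in the CrawlDb metadata and be exposed
through the job Configuration.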
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com