Aside from some final changes (comments, documentation, and a few simple API changes to upgrade to the Hadoop 0.17 API), I have completed work on a new scoring and indexing framework. The purpose of this email is to explain the new framework and the tools and jobs it contains.

The new scoring framework is meant to replace the OPIC scoring system currently in Nutch. The new scoring tools can be found under the org.apache.nutch.scoring.webgraph package. This package contains a WebGraph job, a Loops job, a LinkRank job, a ScoreUpdater job, and tools for dumping and reading the different databases.

To use the new scoring you would start by running the WebGraph job. This job takes one or more segments and creates an outlink database, an inlink database containing all inlinks (no maximum number of inlinks), and a node database that holds the number of inlinks and outlinks and a score for each node. The web graph can be updated, and it takes fetch timestamps into account when processing. Links from newer fetches replace links from older fetches, so the links for a given url should always be the most recent for that url and there should be no holdover links that no longer exist. This ensures that as the link structure of web pages changes, the web graph changes to accommodate it, and those changes will be reflected in the later link analysis scores.
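As an example, building the web graph from a segment might look like this (the option names are my best guess at the final command line; check the job's usage output once the patch is applied):

    # Build or update the web graph from a fetched and parsed segment
    bin/nutch org.apache.nutch.scoring.webgraph.WebGraph \
      -webgraphdb crawl/webgraphdb \
      -segment crawl/segments/20080601120000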

Once the WebGraph job is run, the Loops job can be run. This job takes the web graph and walks outlinks in an attempt to find link cycles in the graph. The job is computationally expensive and, after two loops, requires a great deal of space. Because of this it is optional, but it can help in identifying spam pages and link farms, and if it is run its output will be used inside the link analysis tool.
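If you do choose to run it, the invocation would look something like this (again, option names are assumptions until this is committed):

    # Optional: walk outlinks to find link cycles in the web graph
    bin/nutch org.apache.nutch.scoring.webgraph.Loops \
      -webgraphdb crawl/webgraphdb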

The next job is the LinkRank job. This is a link analysis tool, similar to PageRank, that creates a stable score for a page based on its inlinks and their recursive scores. LinkRank runs through an iterative cycle to converge on a link score for each page. The scores are stored in the node database inside the web graph.
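Running it should only require pointing at the web graph, something like:

    # Iterative link analysis; converged scores are written back
    # to the node database inside the web graph
    bin/nutch org.apache.nutch.scoring.webgraph.LinkRank \
      -webgraphdb crawl/webgraphdb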

Once the LinkRank job is run, you can use the ScoreUpdater to update a crawl database with the link score for each url. This allows the generator to produce better topN lists of pages to fetch. In the older indexing methods the crawl database scores were used as an element of the document boost in the index. The new framework includes a scoring-link plugin that allows it to work with the older indexer. That plugin takes the score from the crawl database, after the ScoreUpdater is run, and uses it as the document boost for the url. The new indexing framework does not use the crawl database and takes its scores directly from the node database in the web graph. The ScoreUpdater still needs to be run, though, for the generator to work more efficiently.
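A sketch of the ScoreUpdater invocation (option names assumed):

    # Push link rank scores from the web graph into the crawl database
    bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater \
      -crawldb crawl/crawldb \
      -webgraphdb crawl/webgraphdb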

The other jobs in the scoring package are LinkDumper, LoopReader, NodeDumper, and NodeReader. LinkDumper creates a database that makes it easy to show the inlinks and outlinks of a given page; there is a maximum number of inlinks that can be stored and displayed per url in this database. LoopReader shows the link cycles for a given url and can only be used if the Loops job was run. NodeDumper creates a text file showing the top urls by number of inlinks, number of outlinks, or link score; there is a different option for each, and each must be run separately. This is useful for seeing the top-ranking urls. NodeDumper can be run after LinkRank (to look at scores) or after WebGraph (to look at inlinks or outlinks). NodeReader prints out a single url from the node database.
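For instance, dumping the top 100 urls by link score might look like this, swapping -scores for -inlinks or -outlinks to rank by link counts instead (option names are my best guess):

    # Dump the top 100 urls by link score to a text file
    bin/nutch org.apache.nutch.scoring.webgraph.NodeDumper \
      -webgraphdb crawl/webgraphdb \
      -scores -topn 100 -output crawl/top_scores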

To recap, scoring would now be run in the following order: Inject, Generate, Fetch and Parse, WebGraph, Loops (optional), LinkRank, ScoreUpdater, then indexing.

The next piece is the new indexing framework, which can be found in the org.apache.nutch.indexer.field package. The current indexer is somewhat limited in what databases can be passed into the indexing process. The new indexer removes that limitation and gives more granular control over the fields, their boosts, how they are indexed, and the document boosts included in an index. The new indexing process consists of two phases: first, taking content or analysis output and creating FieldWritable objects; second, aggregating those FieldWritable objects and indexing them. There are three jobs in the field package: BasicFields, AnchorFields, and FieldIndexer.

The BasicFields job replaces the current indexer and the index-basic indexing plugin. This job takes one or more segments, finds the most recently fetched segment for a given url, and creates the appropriate fields for the index. In doing so, BasicFields removes a common form of duplicate inside an index: the same url fetched through multiple redirects.
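A rough sketch of running BasicFields (the option names here are placeholders, not final):

    # Create basic index fields from a segment; the output is a
    # fields database to be passed to the FieldIndexer
    bin/nutch org.apache.nutch.indexer.field.BasicFields \
      -webgraphdb crawl/webgraphdb \
      -segment crawl/segments/20080601120000 \
      -output crawl/basicfields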

BasicFields also handles a unique form of representative url logic. If a url has a representative url due to redirects, and LinkRank has been run, the BasicFields job will compare the url and the representative url by link rank score. The one with the higher score is kept as the url shown in the index, and the other becomes the orig url in the index. This is useful because an index will usually want to display the highest-scoring url (e.g. www.google.com instead of google.com) for a given webpage.

AnchorFields replaces the current index-anchor plugin for Nutch and extends it to allow Nutch to index the best inlink anchors for a given url. In AnchorFields, inlinks to a url are inverted and scored with their parent's inlink score. The anchor text of the highest-scoring urls is then converted into FieldWritables to be indexed. Those anchors are also indexed with a FieldBoost equivalent to their parent page's inlink score. The idea behind this is that higher-scoring pages will have better outlinks, and the text of those links will be more relevant to search.
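Again as a sketch, with placeholder option names:

    # Invert and score inlink anchors into a fields database
    bin/nutch org.apache.nutch.indexer.field.AnchorFields \
      -webgraphdb crawl/webgraphdb \
      -output crawl/anchorfields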

BasicFields and AnchorFields both create databases of fields. Those databases are then passed into the FieldIndexer. The FieldIndexer is responsible for simply taking FieldWritable objects and turning them into a Lucene index. It is also responsible for taking special FieldWritable objects with the name Field.DOC_BOOST and aggregating them together to create a final document boost for a given url. This allows multiple types of analysis to be run before indexing and their results aggregated to create a final document score in the index. All doc boost fields are stored, showing their contribution to the final document score. An example of a doc boost is the link rank score: the BasicFields job takes the link rank score and creates a FieldWritable with the doc boost name.
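Tying it together, indexing the two fields databases might look like this (placeholder option names again):

    # Aggregate the FieldWritable databases into a single Lucene index
    bin/nutch org.apache.nutch.indexer.field.FieldIndexer \
      -fields crawl/basicfields \
      -fields crawl/anchorfields \
      -output crawl/index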

The order for running the new indexing jobs would be BasicFields, AnchorFields, then FieldIndexer. These would need to be run after the new scoring jobs.

So there it is: a new scoring framework and a new indexing framework. I believe these two pieces contribute significantly to improving relevancy in the current Nutch system. They are currently in Jira as NUTCH-635. I hope to finish up comments, documentation, and other small changes within the next few days and move this into the Nutch core. If anybody has any questions or comments, feel free to send them along.

Dennis