Flexible URL normalization
--------------------------

                 Key: NUTCH-365
                 URL: http://issues.apache.org/jira/browse/NUTCH-365
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.9.0
            Reporter: Andrzej Bialecki 
             Fix For: 0.9.0


This patch is a heavily restructured version of the patch in NUTCH-253, so much 
so that I decided to create a separate issue. It changes URL normalization 
from a single selectable class to a flexible, context-aware chain of 
normalization filters.

Highlights:

* rename all *UrlNormalizer* to *URLNormalizer* for consistency.

* use a "chained filter" pattern for running several normalizers in sequence

* the order in which normalizers are executed is defined by the 
"urlnormalizer.order" property, which lists space-separated implementation 
classes. If more normalizers are active than are explicitly named on this 
list, the unnamed ones will be run in random order after the ones specified on 
the list are executed.

* define a set of contexts (or scopes) in which normalizers may be called. Each 
scope can have its own list of normalizers (via the 
"urlnormalizer.scope.<scope_name>" property) and its own order (via the 
"urlnormalizer.order.<scope_name>" property). If any of these properties are 
missing, default settings are used.

* each normalizer may further select among many configurations, depending on 
the context in which it is called, using a modified API:

   URLNormalizer.normalize(String url, String scope);

* if a config for a given scope is not defined, then the default config will be 
used.

* several standard contexts / scopes have been defined, and various 
applications have been modified to try to use the appropriate normalizers in 
their context.

* all JUnit tests have been modified, and run successfully.
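
To make the chaining, ordering and scope fallback described above more 
concrete, here is a minimal sketch. It is not the code from the patch: only the 
normalize(String url, String scope) signature and the three property names come 
from the description above. The NormalizerSketch wrapper, the two example 
normalizers, the register() helper, the use of simple class names as keys and 
the scope names "inject" and "fetcher" are assumptions made for illustration. 
Unnamed normalizers run in registration order here, which is just one possible 
"random" order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NormalizerSketch {
    // Hypothetical stand-in for the scope-aware API quoted in the list above.
    interface URLNormalizer {
        String normalize(String url, String scope);
    }

    // Illustrative normalizer: drops the "#fragment" part in every scope.
    static class FragmentNormalizer implements URLNormalizer {
        public String normalize(String url, String scope) {
            int i = url.indexOf('#');
            return i < 0 ? url : url.substring(0, i);
        }
    }

    // Illustrative normalizer with per-scope configuration: a scope without
    // its own entry falls back to the "default" entry.
    static class QueryStripNormalizer implements URLNormalizer {
        private final Map<String, Boolean> stripByScope = new HashMap<>();

        QueryStripNormalizer() {
            stripByScope.put("default", false);
            stripByScope.put("inject", true); // scope name is an assumption
        }

        public String normalize(String url, String scope) {
            boolean strip = stripByScope.getOrDefault(scope, stripByScope.get("default"));
            int i = url.indexOf('?');
            return (strip && i >= 0) ? url.substring(0, i) : url;
        }
    }

    // Runs the active normalizers for a scope in the configured order.
    static class NormalizerChain {
        private final Map<String, URLNormalizer> active = new LinkedHashMap<>();
        private final Map<String, String> conf;

        NormalizerChain(Map<String, String> conf) { this.conf = conf; }

        void register(String name, URLNormalizer n) { active.put(name, n); }

        List<URLNormalizer> ordered(String scope) {
            Map<String, URLNormalizer> rest = new LinkedHashMap<>(active);
            // Optional per-scope list of normalizers; default is all active ones.
            String names = conf.get("urlnormalizer.scope." + scope);
            if (names != null) {
                rest.keySet().retainAll(Arrays.asList(names.trim().split("\\s+")));
            }
            // Per-scope order if defined, otherwise the default order.
            String order = conf.get("urlnormalizer.order." + scope);
            if (order == null) order = conf.getOrDefault("urlnormalizer.order", "");
            List<URLNormalizer> chain = new ArrayList<>();
            for (String name : order.trim().split("\\s+")) {
                URLNormalizer n = rest.remove(name);
                if (n != null) chain.add(n);
            }
            chain.addAll(rest.values()); // normalizers not named in the order
            return chain;
        }

        String normalize(String url, String scope) {
            for (URLNormalizer n : ordered(scope)) url = n.normalize(url, scope);
            return url;
        }
    }

    static NormalizerChain build() {
        Map<String, String> conf = new HashMap<>();
        conf.put("urlnormalizer.order", "QueryStripNormalizer FragmentNormalizer");
        conf.put("urlnormalizer.order.inject", "FragmentNormalizer QueryStripNormalizer");
        conf.put("urlnormalizer.scope.fetcher", "FragmentNormalizer");
        NormalizerChain chain = new NormalizerChain(conf);
        chain.register("FragmentNormalizer", new FragmentNormalizer());
        chain.register("QueryStripNormalizer", new QueryStripNormalizer());
        return chain;
    }

    public static void main(String[] args) {
        NormalizerChain chain = build();
        String url = "http://example.com/a?x=1#top";
        System.out.println(chain.normalize(url, "inject"));  // http://example.com/a
        System.out.println(chain.normalize(url, "fetcher")); // http://example.com/a?x=1
    }
}
```

In this sketch the "inject" scope has its own order and its own config in 
QueryStripNormalizer, while the "fetcher" scope restricts the list of 
normalizers and otherwise falls back to the default order.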

NUTCH-363 suggests to me that further changes may be required in this area. 
Perhaps we should combine urlfilters and urlnormalizers into a single subsystem 
of URL munging: now that we have support for scopes and flexible combinations 
of normalizers, we could turn URLFilters into a special case of normalizers (or 
vice versa, depending on the point of view) ... 


        
