Domain Indexing / Query Filter
------------------------------

                 Key: NUTCH-445
                 URL: https://issues.apache.org/jira/browse/NUTCH-445
             Project: Nutch
          Issue Type: New Feature
          Components: indexer, searcher
    Affects Versions: 0.9.0
            Reporter: Enis Soztutar


Hostnames contain information about the domain of the host and all of its 
super domains. Indexing and searching domains is important for intuitive 
search behavior. 

From the DomainIndexingFilter javadoc: 
Adds the domain (hostname) and all super domains to the index. 
 * <br> For http://lucene.apache.org/nutch/ the 
 * following will be added to the index : <br> 
 * <ul>
 * <li>lucene.apache.org </li>
 * <li>apache.org </li>
 * <li>org </li>
 * </ul>
 * All hostnames are domain names, but not all the domain names are 
 * hostnames. In the above example hostname lucene is a 
 * subdomain of apache.org, which is itself a subdomain of 
 * org <br>
 * 
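
The expansion described in the javadoc can be sketched as follows. This is a
minimal illustration, not the code attached to this issue: the class name
SuperDomains and the method of() are hypothetical, and the sketch only splits
the hostname on dots without touching any Nutch or Lucene API.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not the actual DomainIndexingFilter.
class SuperDomains {

    /** Returns the hostname of the URL followed by each of its super domains. */
    static List<String> of(String url) {
        List<String> domains = new ArrayList<>();
        String host = URI.create(url).getHost();
        if (host == null) {
            return domains; // no host component, nothing to index
        }
        // Locale.ROOT keeps lowercasing independent of the default locale.
        host = host.toLowerCase(Locale.ROOT);
        // Strip one leading label at a time: a.b.c -> b.c -> c
        while (!host.isEmpty()) {
            domains.add(host);
            int dot = host.indexOf('.');
            if (dot < 0) {
                break;
            }
            host = host.substring(dot + 1);
        }
        return domains;
    }
}
```

Calling SuperDomains.of("http://lucene.apache.org/nutch/") yields the three
values listed above, each of which would be added as a value of the domain
field.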
 
Currently, BasicIndexingFilter indexes the hostname in the site field, and 
the query-site plugin allows searching on that field. However, the site 
field holds only the full hostname, so site:apache.org will not return 
http://lucene.apache.org

 By indexing the domain, we can search by domain. Unlike a search on 
 the site field (indexed by BasicIndexingFilter), a search on the 
 domain field retrieves lucene.apache.org for the query 
 apache.org. 
 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
