Domain Indexing / Query Filter
------------------------------

                 Key: NUTCH-445
                 URL: https://issues.apache.org/jira/browse/NUTCH-445
             Project: Nutch
          Issue Type: New Feature
          Components: indexer, searcher
    Affects Versions: 0.9.0
            Reporter: Enis Soztutar

Hostnames contain information about the domain of the host and all of its subdomains. Indexing and searching domains is important for intuitive search behavior.

From the DomainIndexingFilter javadoc:

 * Adds the domain (hostname) and all super domains to the index.
 * <br> For http://lucene.apache.org/nutch/ the
 * following will be added to the index: <br>
 * <ul>
 * <li>lucene.apache.org</li>
 * <li>apache</li>
 * <li>org</li>
 * </ul>
 * All hostnames are domain names, but not all domain names are
 * hostnames. In the above example, hostname lucene is a
 * subdomain of apache.org, which is itself a subdomain of
 * org. <br>

Currently, the basic indexing filter indexes the hostname in the site field, and the query-site plugin allows searching in the site field. However, site:apache.org will not return http://lucene.apache.org.

By indexing the domain, we are able to search by domain. Unlike a search on the site field (indexed by BasicIndexingFilter), a search on the domain field lets us retrieve lucene.apache.org for the query apache.org.
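
The term expansion described in the DomainIndexingFilter javadoc could be sketched roughly as follows. This is only an illustration, not the code from the patch: the class name DomainTermsSketch and the method domainTerms are hypothetical, and the sketch assumes the full hostname is emitted first, followed by each parent label, to match the list given in the javadoc (lucene.apache.org, apache, org). It does not use any Nutch or Lucene API.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: derives the terms that the
// javadoc lists for a URL (full hostname, then each parent label).
public class DomainTermsSketch {

    public static List<String> domainTerms(String url) {
        List<String> terms = new ArrayList<>();
        String host = URI.create(url).getHost();
        if (host == null || host.isEmpty()) {
            return terms; // no hostname, nothing to index
        }
        // Locale.ROOT avoids locale-specific case mapping surprises.
        host = host.toLowerCase(Locale.ROOT);
        terms.add(host); // e.g. lucene.apache.org

        // Assumption: each label after the first is added on its own,
        // as in the javadoc example (apache, org).
        String[] labels = host.split("\\.");
        for (int i = 1; i < labels.length; i++) {
            terms.add(labels[i]);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints [lucene.apache.org, apache, org]
        System.out.println(domainTerms("http://lucene.apache.org/nutch/"));
    }
}
```

In an indexing filter, each returned term would presumably be added to a domain field of the document, so that a query on that field can match pages hosted on any subdomain.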