[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476665 ]
Doug Cutting commented on NUTCH-445:
------------------------------------

Setting the boost to non-zero permits a "site:" query with no other
terms, but at the cost of inhibiting the conversion of the clause to a
cached Lucene filter, which can be a substantial optimization.

I think it's better to leave the boost as zero, and then (separately)
fix the conversion-to-filter code to not perform this optimization when
no other query terms are present.

> Domain Indexing / Query Filter
> ------------------------------
>
>                 Key: NUTCH-445
>                 URL: https://issues.apache.org/jira/browse/NUTCH-445
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, searcher
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: index_query_domain_v1.0.patch,
> index_query_domain_v1.1.patch, index_query_domain_v1.2.patch,
> TranslatingRawFieldQueryFilter_v1.0.patch
>
>
> Hostnames contain information about the domain of the host, and all of the
> subdomains. Indexing and searching the domains is important for intuitive
> behavior.
> From DomainIndexingFilter javadoc :
> Adds the domain (hostname) and all super domains to the index.
>  * <br> For http://lucene.apache.org/nutch/ the
>  * following will be added to the index : <br>
>  * <ul>
>  * <li>lucene.apache.org </li>
>  * <li>apache.org </li>
>  * <li>org </li>
>  * </ul>
>  * All hostnames are domain names, but not all the domain names are
>  * hostnames. In the above example hostname lucene is a
>  * subdomain of apache.org, which is itself a subdomain of
>  * org <br>
>  *
>
> Currently the Basic indexing filter indexes the hostname in the site field,
> and the query-site plugin allows searching in the site field. However,
> site:apache.org will not return http://lucene.apache.org
> By indexing the domain, we are able to search by domain. Unlike
> the site field (indexed by BasicIndexingFilter) search, searching the
> domain field allows us to retrieve lucene.apache.org for the query
> apache.org.
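
For readers following the quoted javadoc, the expansion of a hostname
into itself plus its super domains can be sketched roughly as below. This
is only an illustration and is not taken from the attached patches: the
class name SuperDomains and the method expand are hypothetical, and the
sketch assumes a plain dotted hostname (it does not special-case IP
addresses or use any Nutch or Lucene API).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the NUTCH-445 patches.
final class SuperDomains {

    // Returns the hostname followed by each of its super domains,
    // e.g. lucene.apache.org, apache.org, org.
    static List<String> expand(String host) {
        List<String> domains = new ArrayList<>();
        String current = host.toLowerCase(Locale.ROOT);
        // Drop a trailing dot from a fully qualified name.
        if (current.endsWith(".")) {
            current = current.substring(0, current.length() - 1);
        }
        if (current.isEmpty()) {
            return domains;
        }
        domains.add(current);
        // Strip the leftmost label repeatedly until no dot remains.
        int dot = current.indexOf('.');
        while (dot >= 0) {
            current = current.substring(dot + 1);
            domains.add(current);
            dot = current.indexOf('.');
        }
        return domains;
    }

    public static void main(String[] args) throws Exception {
        // The URL is the one used in the quoted javadoc.
        String host = new URI("http://lucene.apache.org/nutch/").getHost();
        System.out.println(expand(host));
    }
}
```

If each returned value were indexed as a term of a domain field, a
query for apache.org would match a document from lucene.apache.org,
which is the behavior the description asks for.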

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.