[ 
https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476665
 ] 

Doug Cutting commented on NUTCH-445:
------------------------------------

Setting the boost to non-zero permits a "site:" query with no other terms, but 
at the cost of inhibiting the conversion of the clause to a cached Lucene 
filter, which can be a substantial optimization.  I think it's better to leave 
the boost as zero, and then (separately) fix the conversion-to-filter code to 
not perform this optimization when no other query terms are present.
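
To make the proposed guard concrete, here is a minimal sketch in plain Java. 
It is an illustration under assumptions, not the actual Nutch or Lucene code: 
the Clause and Plan types and the plan() method are hypothetical stand-ins. 
The idea is only that zero-boost clauses become filters when at least one 
scoring clause remains, and stay in the query otherwise.

```java
import java.util.ArrayList;
import java.util.List;

public class FilterConversionSketch {

    /** Hypothetical stand-in for a required query clause; not a Lucene or Nutch type. */
    public static final class Clause {
        public final String field;
        public final String term;
        public final float boost;

        public Clause(String field, String term, float boost) {
            this.field = field;
            this.term = term;
            this.boost = boost;
        }
    }

    /** Hypothetical result: clauses that stay in the scored query vs. clauses run as filters. */
    public static final class Plan {
        public final List<Clause> scored = new ArrayList<Clause>();
        public final List<Clause> filters = new ArrayList<Clause>();
    }

    public static Plan plan(List<Clause> clauses) {
        Plan p = new Plan();
        for (Clause c : clauses) {
            // Zero-boost clauses do not affect ranking, so they are filter candidates.
            if (c.boost == 0.0f) {
                p.filters.add(c);
            } else {
                p.scored.add(c);
            }
        }
        // Guard: if no scoring clause is left, skip the optimization and keep
        // the zero-boost clauses in the query so it still has something to match.
        if (p.scored.isEmpty()) {
            p.scored.addAll(p.filters);
            p.filters.clear();
        }
        return p;
    }
}
```

With this shape, a bare "site:apache.org" stays an ordinary query, while 
"lucene site:apache.org" still gets the filter treatment for the site clause.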

> Domain Indexing / Query Filter
> ------------------------------
>
>                 Key: NUTCH-445
>                 URL: https://issues.apache.org/jira/browse/NUTCH-445
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, searcher
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: index_query_domain_v1.0.patch, 
> index_query_domain_v1.1.patch, index_query_domain_v1.2.patch, 
> TranslatingRawFieldQueryFilter_v1.0.patch
>
>
> Hostnames contain information about the domain of the host and all of its 
> subdomains. Indexing and searching domains is important for intuitive 
> behavior. 
> From the DomainIndexingFilter javadoc: 
> Adds the domain (hostname) and all super domains to the index. 
>  * <br> For http://lucene.apache.org/nutch/ the 
>  * following will be added to the index : <br> 
>  * <ul>
>  * <li>lucene.apache.org </li>
>  * <li>apache.org</li>
>  * <li>org </li>
>  * </ul>
>  * All hostnames are domain names, but not all the domain names are 
>  * hostnames. In the above example hostname lucene is a 
>  * subdomain of apache.org, which is itself a subdomain of 
>  * org <br>
>  * 
>  
> Currently the basic indexing filter indexes the hostname in the site field, 
> and the query-site plugin 
> allows searching the site field. However, site:apache.org will not return 
> http://lucene.apache.org
>  By indexing the domain, we can search by domain. Unlike 
>  searching the site field (indexed by BasicIndexingFilter), searching the 
>  domain field lets the query apache.org retrieve 
>  lucene.apache.org. 
>  
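
The "domain and all super domains" expansion quoted above can be sketched as 
follows. This is a rough illustration, not the code from the attached 
patches: the class and method names are made up, and it assumes the input is 
a plain hostname with no scheme, port, or trailing dot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SuperDomainsSketch {

    /**
     * Returns the hostname followed by each of its super domains,
     * obtained by repeatedly dropping the leftmost label.
     */
    public static List<String> superDomains(String host) {
        List<String> result = new ArrayList<String>();
        // Locale.ROOT keeps lowercasing independent of the default locale.
        String rest = host.toLowerCase(Locale.ROOT);
        while (!rest.isEmpty()) {
            result.add(rest);
            int dot = rest.indexOf('.');
            if (dot < 0) {
                break; // reached the top-level label
            }
            rest = rest.substring(dot + 1);
        }
        return result;
    }
}
```

For lucene.apache.org this produces lucene.apache.org, apache.org and org, so 
a query for apache.org against such a field would match the page.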

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
