Domain Indexing / Query Filter
------------------------------
Key: NUTCH-445
URL: https://issues.apache.org/jira/browse/NUTCH-445
Project: Nutch
Issue Type: New Feature
Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Hostnames contain information about the domain of the host and all of its
parent domains. Indexing and searching these domains is important for
intuitive search behavior.
From the DomainIndexingFilter javadoc:
Adds the domain(hostname) and all super domains to the index.
* <br> For http://lucene.apache.org/nutch/ the
* following will be added to the index : <br>
* <ul>
* <li>lucene.apache.org </li>
* <li>apache</li>
* <li>org </li>
* </ul>
* All hostnames are domain names, but not all the domain names are
* hostnames. In the above example hostname lucene is a
* subdomain of apache.org, which is itself a subdomain of
* org <br>
*
Currently, BasicIndexingFilter indexes the hostname in the site field, and the
query-site plugin allows searching that field. However, site:apache.org will
not return http://lucene.apache.org, because the site field holds only the
full hostname.
Indexing the domain makes it possible to search by domain. Unlike a search on
the site field (indexed by BasicIndexingFilter), a search on the domain field
returns lucene.apache.org for the query apache.org.
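
To illustrate the idea, here is a minimal sketch of how the hostname and its
super domains could be derived from a URL. It is not the code from the patch:
the class name SuperDomains and the method of() are hypothetical, and it
assumes that "super domains" means every dot-separated suffix of the hostname
(so apache.org rather than apache).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: lists a URL's hostname followed by
// each of its dot-separated suffixes (assumed meaning of "super domains").
final class SuperDomains {

    static List<String> of(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        // Locale.ROOT keeps lowercasing independent of the default locale.
        host = host.toLowerCase(Locale.ROOT);

        List<String> domains = new ArrayList<>();
        int from = 0;
        while (true) {
            // Add the suffix starting at the current label.
            domains.add(host.substring(from));
            int dot = host.indexOf('.', from);
            if (dot < 0) {
                break; // last label reached
            }
            from = dot + 1; // continue with the parent domain
        }
        return domains;
    }

    public static void main(String[] args) {
        // prints [lucene.apache.org, apache.org, org]
        System.out.println(of("http://lucene.apache.org/nutch/"));
    }
}
```

If each returned value were added to a domain field, a query for apache.org
would match any document whose hostname ends in apache.org, which is the
behavior described above.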