Hi,
I think it is important to understand how "site:" works. As far as I know, it
is a kind of "filter" that is applied to the result set. That means it
needs to be used in combination with other terms.
After that we run a query "apple" and obtain some results including
the pages from www.apple.com. If we specify the domain for search
"apple site:www.apple.com" we get only the pages from www.apple.com.
The number of resulting pages may be considered the number of the
pages crawled at the domain www.apple.com.
I think that is correct ONLY if every page in the domain www.apple.com
contains the word "apple". In simple terms, the query returns all pages
having the word "apple" in their content and then filters out all pages that
are not from the domain www.apple.com.
But if we search either for "luggagepensgifts" or for
"luggagepensgifts site:www.luggagepensgifts.com", no results are
returned. This site is certainly included in the index, because
searching for other words specific to its pages returns results. The
same holds, e.g., for www.nycexoticcarrentals.com.
It means that no page contains the word "luggagepensgifts" in its content
(the site name is not considered content during search). Thus it makes sense
that applying an additional filter ("site:" in this case) can't increase the
number of returned pages.
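To make the reasoning concrete, here is a minimal sketch of "site:" treated
as a filter over the result set. It is a toy model in Python, not Nutch's
actual implementation; the Page class, the INDEX contents and the search()
helper are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    host: str
    content: str


# Hypothetical toy index; hosts and page content are invented for this example.
INDEX = [
    Page("www.apple.com", "apple ipod and mac"),
    Page("www.apple.com", "support and downloads"),  # no "apple" in content
    Page("www.fruit.example", "apple pie recipe"),
    Page("www.luggagepensgifts.com", "luggage pens and gifts"),
]


def search(term, site=None):
    """Match the term against page content, then narrow the hits by host."""
    hits = [p for p in INDEX if term in p.content.split()]
    if site is not None:
        # The filter can only remove hits, never add new ones.
        hits = [p for p in hits if p.host == site]
    return hits
```

In this model, search("apple", site="www.apple.com") misses the second
www.apple.com page because its content lacks the word, and
search("luggagepensgifts") is empty with or without the site filter.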
What might cause this behavior, and how can we obtain the
pages with "luggagepensgifts" in the domain?
Interestingly, there is another "filter" called "url:" which can be used in
this case and should return the expected result.
In short:
url:http - should return all pages having "http" in their URL (that should
be all pages crawled via the HTTP protocol)
url:http site:apple - should return only pages from the domain
www.apple.com (provided they were crawled via the HTTP protocol)
url:http site:luggagepensgifts - should return only pages from the domain
www.luggagepensgifts.com (provided they were crawled via the HTTP protocol)
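
The queries above can be sketched the same way. This is only a rough
Python model under my own assumptions: the URLS list is invented, the URL is
split into tokens on non-alphanumeric characters (Nutch's real analyzer may
tokenize differently), and the site filter here compares the full host name.

```python
import re
from urllib.parse import urlsplit

# Hypothetical crawled URLs, invented for this example.
URLS = [
    "http://www.apple.com/ipod/",
    "http://www.luggagepensgifts.com/pens.html",
    "https://www.example.org/about",
]


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens (assumed tokenization)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def search_url(term, site=None):
    """Match the term against URL tokens, then narrow the hits by host."""
    hits = [u for u in URLS if term in url_tokens(u)]
    if site is not None:
        hits = [u for u in hits if urlsplit(u).hostname == site]
    return hits
```

Here search_url("http", site="www.luggagepensgifts.com") returns the one page
from that host, and search_url("luggagepensgifts") also finds it because the
word appears as a token of the URL rather than of the page content.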
Regards,
Lukas
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general