Keith,

Maybe I was not clear: we also have another use case.
The goal of our research project was to identify the location of the content publisher.
Since publisher and domain owner are mostly identical, our results were fair enough.
Of course we cache WHOIS results, since lookups are a bottleneck and we are not allowed to flood the WHOIS servers with requests (that would amount to a DoS attack).
Furthermore, we do named entity extraction and then try to classify the entities (such as addresses) based on context information, in order to assign a location to a domain.
However, the first method provides better quality.
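As a rough sketch of the caching idea (not the project's actual code; `raw_whois_lookup` is a hypothetical stand-in for a real WHOIS query to a registry server):

```python
from functools import lru_cache

# Counter so we can see how often the "server" is actually hit.
CALLS = {"count": 0}

def raw_whois_lookup(domain: str) -> str:
    """Hypothetical stand-in for a real WHOIS query (e.g. a plain-text
    query to port 43 of the registry, parsing the registrant's address
    fields out of the response)."""
    CALLS["count"] += 1
    return f"registrant address for {domain}"

@lru_cache(maxsize=10_000)
def cached_whois_lookup(domain: str) -> str:
    # Caching keeps repeated lookups for the same domain from hitting
    # the WHOIS server again (and from looking like a request flood).
    return raw_whois_lookup(domain)

if __name__ == "__main__":
    for d in ["example.org", "example.org", "example.net", "example.org"]:
        cached_whois_lookup(d)
    print(CALLS["count"])  # only 2 distinct domains were actually queried
```

In a crawl, where many URLs share the same domain, a cache like this cuts the WHOIS traffic down to one query per distinct domain.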


Anyway, in case you would like to get the location of a server, there is another nice trick.
First, there are some free and commercial datasets mapping IP ranges to locations.
Do a traceroute to the target server and then try to look up one of the last hops in the IP-location database. :-)
Please note that this method is covered by a patent held by a company.
I heard Google used this method as well, but a court decided that the mechanism is patented by that company; I don't remember the name.
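The traceroute trick could be sketched roughly like this (the IP-location table, addresses, and cities below are made up for illustration; a real setup would use one of the free or commercial datasets mentioned above and run traceroute against the live target):

```python
import re
from typing import Optional

# Toy IP-prefix-to-location table standing in for a real GeoIP dataset.
IP_LOCATION_DB = {
    "203.0.113.": "Boston, MA",
    "198.51.100.": "Cambridge, MA",
}

# Abridged example traceroute output (documentation-range addresses).
SAMPLE_TRACEROUTE = """\
 1  192.0.2.1  0.5 ms
 2  198.51.100.7  4.1 ms
 3  203.0.113.42  9.8 ms
"""

def hop_ips(traceroute_output: str) -> list:
    """Extract the hop IP addresses from plain traceroute output."""
    return re.findall(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", traceroute_output)

def locate_last_hops(traceroute_output: str) -> Optional[str]:
    """Walk the hops backwards and return the first location that
    resolves; the hops nearest the target usually sit in the same
    data center or city as the server itself."""
    for ip in reversed(hop_ips(traceroute_output)):
        for prefix, location in IP_LOCATION_DB.items():
            if ip.startswith(prefix):
                return location
    return None

print(locate_last_hops(SAMPLE_TRACEROUTE))  # Boston, MA
```

The design point is simply to trust the last resolvable hops rather than the target's own IP, since the target may sit behind a load balancer or CDN while the upstream routers still reveal the physical network path.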


HTH
Stefan






On Mar 10, 2005, at 8:05 PM, Keith Campbell wrote:

Stefan,

I finally got around to reading the discussion on location limited crawling to which you referred me. The approach you discussed appears to rely on WHOIS lookups to determine the geographic location of web servers to filter URLs. It's an interesting idea (and maybe the only way to automate a geographic crawl) but I'm curious about the kinds of results you get with this. You are assuming that most websites use local hosting services and that these websites, when locally hosted, have content about that locality (rather than some other locality or non-local content).

Just curious, from your experience, is there a strong correlation between the geographic location of web hosts and geographic-specific content on those webservers?

How much "noise" do you get from doing this kind of IP-based crawl?

Keith

On Mar 3, 2005, at 5:06 AM, Stefan Groschupf wrote:

Keith,

Just copy all your segment folders into one folder.
Then create a new, empty webdb. Next, simply update the new webdb with all your existing segments.
The result is a merge of your two crawl directories.
BTW, there was some discussion about how to do location-limited crawling on the developer mailing list.


HTH
Stefan

On Mar 3, 2005, at 6:18 AM, Keith Campbell wrote:

Hi,

I'm a new user of Nutch running a beta Boston-based search engine here: http://www.beantownsearch.com/. I've been evaluating Nutch and testing various "crawl strategies".

Because of current bandwidth/hardware limitations, I've had to devise a methodical and selective "crawl strategy" to index the most relevant (and comprehensive) Boston-specific websites. (So far so good. My results already appear to be on par with Google's results.) In order to continue this strategy (and to see if it is possible to offer a better local search than Google's) I need to learn how to combine multiple crawl directories (each many gigabytes).

crawl_directory1
        - db
        - segments
crawl_directory2
        - db
        - segments
etc ...

The first directory was seeded with a very large number of Boston-specific URLs which were then used for "intranet crawling" followed by db analysis and numerous rounds of "whole web crawling". I've created additional crawl directories by "intranet crawling" other large collections of distinct Boston URLs. I've now reached the stage where I need to be able to combine these crawl directories into one crawl directory which can then be analyzed and used for continued "whole web crawling." This step is important because it's the only method I could think of to keep my "whole web crawling" on target (i.e. in the Boston area).

So my question is: How do I combine multiple crawl directories into one directory which can be used for additional "whole web crawling"?

Thanks in advance for any help,
Keith Campbell



-------------------------------------------------------
SF email is sponsored by - The IT Product Guide
Read honest & candid reviews on hundreds of IT Products from real users.
Discover which products truly live up to the hype. Start reading now.
http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general



-----------information technology-------------------
company:     http://www.media-style.com
forum:           http://www.text-mining.org
blog:                http://www.find23.net







