You could write your own simple parse plugin that generates abc.xyz.com/stuff as an outlink of www.xyz.com/stuff. That new URL is then crawled in (one of the) subsequent crawl cycles.
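For example, something along these lines. This is an untested sketch against the Nutch 1.x HtmlParseFilter extension point; the class name is illustrative, and the usual plugin.xml wiring plus a plugin.includes entry are omitted:

import java.net.MalformedURLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class SubdomainOutlinkFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    String pageUrl = content.getUrl();
    // Only pages on www.xyz.com get an extra abc.xyz.com outlink.
    if (!pageUrl.contains("://www.xyz.com/")) {
      return parseResult;
    }
    Parse parse = parseResult.get(pageUrl);
    if (parse == null) {
      return parseResult;
    }
    ParseData data = parse.getData();
    try {
      // The subdomain variant of the page's own URL, with empty anchor text.
      Outlink variant = new Outlink(
          pageUrl.replaceFirst("://www\\.xyz\\.com/", "://abc.xyz.com/"), "");
      // Append the variant to the page's existing outlinks.
      Outlink[] old = data.getOutlinks();
      Outlink[] extended = new Outlink[old.length + 1];
      System.arraycopy(old, 0, extended, 0, old.length);
      extended[old.length] = variant;
      // Rebuild ParseData with the extended outlink array.
      ParseData newData = new ParseData(data.getStatus(), data.getTitle(),
          extended, data.getContentMeta(), data.getParseMeta());
      parseResult.put(pageUrl, new ParseText(parse.getText()), newData);
    } catch (MalformedURLException e) {
      // leave outlinks unchanged if the rewritten URL is somehow invalid
    }
    return parseResult;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

As Sergey noted, keep db.ignore.external.links at false; otherwise the abc.xyz.com links would be dropped as external to www.xyz.com before they ever reach the crawldb.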
Mathijs Homminga

On Nov 7, 2011, at 7:15, Peyman Mohajerian <[email protected]> wrote:

> Thanks Sergey,
> I don't think I was clear on the issue. The subdomain I'm speaking of
> won't be found by the crawler, so I have to somehow add it. In my
> original input url of: http://www.xyz.com/stuff
> there is absolutely no way the crawler would know about
> http://abc.xyz.com/stuff
> I have to somehow dynamically add the subdomain.
> I also don't have the option of actually adding
> 'http://abc.xyz.com/stuff' in my input file (a bit of an extra
> convolution I don't want to bore you with!!).
>
> Thanks,
> Peyman
>
> On Sun, Nov 6, 2011 at 1:21 PM, Sergey A Volkov
> <[email protected]> wrote:
>> Hi!
>>
>> I think you should use urlfilter-regex like "http://\w\.xyz\.com/stuff.*"
>> instead of urlfilter-domain and set db.ignore.external.links to false. This
>> will work, but it is quite slow if you have many regexes.
>>
>> You may also try to add xyz.com to domain-suffixes.xml, but this may cause
>> some side effects. I have never tested this, just looked at the
>> DomainURLFilter source, so it's probably not a good idea.
>>
>> Sergey Volkov
>>
>> On Mon 07 Nov 2011 12:35:30 AM MSK, Peyman Mohajerian wrote:
>>>
>>> Hi Guys,
>>>
>>> Let's say my input file is:
>>> http://www.xyz.com/stuff
>>>
>>> and I have thousands of these URLs in my input. How do I configure
>>> Nutch to also crawl this subdomain for each input:
>>> http://abc.xyz.com/stuff
>>>
>>> I don't want to just replace 'www' with 'abc', I want to crawl both.
>>>
>>> Thanks
>>> Peyman

