That is correct. The new URLs are derived by replacing part of each URL in the original list, e.g. 'www' with 'abc', which is the opposite of filtering. I think I have to modify the source code for this; if so, would the Injector class be the best place? Ideally, of course, I don't want to add my own customization!
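One way to avoid touching the Injector might be to generate the extra URLs outside Nutch and feed them to a second inject run, as suggested in the quoted message below. The following is only a rough sketch of that idea, not anything Nutch provides: the function names swap_subdomain and expand_seeds are made up for illustration, and it assumes the rewrite is a plain swap of the leading host label ('www' to 'abc').

```python
from urllib.parse import urlsplit, urlunsplit


def swap_subdomain(url, old="www", new="abc"):
    """Return `url` with a leading `old.` host label replaced by `new.`.

    Returns None when the host does not start with `old.`.
    Hypothetical helper; userinfo (user:pass@) is not preserved.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is lower-cased by urlsplit
    prefix = old + "."
    if not host.startswith(prefix):
        return None
    new_host = new + "." + host[len(prefix):]
    # Keep an explicit port if the original URL had one.
    netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def expand_seeds(urls, old="www", new="abc"):
    """Build the list of extra seed URLs (the variants only, not the originals)."""
    extra = []
    for line in urls:
        url = line.strip()
        if not url:
            continue  # skip blank lines in the seed file
        variant = swap_subdomain(url, old, new)
        if variant is not None:
            extra.append(variant)
    return extra


if __name__ == "__main__":
    seeds = ["http://www.xyz.com/stuff"]
    for url in expand_seeds(seeds):
        print(url)
```

The output could be written to a file in a separate seed directory and injected into the existing crawldb with something like `bin/nutch inject <crawldb> <new_seed_dir>`; the original URLs already in the crawldb should stay there, so both hosts get crawled.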

On Sun, Nov 6, 2011 at 11:22 PM, Sergey A Volkov <[email protected]> wrote:
> If I understand correctly,
> you can run inject job on your crawldb with new url's and new input file,
> old url's would be still in crawldb
>
> On Mon 07 Nov 2011 10:15:26 AM MSK, Peyman Mohajerian wrote:
>>
>> Thanks Sergey,
>> I don't think I was clear on the issue, the subdomain I'm speaking of
>> won't be found by the crawler, I have to somehow add it, so in my
>> original input url of: http://www.xyz.com/stuff
>> there is absolutely no way the crawler would know about
>> http://abc.xyz.com/stuff
>> I have to somehow dynamically add the subdomain.
>> I also don't have the option of actually adding
>> 'http://abc.xyz.com/stuff' in my input file (a bit of an extra
>> convolution I don't want to bore you with!!).
>>
>> Thanks,
>> Peyman
>>
>> On Sun, Nov 6, 2011 at 1:21 PM, Sergey A Volkov
>> <[email protected]> wrote:
>>>
>>> Hi!
>>>
>>> I think you should use urlfilter-regex like "http://\w\.xyz\.com/stuff.*"
>>> instead of urlfilter-domain and set db.ignore.external.links to false,
>>> this will work, but this is quite slow if you have many regex.
>>>
>>> You may also try to add xyz.com to domain-suffixes.xml, this may cause
>>> some side effects, i had never tested this, just looked in
>>> DomainURLFilter source, so it's probably not really good idea.
>>>
>>> Sergey Volkov
>>>
>>> On Mon 07 Nov 2011 12:35:30 AM MSK, Peyman Mohajerian wrote:
>>>>
>>>> Hi Guys,
>>>>
>>>> Let's say my input file is:
>>>> http://www.xyz.com/stuff
>>>>
>>>> and I have thousands of these URLs in my input. How do I configure
>>>> Nutch to also crawl this subdomain for each input:
>>>> http://abc.xyz.com/stuff
>>>>
>>>> I don't want to just replace 'www' with 'abc' i want to crawl both.
>>>>
>>>> Thanks
>>>> Peyman

