You could write your own simple parse plugin that generates abc.xyz.com/stuff as an outlink of www.xyz.com/stuff. That new URL is then crawled in (one of the) subsequent crawl cycles.
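For example, something along these lines. This is an untested sketch against the Nutch 1.x HtmlParseFilter extension point; the class name is illustrative, and the usual plugin.xml wiring plus a plugin.includes entry are omitted:

import java.net.MalformedURLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class SubdomainOutlinkFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    String pageUrl = content.getUrl();
    // Only pages on www.xyz.com get an extra abc.xyz.com outlink.
    if (!pageUrl.contains("://www.xyz.com/")) {
      return parseResult;
    }
    Parse parse = parseResult.get(pageUrl);
    if (parse == null) {
      return parseResult;
    }
    ParseData data = parse.getData();
    try {
      // The subdomain variant of the page's own URL, with empty anchor text.
      Outlink variant = new Outlink(
          pageUrl.replaceFirst("://www\\.xyz\\.com/", "://abc.xyz.com/"), "");
      // Append the variant to the page's existing outlinks.
      Outlink[] old = data.getOutlinks();
      Outlink[] extended = new Outlink[old.length + 1];
      System.arraycopy(old, 0, extended, 0, old.length);
      extended[old.length] = variant;
      // Rebuild ParseData with the extended outlink array.
      ParseData newData = new ParseData(data.getStatus(), data.getTitle(),
          extended, data.getContentMeta(), data.getParseMeta());
      parseResult.put(pageUrl, new ParseText(parse.getText()), newData);
    } catch (MalformedURLException e) {
      // leave outlinks unchanged if the rewritten URL is somehow invalid
    }
    return parseResult;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

As Sergey noted, keep db.ignore.external.links at false; otherwise the abc.xyz.com links would be dropped as external to www.xyz.com before they ever reach the crawldb.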
Mathijs Homminga

On Nov 7, 2011, at 7:15, Peyman Mohajerian <[email protected]> wrote:

> Thanks Sergey,
> I don't think I was clear on the issue. The subdomain I'm speaking of
> won't be found by the crawler, so I have to somehow add it. In my
> original input url of: http://www.xyz.com/stuff
> there is absolutely no way the crawler would know about
> http://abc.xyz.com/stuff
> I have to somehow dynamically add the subdomain.
> I also don't have the option of actually adding
> 'http://abc.xyz.com/stuff' in my input file (a bit of an extra
> convolution I don't want to bore you with!!).
>
> Thanks,
> Peyman
>
> On Sun, Nov 6, 2011 at 1:21 PM, Sergey A Volkov
> <[email protected]> wrote:
>> Hi!
>>
>> I think you should use urlfilter-regex like "http://\w\.xyz\.com/stuff.*"
>> instead of urlfilter-domain and set db.ignore.external.links to false. This
>> will work, but it is quite slow if you have many regexes.
>>
>> You may also try to add xyz.com to domain-suffixes.xml, but this may cause
>> some side effects. I have never tested this, just looked at the
>> DomainURLFilter source, so it's probably not a good idea.
>>
>> Sergey Volkov
>>
>> On Mon 07 Nov 2011 12:35:30 AM MSK, Peyman Mohajerian wrote:
>>>
>>> Hi Guys,
>>>
>>> Let's say my input file is:
>>> http://www.xyz.com/stuff
>>>
>>> and I have thousands of these URLs in my input. How do I configure
>>> Nutch to also crawl this subdomain for each input:
>>> http://abc.xyz.com/stuff
>>>
>>> I don't want to just replace 'www' with 'abc', I want to crawl both.
>>>
>>> Thanks
>>> Peyman

