Thanks... I also have those filters, but I still don't get the normal pages, like http://www.corninc.com.la/faqs.htm.
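One quick check is whether any outlinks are being parsed out of those home pages at all. A rough sketch, reusing the segment name from the crawl log further down in this thread; the segdump output directory is just a placeholder, and the exact readseg options may differ between Nutch versions, so check bin/nutch readseg for the usage on yours:

bin/nutch readseg -dump crawl-20090401170024/segments/20090401170027 segdump -nocontent -nofetch -nogenerate

The dump lands in a text file under segdump; if the ParseData entry for a home page shows no outlinks, the problem is in parsing rather than in the URL filters.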
2009/4/1 Alejandro Gonzalez <[email protected]>:

Yes, I've launched a crawl with these filters:

+^http://([a-z0-9]*\.)*corninc.com.la/*
+^http://([a-z0-9]*\.)*lao-indochina.com/*
+^http://([a-z0-9]*\.)*nuol.edu.la/*

and the only link it gets within those domains is that .js file. On the others I got timeouts.
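For reference, a complete conf/crawl-urlfilter.txt restricted to these hosts might look roughly like the sketch below. The suffix-skip line, the query-character skip, and the final catch-all reject follow the layout of the stock filter file and are assumptions here, not something quoted in this thread; .js is added to the suffix list per the discussion in the next message, and the trailing /* on the accept rules above is redundant in regex terms and can simply be dropped.

# skip URLs ending in common non-HTML suffixes (.js added so scripts are dropped)
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept only the three hosts under discussion
+^http://([a-z0-9]*\.)*corninc\.com\.la/
+^http://([a-z0-9]*\.)*lao-indochina\.com/
+^http://([a-z0-9]*\.)*nuol\.edu\.la/

# reject everything else
-.

With the final "-." line in place, off-domain links such as http://www.allgamerentals.com/ should be rejected at the filtering step instead of being fetched.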
2009/4/1 陈琛 <[email protected]>:

fetching http://www.corninc.com.la/_pgtres/stm31.js

I think I should filter out the ".js*" URLs above.

2009/4/1 陈琛 <[email protected]>:

Yes, I got the same result, but I want to limit the crawl to these three URLs.

fetching http://www.allgamerentals.com/
fetching http://www.mahasan.com/
fetching http://www.amazingcounters.com/
fetching http://www.alldvdrentals.com/video-game-rentals.html
fetching http://www.allgamerentals.com/rental-services.php

These URLs have nothing to do with the three above; I do not need to fetch them.

One URL seems right:

fetching http://www.corninc.com.la/_pgtres/stm31.js

which is a sub-page of http://www.corninc.com.la.

2009/4/1 Alejandro Gonzalez <[email protected]>:

I've tried with these 3 sites and this is what my Nutch got:

crawl started in: crawl-20090401170024
rootUrlDir = urls
threads = 5
depth = 3
topN = 30
Injector: starting
Injector: crawlDb: crawl-20090401170024/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-20090401170024/segments/20090401170027
Generator: filtering: false
Generator: topN: 30
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-20090401170024/segments/20090401170027
Fetcher: threads: 5
fetching http://www.corninc.com.la/
fetching http://www.nuol.edu.la/
fetching http://www.lao-indochina.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-20090401170024/crawldb
CrawlDb update: segments: [crawl-20090401170024/segments/20090401170027]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-20090401170024/segments/20090401170042
Generator: filtering: false
Generator: topN: 30
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-20090401170024/segments/20090401170042
Fetcher: threads: 5
fetching http://www.allgamerentals.com/
fetching http://www.mahasan.com/
fetching http://www.amazingcounters.com/
fetching http://www.alldvdrentals.com/video-game-rentals.html
fetching http://www.allgamerentals.com/rental-services.php
fetching http://www.corninc.com.la/_pgtres/stm31.js

Are you sure it is not a network issue? The only strange thing I've noticed is the slower fetching...

2009/4/1 陈琛 <[email protected]>:

I think the URL http://www.vientianecollege.laopdr.com/ should be changed to http://www.vientianecollege.com/. That one is right; its sub-pages can be fetched.

http://www.lao-indochina.com
http://www.nuol.edu.la
http://www.corninc.com.la

also only fetch the home page.

2009/4/1 陈琛 <[email protected]>:

It fetches other URLs, not sub-pages...

2009/4/1 Alejandro Gonzalez <[email protected]>:

Try using this as the filter in crawl-urlfilter.txt, and comment out the other + lines:

+^http://([a-z0-9]*\.)*

2009/4/1 Alejandro Gonzalez <[email protected]>:

Yeah, I thought that at first, but I've had a look at those websites and they have some normal links. I'm going to deploy a Nutch and try them. Which version are you running?

2009/4/1 陈琛 <[email protected]>:

Thanks, but I do not think this is a timeout problem.

I think these are special websites; perhaps the links come from other sources, like some JavaScript?

So I do not know which URLs Nutch can actually fetch...

2009/4/1 Alejandro Gonzalez <[email protected]>:

Strange, strange :). Maybe you got a timeout error? Have you changed this property in nutch-site or nutch-default?

<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
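If a longer timeout is worth testing, the usual place to override it is conf/nutch-site.xml rather than editing nutch-default.xml. A minimal sketch, with the 30000 ms value picked arbitrarily for illustration:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.timeout</name>
    <value>30000</value>
    <description>Network timeout in milliseconds, raised from the 10000 default for slow sites.</description>
  </property>
</configuration>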
2009/4/1 陈琛 <[email protected]>:

Thanks very much ;)

Here are the Cygwin console log (out.txt) and the Nutch log (hadoop.log). I cannot find any clues in them.

2009/4/1 Alejandro Gonzalez <[email protected]>:

Send me the log of the crawling if possible; for sure there are some clues in it.

2009/4/1 陈琛 <[email protected]>:

Yes, the depth is 10 and topN is 2000...

So strange... the other URLs are normal, but these 4 URLs...

2009/4/1 Alejandro Gonzalez <[email protected]>:

Seems strange. Have you tried to start a crawl with just these 4 seed pages?

Are you setting the topN parameter?

2009/4/1 陈琛 <[email protected]>:

Thanks. I have a collection of URLs; only these four fail to fetch a subset of their pages.

The URLs and crawl-urlfilter are in the attachment.

2009/4/1 Alejandro Gonzalez <[email protected]>:

Is your crawl-urlfilter OK? Are you sure it's fetching them properly? Maybe it's not getting the content of the pages, and so it cannot extract links to fetch at the next level (I assume you have set the crawl depth to more than just the seed level).

So either your filters are skipping the seeds (I suppose that's not the case, since you say the URLs arrive at the Fetcher), or the fetching is not going OK (network issues?). Take a look at that.

2009/4/1 陈琛 <[email protected]>:

Hi all,

I have four URLs, like this:

http://www.lao-indochina.com
http://www.nuol.edu.la
http://www.corninc.com.la
http://www.vientianecollege.laopdr.com

Only the home page is fetched. Why are the sub-pages not fetched?
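For an isolated test along the lines suggested above, the 0.9/1.0-era one-step crawl command can be run against a seed file containing only the four URLs. The directory and file names here (urls/seed.txt, crawl-test) are placeholders, and depth/topN are picked small just for the test:

Contents of urls/seed.txt:

http://www.lao-indochina.com
http://www.nuol.edu.la
http://www.corninc.com.la
http://www.vientianecollege.laopdr.com

Run a small crawl (a depth greater than 1 gives the sub-pages a chance to be generated), then check how many URLs actually made it into the crawl db:

bin/nutch crawl urls -dir crawl-test -depth 3 -topN 50 -threads 5
bin/nutch readdb crawl-test/crawldb -stats

If the crawldb stats show only the four seeds after a depth-3 crawl, no outlinks were accepted, which points at either parsing or the URL filters rather than the fetch itself.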
