fetching http://www.corninc.com.la/_pgtres/stm31.js

I think I will filter out the ".js*" URLs.

2009/4/1 陈琛 <[email protected]>:

Yes, I got the same result, but I want to limit the crawl to those three URLs. Fetches like these:

fetching http://www.allgamerentals.com/
fetching http://www.mahasan.com/
fetching http://www.amazingcounters.com/
fetching http://www.alldvdrentals.com/video-game-rentals.html
fetching http://www.allgamerentals.com/rental-services.php

have nothing to do with the three sites; I do not need to fetch them.

One URL does look right:

fetching http://www.corninc.com.la/_pgtres/stm31.js

is a sub-page of http://www.corninc.com.la.

2009/4/1 Alejandro Gonzalez <[email protected]>:

I've tried with these 3 sites and this is what my Nutch got:

crawl started in: crawl-20090401170024
rootUrlDir = urls
threads = 5
depth = 3
topN = 30
Injector: starting
Injector: crawlDb: crawl-20090401170024/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-20090401170024/segments/20090401170027
Generator: filtering: false
Generator: topN: 30
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-20090401170024/segments/20090401170027
Fetcher: threads: 5
fetching http://www.corninc.com.la/
fetching http://www.nuol.edu.la/
fetching http://www.lao-indochina.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-20090401170024/crawldb
CrawlDb update: segments: [crawl-20090401170024/segments/20090401170027]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-20090401170024/segments/20090401170042
Generator: filtering: false
Generator: topN: 30
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-20090401170024/segments/20090401170042
Fetcher: threads: 5
fetching http://www.allgamerentals.com/
fetching http://www.mahasan.com/
fetching http://www.amazingcounters.com/
fetching http://www.alldvdrentals.com/video-game-rentals.html
fetching http://www.allgamerentals.com/rental-services.php
fetching http://www.corninc.com.la/_pgtres/stm31.js

Are you sure it is not a network issue? The only strange thing I've noticed is the slower fetching...

2009/4/1 陈琛 <[email protected]>:

http://www.vientianecollege.laopdr.com/ should, I think, be changed to http://www.vientianecollege.com/. That one works; its sub-pages are fetched.

http://www.lao-indochina.com
http://www.nuol.edu.la
http://www.corninc.com.la

still only fetch the home page.

2009/4/1 陈琛 <[email protected]>:

It fetches other URLs, not the sub-pages...
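A crawl-urlfilter.txt along these lines would keep the crawl on the three hosts and skip the JavaScript files; this is only a sketch built from the hosts and the ".js" remark in the messages above, not a tested configuration (the .js rule could equally be added to the default suffix-skip line):

# skip JavaScript files
-\.js$

# accept only the three hosts under discussion
+^http://([a-z0-9]*\.)*lao-indochina\.com
+^http://([a-z0-9]*\.)*nuol\.edu\.la
+^http://([a-z0-9]*\.)*corninc\.com\.la

# skip everything else
-.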
2009/4/1 Alejandro Gonzalez <[email protected]>:

Try using this as the filter in crawl-urlfilter.txt and comment out the other "+" lines:

+^http://([a-z0-9]*\.)*

2009/4/1 Alejandro Gonzalez <[email protected]>:

Yeah, I thought that at first, but I've been having a look at those websites and they do have some normal links. I'm going to deploy a Nutch and try them. Which version are you running?

2009/4/1 陈琛 <[email protected]>:

Thanks, but I do not think this is a timeout problem. I think they are special websites; perhaps the links come from other sources, like some JavaScript? So I do not know which URLs Nutch can actually fetch.

2009/4/1 Alejandro Gonzalez <[email protected]>:

Strange, strange :). Maybe you got a timeout error? Have you changed this property in nutch-site or nutch-default?

<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

2009/4/1 陈琛 <[email protected]>:

Thanks very much ;) Attached are the log from the Cygwin console (out.txt) and the Nutch log (hadoop.log). I cannot find any clues in them.

2009/4/1 Alejandro Gonzalez <[email protected]>:

Send me the log of the crawl if possible. For sure there are some clues in it.

2009/4/1 陈琛 <[email protected]>:

Yes, the depth is 10 and topN is 2000... So strange... the other URLs are normal, but not these 4.

2009/4/1 Alejandro Gonzalez <[email protected]>:

Seems strange. Have you tried to start a crawl with just these 4 seed pages? Are you setting the topN parameter?

2009/4/1 陈琛 <[email protected]>:

Thanks. I have a collection of URLs; only these four fail to fetch any of their sub-pages. The URLs and my crawl-urlfilter are in the attachment.
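If a timeout really were the cause, the http.timeout default quoted above would normally be overridden in conf/nutch-site.xml rather than edited in nutch-default.xml. A minimal sketch, with a purely illustrative value:

<configuration>
  <property>
    <name>http.timeout</name>
    <!-- raise the 10000 ms default; 30000 is only an example value -->
    <value>30000</value>
  </property>
</configuration>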
2009/4/1 Alejandro Gonzalez <[email protected]>:

Is your crawl-urlfilter OK? Are you sure it is fetching them properly? Maybe it is not getting the content of the pages, and so it cannot extract links to fetch at the next level (assuming you have set the crawl depth to go beyond just the seed level). So either your filters are skipping the seeds (I suppose that is not the case, since you say the URLs arrive at the Fetcher), or the fetching is not going OK (network issues?). Take a look at that.

2009/4/1 陈琛 <[email protected]>:

Hi all,
I have four URLs, like this:

http://www.lao-indochina.com
http://www.nuol.edu.la
http://www.corninc.com.la
http://www.vientianecollege.laopdr.com

Only the home page is fetched. Why are the sub-pages not fetched?
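For reference, the kind of run discussed in this thread corresponds roughly to the one-step crawl tool invocation below; the seed directory, depth, topN, and thread count mirror the log earlier in the thread, while the -dir name is just an example:

bin/nutch crawl urls -dir crawl -depth 3 -topN 30 -threads 5

With -depth 1 only the injected seeds themselves are fetched, so a depth of at least 2 is needed before any sub-pages can show up at all; beyond that, the URL filters decide which of the extracted links survive into the next fetch round.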
