Hi,

You can follow the commands in section 3.2 of the Nutch tutorial: http://wiki.apache.org/nutch/NutchTutorial

You can use the individual commands to crawl multiple times. The way it works is that the first time, you create a crawl database (a list of links that act as roots) from the domains of interest (e.g., wikipedia.org). After crawling from those roots (you can define the number of pages and the depth), you update your crawldb with all of the links fetched into the new segment and repeat the process. Once you have fetched enough pages, you create the inverted index.
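If it helps, for Nutch 1.x the per-round commands look roughly like the sketch below (a sketch only: the -topN value and the segment-selection step are placeholders, the crawl/ directories live in DFS when you run on Hadoop, and newer Nutch versions index into Solr with solrindex instead of the last step):

# bin/nutch inject crawl/crawldb urlsdir            (seed the crawldb with your root URLs)
# bin/nutch generate crawl/crawldb crawl/segments -topN 1000
# s1=`ls -d crawl/segments/2* | tail -1`            (pick up the segment that generate just created)
# bin/nutch fetch $s1
# bin/nutch parse $s1                               (skip if your fetcher is configured to parse while fetching)
# bin/nutch updatedb crawl/crawldb $s1              (merge the newly fetched links into the crawldb)
(repeat generate / fetch / parse / updatedb for as many rounds as you need)
# bin/nutch invertlinks crawl/linkdb -dir crawl/segments
# bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*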
Hope this helps.

Regards,
-Stavros.

On Jun 19, 2014, at 11:22 AM, J Ahn wrote:

Hi Volos,

The reason I ask is that I am not familiar with Nutch or with crawling web sites. Specifically, I do not understand how to perform the crawl multiple times. Do you mean that I should just crawl the same public site, the one listed in urllist.txt, multiple times to get a larger index?

I crawled the wikipedia.org website, but the index and segments are only 24MB and 156MB, respectively.

1) First, I modified urllist.txt to crawl wikipedia:

# echo http://www.apache.org > $HADOOP_HOME/urlsdir/urllist.txt
# $HADOOP_HOME/bin/hadoop dfs -put urlsdir urlsdir

2) Second, I updated conf/crawl-urlfilter.txt to cover *.wikipedia.org:

# vim conf/crawl-urlfilter.txt

3) Finally, I launched the crawler:

# $HADOOP_HOME/bin/nutch crawl urlsdir -dir crawl -depth 3

Is there any problem with these steps that keeps the index and segment sizes from growing?

- Jeongseob
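(A note on steps 2 and 3 above: the accept rule in conf/crawl-urlfilter.txt and the crawl parameters are what bound the crawl size, and the seed in urllist.txt has to point at the site you actually want to crawl. The regex below follows the tutorial's pattern and the -depth / -topN values are only illustrative:)

+^http://([a-z0-9]*\.)*wikipedia.org/

# $HADOOP_HOME/bin/nutch crawl urlsdir -dir crawl -depth 10 -topN 100000

A small -depth with the default per-round page limit is the usual reason the index and segments stay small.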
2014-06-16 18:59 GMT+09:00 Volos Stavros <[email protected]>:

Hi Jeongseob,

Exactly. You need to perform the crawling phase multiple times so that you get a larger index. You don't really need to crawl the same public sites we crawled, nor use the same terms_en.out; in any case, wikipedia was one of them. You need to have enough clients to saturate your CPU while maintaining quality of service.

Hope this helps.

-Stavros.

On Jun 8, 2014, at 8:27 PM, J Ahn wrote:

I am just wondering how to increase the size of the crawled index and segments. It seems that we need to crawl a larger data set. Is this right?

In addition, I would like to reproduce the experimental results that appear in the "Clearing the Clouds" paper. The paper used an index of 2GB and data segments of 23GB of content crawled from the public web. Could you explain which public sites you crawled?

Next, I have a question about configuring the clients. How many clients were used in the experiments, and which terms_en.out was used?

- Jeongseob

2013-06-09 16:16 GMT+09:00 Hailong Yang <[email protected]>:

Hi Zacharias,

Have you tried increasing the size of your crawled index and segments? For example, the Clearing the Clouds paper says they used a 2GB index and 23GB segments.

Best,
Hailong

On Fri, May 31, 2013 at 10:24 PM, zhadji01 <[email protected]> wrote:

Hi,

I have a web-search benchmark setup with 4 machines: 1 client, 1 front-end, 1 search server, and 1 segment server for fetching the summaries. All machines are two-socket Xeon E5620 systems @ 2.4GHz with 32GB RAM, connected by 1Gb Ethernet. My crawled data is a 400MB index and 4GB segments.

My problem is that the servers' CPU utilization is very low. The maximum throughput I managed to get using the faban client or apache benchmark was ~400-450 queries/sec, with user CPU utilizations of ~5% on the front-end, ~10% on the search server, and ~35-39% on the segment server. I am sure the network is not the bottleneck, because I am not even close to filling the bandwidth.

Can you give any suggestions on how to better utilize the servers, or any thoughts on what the problem might be?

Thanks,
Zacharias Hadjilambrou
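(One client-side thing that may be worth checking: a single load-generator instance can saturate before the servers do, so raising the concurrency and running several client instances or machines in parallel may be needed. The host, port, query path, and request counts below are placeholders, not the benchmark's actual URL:)

# ab -k -c 512 -n 200000 "http://frontend-host:8080/search?q=test"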
