I think so, but there are no urls in the index and no data in DFS. (A few
ways to check what actually happened are sketched below, after the quoted
message.)

jackyu wrote:
>
> it is ok, the urls are in the hadoop log
>
> On Fri, Apr 3, 2009 at 5:01 PM, zxh116116 <[email protected]> wrote:
>
>>
>> Hi all,
>> I am new to Nutch. When I use nutch-1.0 I run into a problem.
>> Below is my config:
>>
>> master
>> ubuntu3
>>
>> slaves
>> ubuntu6
>> ubuntu7
>>
>> urllist.txt
>> http://www.163.com
>>
>> crawl-urlfilter.txt
>> # accept hosts in MY.DOMAIN.NAME
>> +^http://([a-z0-9]*\.)*163.com/
>>
>> hadoop-env.sh
>> # Set Hadoop-specific environment variables here.
>>
>> # The only required environment variable is JAVA_HOME. All others are
>> # optional. When running a distributed configuration it is best to
>> # set JAVA_HOME in this file, so that it is correctly defined on
>> # remote nodes.
>>
>> # The java implementation to use. Required.
>> export JAVA_HOME=/home/hadoop/jdk6
>>
>> export HADOOP_HOME=/home/hadoop/search
>>
>> # The maximum amount of heap to use, in MB. Default is 1000.
>> # export HADOOP_HEAPSIZE=2000
>>
>> # Extra Java runtime options. Empty by default.
>> # export HADOOP_OPTS=-server
>>
>> # Extra ssh options. Default: '-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR'.
>> # export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR"
>>
>> # Where log files are stored. $HADOOP_HOME/logs by default.
>> export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
>>
>> # File naming remote slave hosts. $HADOOP_HOME/conf/slaves by default.
>> export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
>>
>> # host:path where hadoop code should be rsync'd from. Unset by default.
>> # export HADOOP_MASTER=master:/home/$USER/src/hadoop
>>
>> # The directory where pid files are stored. /tmp by default.
>> # export HADOOP_PID_DIR=/var/hadoop/pids
>>
>> # A string representing this instance of hadoop. $USER by default.
>> # export HADOOP_IDENT_STRING=$USER
>>
>> hadoop-site.xml
>>
>> <configuration>
>>   <property>
>>     <name>fs.default.name</name>
>>     <value>hdfs://ubuntu3:9000/</value>
>>     <description>
>>       The name of the default file system. Either the literal string
>>       "local" or a host:port for NDFS.
>>     </description>
>>   </property>
>>   <property>
>>     <name>mapred.job.tracker</name>
>>     <value>hdfs://ubuntu3:9001/</value>
>>     <description>
>>       The host and port that the MapReduce job tracker runs at. If
>>       "local", then jobs are run in-process as a single map and
>>       reduce task.
>>     </description>
>>   </property>
>>   <property>
>>     <name>mapred.map.tasks</name>
>>     <value>2</value>
>>     <description>
>>       define mapred.map tasks to be number of slave hosts
>>     </description>
>>   </property>
>>   <property>
>>     <name>mapred.reduce.tasks</name>
>>     <value>2</value>
>>     <description>
>>       define mapred.reduce tasks to be number of slave hosts
>>     </description>
>>   </property>
>>   <property>
>>     <name>dfs.name.dir</name>
>>     <value>/home/hadoop/filesystem/name</value>
>>   </property>
>>   <property>
>>     <name>dfs.data.dir</name>
>>     <value>/home/hadoop/filesystem/data</value>
>>   </property>
>>   <property>
>>     <name>mapred.system.dir</name>
>>     <value>/home/hadoop/filesystem/mapreduce/system</value>
>>   </property>
>>   <property>
>>     <name>mapred.local.dir</name>
>>     <value>/home/hadoop/filesystem/mapreduce/local</value>
>>   </property>
>>   <property>
>>     <name>dfs.replication</name>
>>     <value>1</value>
>>   </property>
>> </configuration>
>>
>> nutch-site.xml
>>
>> <configuration>
>>   <property>
>>     <name>http.robots.agents</name>
>>     <value>*</value>
>>   </property>
>>   <property>
>>     <name>http.agent.name</name>
>>     <value>mic</value>
>>   </property>
>>   <property>
>>     <name>http.agent.url</name>
>>     <value>www.baidu.com</value>
>>   </property>
>> </configuration>
>>
>> had...@ubuntu3:~/search$ bin/nutch crawl urls -dir crawled -depth 5 -topN 100
>> crawl started in: crawled
>> rootUrlDir = urls
>> threads = 10
>> depth = 5
>> topN = 100
>> Injector: starting
>> Injector: crawlDb: crawled/crawldb
>> Injector: urlDir: urls
>> Injector: Converting injected urls to crawl db entries.
>> Injector: Merging injected urls into crawl db.
>> Injector: done
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: crawled/segments/20090403010943
>> Generator: filtering: true
>> Generator: topN: 100
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> Fetcher: starting
>> Fetcher: segment: crawled/segments/20090403010943
>> Fetcher: done
>> CrawlDb update: starting
>> CrawlDb update: db: crawled/crawldb
>> CrawlDb update: segments: [crawled/segments/20090403010943]
>> CrawlDb update: additions allowed: true
>> CrawlDb update: URL normalizing: true
>> CrawlDb update: URL filtering: true
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: done
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: crawled/segments/20090403011147
>> Generator: filtering: true
>> Generator: topN: 100
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> Fetcher: starting
>> Fetcher: segment: crawled/segments/20090403011147
>> Fetcher: done
>> CrawlDb update: starting
>> CrawlDb update: db: crawled/crawldb
>> CrawlDb update: segments: [crawled/segments/20090403011147]
>> CrawlDb update: additions allowed: true
>> CrawlDb update: URL normalizing: true
>> CrawlDb update: URL filtering: true
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: done
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: crawled/segments/20090403011354
>> Generator: filtering: true
>> Generator: topN: 100
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> Fetcher: starting
>> Fetcher: segment: crawled/segments/20090403011354
>> Fetcher: done
>> CrawlDb update: starting
>> CrawlDb update: db: crawled/crawldb
>> CrawlDb update: segments: [crawled/segments/20090403011354]
>> CrawlDb update: additions allowed: true
>> CrawlDb update: URL normalizing: true
>> CrawlDb update: URL filtering: true
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: done
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: crawled/segments/20090403011601
>> Generator: filtering: true
>> Generator: topN: 100
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> Fetcher: starting
>> Fetcher: segment: crawled/segments/20090403011601
>> Fetcher: done
>> CrawlDb update: starting
>> CrawlDb update: db: crawled/crawldb
>> CrawlDb update: segments: [crawled/segments/20090403011601]
>> CrawlDb update: additions allowed: true
>> CrawlDb update: URL normalizing: true
>> CrawlDb update: URL filtering: true
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: done
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: crawled/segments/20090403011810
>> Generator: filtering: true
>> Generator: topN: 100
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> Fetcher: starting
>> Fetcher: segment: crawled/segments/20090403011810
>> Fetcher: done
>> CrawlDb update: starting
>> CrawlDb update: db: crawled/crawldb
>> CrawlDb update: segments: [crawled/segments/20090403011810]
>> CrawlDb update: additions allowed: true
>> CrawlDb update: URL normalizing: true
>> CrawlDb update: URL filtering: true
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: done
>> LinkDb: starting
>> LinkDb: linkdb: crawled/linkdb
>> LinkDb: URL normalize: true
>> LinkDb: URL filter: true
>> LinkDb: adding segment: hdfs://ubuntu3:9000/user/hadoop/crawled/segments/20090403010943
>> LinkDb: adding segment: hdfs://ubuntu3:9000/user/hadoop/crawled/segments/20090403011147
>> LinkDb: adding segment: hdfs://ubuntu3:9000/user/hadoop/crawled/segments/20090403011354
>> LinkDb: adding segment: hdfs://ubuntu3:9000/user/hadoop/crawled/segments/20090403011601
>> LinkDb: adding segment: hdfs://ubuntu3:9000/user/hadoop/crawled/segments/20090403011810
>> LinkDb: done
>> Indexer: starting
>> Indexer: done
>> Dedup: starting
>> Dedup: adding indexes in: crawled/indexes
>> Dedup: done
>> merging indexes to: crawled/index
>> Adding hdfs://ubuntu3:9000/user/hadoop/crawled/indexes/part-00000
>> Adding hdfs://ubuntu3:9000/user/hadoop/crawled/indexes/part-00001
>> done merging
>> crawl finished: crawled
>>
>> Why does it not fetch any urls or data? The logs show no exception.
>> Please help.
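
To confirm the "no data in DFS" part, the crawldb and the segments can be
inspected directly. This is only a sketch based on the paths from the crawl
command above (-dir crawled) and the stock Nutch 1.0 / bundled Hadoop
command-line tools; adjust the paths if your layout differs.

    cd /home/hadoop/search

    # Summary of the crawldb: if nothing was fetched, the status counts
    # should show the seed still unfetched (db_unfetched) and no
    # db_fetched entries.
    bin/nutch readdb crawled/crawldb -stats

    # Look inside a segment: a segment with no fetched pages typically
    # contains crawl_generate but little or no content/parse_data.
    bin/hadoop dfs -ls crawled/segments
    bin/hadoop dfs -ls crawled/segments/20090403010943

If readdb -stats still shows the seed url as unfetched, the problem is in the
fetch step (filters, robots.txt, DNS on the slaves) rather than in indexing.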
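It is also worth double-checking that the seed url actually passes
crawl-urlfilter.txt, since the Generator runs with filtering: true. A quick
regex sanity check with grep, using the exact pattern from the filter file
above (the unescaped dot in "163.com" is sloppy but does not reject the url):

    # seed exactly as written in urllist.txt (no trailing slash):
    echo "http://www.163.com" | grep -E '^http://([a-z0-9]*\.)*163.com/'

    # seed with a trailing slash, which the default URL normalizer
    # usually adds before filtering:
    echo "http://www.163.com/" | grep -E '^http://([a-z0-9]*\.)*163.com/'

    # If your build ships the URLFilterChecker tool, something like the
    # line below runs the url through the whole configured filter chain;
    # the class name and option are from memory, so treat this as an
    # assumption:
    # echo "http://www.163.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

The first grep prints nothing (the pattern requires the trailing slash), the
second echoes the url back; as long as normalization runs before filtering,
the filter itself should not be what empties the fetch.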
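Finally, as jackyu says, the per-url fetch activity does not show up in the
console output of bin/nutch crawl when running on Hadoop; it ends up in the
task logs on the slave nodes. A sketch for digging through them, assuming the
HADOOP_LOG_DIR from hadoop-env.sh above (/home/hadoop/search/logs); the exact
file layout (hadoop.log vs. userlogs/...) varies, hence the recursive grep:

    # run against ubuntu6 and ubuntu7:
    ssh ubuntu6 'grep -ri "fetching " /home/hadoop/search/logs | tail -n 20'

    # robots.txt denials, DNS trouble or agent-name problems also land there:
    ssh ubuntu6 'grep -riE "denied|exception|failed" /home/hadoop/search/logs | tail -n 20'

If no "fetching ..." lines appear at all, the fetch tasks never received the
url, which usually points back at the generate/filter step rather than at the
network.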
