Do you mind attaching the configuration files? That would be more readable than pasting them inline. The hadoop.log file will be useful too (if it is too big, please compress it first).
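One thing stands out already: the fetches in your log fail with "Http code=407", which is "Proxy Authentication Required". That means the proxy itself is rejecting (or never receiving) your credentials, so nothing can be fetched. As a first step you can test the credentials outside Nutch with curl, assuming the same proxy host and port as in your nutch-site.xml (substitute your real username and password):

  $ curl -I -x proxy.ABC.COM:8080 -U ABCUSER:XXXXX http://www.yahoo.com/

A 2xx or 3xx response means basic proxy authentication works as configured. Another 407 usually means wrong credentials, or a proxy that only accepts NTLM; curl's --proxy-ntlm option can confirm the latter. For NTLM, double-check that http.proxy.realm holds your NT domain and that http.proxy.username is not prefixed with the domain, exactly as the property descriptions say.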
On Wed, May 21, 2008 at 1:27 AM, Sivakumar_NCS <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I am a newbie to crawling and am exploring the possibility of crawling
> internet websites from my work PC. My work environment uses a proxy to
> access the web, so I have configured the proxy information under
> <NUTCH_HOME>/conf/ by overriding nutch-site.xml. Attached is the xml for
> reference.
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>ABC</value>
>     <description>ABC</description>
>   </property>
>   <property>
>     <name>http.agent.description</name>
>     <value>Acompany</value>
>     <description>A company</description>
>   </property>
>   <property>
>     <name>http.agent.url</name>
>     <value></value>
>     <description></description>
>   </property>
>   <property>
>     <name>http.agent.email</name>
>     <value></value>
>     <description></description>
>   </property>
>   <property>
>     <name>http.timeout</name>
>     <value>10000</value>
>     <description>The default network timeout, in milliseconds.</description>
>   </property>
>   <property>
>     <name>http.max.delays</name>
>     <value>100</value>
>     <description>The number of times a thread will delay when trying to
>     fetch a page. Each time it finds that a host is busy, it will wait
>     fetcher.server.delay. After http.max.delays attempts, it will give
>     up on the page for now.</description>
>   </property>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>     <description>Regular expression naming plugin directory names to
>     include. Any plugin not matching this expression is excluded.
>     In any case you need at least include the nutch-extensionpoints plugin.
>     By default Nutch includes crawling just HTML and plain text via HTTP,
>     and basic indexing and search plugins. In order to use HTTPS please
>     enable protocol-httpclient, but be aware of possible intermittent
>     problems with the underlying commons-httpclient library.
>     </description>
>   </property>
>   <property>
>     <name>http.proxy.host</name>
>     <value>proxy.ABC.COM</value><!-- MY WORK PROXY -->
>     <description>The proxy hostname. If empty, no proxy is used.</description>
>   </property>
>   <property>
>     <name>http.proxy.port</name>
>     <value>8080</value>
>     <description>The proxy port.</description>
>   </property>
>   <property>
>     <name>http.proxy.username</name>
>     <value>ABCUSER</value><!-- MY NETWORK USERID -->
>     <description>Username for proxy. This will be used by
>     'protocol-httpclient', if the proxy server requests basic, digest
>     and/or NTLM authentication. To use this, 'protocol-httpclient' must
>     be present in the value of 'plugin.includes' property.
>     NOTE: For NTLM authentication, do not prefix the username with the
>     domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
>     </description>
>   </property>
>   <property>
>     <name>http.proxy.password</name>
>     <value>XXXXX</value>
>     <description>Password for proxy. This will be used by
>     'protocol-httpclient', if the proxy server requests basic, digest
>     and/or NTLM authentication. To use this, 'protocol-httpclient' must
>     be present in the value of 'plugin.includes' property.
>     </description>
>   </property>
>   <property>
>     <name>http.proxy.realm</name>
>     <value>ABC</value><!-- MY NETWORK DOMAIN -->
>     <description>Authentication realm for proxy. Do not define a value
>     if realm is not required or authentication should take place for any
>     realm. NTLM does not use the notion of realms. Specify the domain name
>     of NTLM authentication as the value for this property. To use this,
>     'protocol-httpclient' must be present in the value of
>     'plugin.includes' property.
>     </description>
>   </property>
>   <property>
>     <name>http.agent.host</name>
>     <value>xxx.xxx.xxx.xx</value><!-- MY LOCAL PC'S IP -->
>     <description>Name or IP address of the host on which the Nutch crawler
>     would be running. Currently this is used by 'protocol-httpclient'
>     plugin.
>     </description>
>   </property>
> </configuration>
>
> My crawl-urlfilter.txt is as follows:
>
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +^http://([a-z0-9]*\.)*yahoo.com/
>
> # skip everything else
> -.
>
> My regex-urlfilter.txt is as follows:
>
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> # +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> #+^http://([a-z0-9]*\.)*apache.org/
> +^http://([a-z0-9]*\.)*yahoo.com/
>
> # skip everything else
> -.
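A side note on the filter files, since the first-match rule often surprises people: the first pattern that matches a URL decides its fate, and later patterns are never consulted. For example, with the rules above (the example URLs are made up for illustration):

  http://www.yahoo.com/            accepted by +^http://([a-z0-9]*\.)*yahoo.com/
  http://www.yahoo.com/r/sx?p=abc  dropped by -[?*!@=] before the accept rule is reached
  http://www.google.com/           falls through to the final -. and is dropped

The filters are not what is failing here, though; as the log below shows, http://www.yahoo.com/ passes them and is handed to the fetcher.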
>
> Also attached is the console output (hadoop.log):
>
> [EMAIL PROTECTED] /cygdrive/d/nutch-0.9-IntranetS-Proxy-C
> $ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> topN = 50
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> tempDir:::/tmp/hadoop-administrator/mapred/temp/inject-temp-1144725146
> Injector: Converting injected urls to crawl db entries.
> map url: http://www.yahoo.com/
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080521130128
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080521130128
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> http.proxy.host = proxy.abc.com
> http.proxy.port = 8080
> http.timeout = 10000
> http.content.limit = 65536
> http.agent = abc/Nutch-0.9 (Acompany)
> protocol.plugin.check.blocking = true
> protocol.plugin.check.robots = true
> fetcher.server.delay = 1000
> http.max.delays = 100
> Configured Client
> fetch of http://www.yahoo.com/ failed with: Http code=407, url=http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080521130128]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080521130140
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080521130140
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> http.proxy.host = proxy.abc.com
> http.proxy.port = 8080
> http.timeout = 10000
> http.content.limit = 65536
> http.agent = ABC/Nutch-0.9 (Acompany)
> protocol.plugin.check.blocking = true
> protocol.plugin.check.robots = true
> fetcher.server.delay = 1000
> http.max.delays = 100
> Configured Client
> fetch of http://www.yahoo.com/ failed with: Http code=407, url=http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080521130140]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080521130154
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080521130154
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> http.proxy.host = proxy.abc.com
> http.proxy.port = 8080
> http.timeout = 10000
> http.content.limit = 65536
> http.agent = ABC/Nutch-0.9 (Acompany)
> protocol.plugin.check.blocking = true
> protocol.plugin.check.robots = true
> fetcher.server.delay = 1000
> http.max.delays = 100
> Configured Client
> fetch of http://www.yahoo.com/ failed with: Http code=407, url=http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080521130154]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: crawl/segments/20080521130128
> LinkDb: adding segment: crawl/segments/20080521130140
> LinkDb: adding segment: crawl/segments/20080521130154
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20080521130128
> Indexer: adding segment: crawl/segments/20080521130140
> Indexer: adding segment: crawl/segments/20080521130154
> Optimizing index.
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:137)
>
> [EMAIL PROTECTED] /cygdrive/d/nutch-0.9-IntranetS-Proxy-C
>
> Please clarify what the issue with the configuration is, and let me know
> if any configuration is missing. Your help will be greatly appreciated.
> Thanks in advance.
>
> Regards,
> Siva
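One more note: the Dedup exception at the very end is almost certainly a knock-on effect of the failed fetches. Since every fetch returned 407, nothing was indexed, so crawl/indexes is empty (or missing) when DeleteDuplicates runs, and the job fails. It should go away once the proxy authentication is sorted out. As a quick sanity check before the indexing steps, readdb's -stats option (it should be available in 0.9) will tell you whether anything was actually fetched; using the same paths as in your run:

  $ bin/nutch readdb crawl/crawldb -stats

Until that reports at least one fetched page, the indexing and dedup phases have nothing to work on.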