Do you mind attaching the configuration files? That way they are more
human readable. The hadoop.log file will be useful too (if it's too big,
please compress it).

On Wed, May 21, 2008 at 1:27 AM, Sivakumar_NCS <[EMAIL PROTECTED]> wrote:

>
> Hi,
>
> I am a newbie to crawling and am exploring the possibility of crawling
> internet websites from my work PC. My work environment uses a proxy to
> access the web.
> So I have configured the proxy information under <NUTCH_HOME>/conf/ by
> overriding nutch-site.xml. Attached is the XML for reference.
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
> <property>
>  <name>http.agent.name</name>
>  <value>ABC</value>
>  <description>ABC</description>
> </property>
> <property>
>  <name>http.agent.description</name>
>  <value>Acompany</value>
>  <description>A company</description>
> </property>
> <property>
>  <name>http.agent.url</name>
>  <value></value>
>  <description></description>
> </property>
> <property>
> <name>http.agent.email</name>
>  <value></value>
>  <description></description>
> </property>
> <property>
>  <name>http.timeout</name>
>  <value>10000</value>
>  <description>The default network timeout, in milliseconds.</description>
> </property>
> <property>
>  <name>http.max.delays</name>
>  <value>100</value>
>  <description>The number of times a thread will delay when trying to
>  fetch a page.  Each time it finds that a host is busy, it will wait
>  fetcher.server.delay.  After http.max.delays attempts, it will give
>  up on the page for now.</description>
> </property>
> <property>
>  <name>plugin.includes</name>
>
>
> <value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>  <description>Regular expression naming plugin directory names to
>  include.  Any plugin not matching this expression is excluded.
>  In any case you need at least include the nutch-extensionpoints plugin. By
>  default Nutch includes crawling just HTML and plain text via HTTP,
>  and basic indexing and search plugins. In order to use HTTPS please enable
>  protocol-httpclient, but be aware of possible intermittent problems with
> the
>  underlying commons-httpclient library.
>  </description>
> </property>
> <property>
>  <name>http.proxy.host</name>
>  <value>proxy.ABC.COM</value><!--MY WORK PROXY-->
>  <description>The proxy hostname.  If empty, no proxy is
> used.</description>
> </property>
> <property>
>  <name>http.proxy.port</name>
>  <value>8080</value>
>  <description>The proxy port.</description>
> </property>
> <property>
>  <name>http.proxy.username</name>
>  <value>ABCUSER</value><!--MY NETWORK USERID-->
>  <description>Username for proxy. This will be used by
>  'protocol-httpclient', if the proxy server requests basic, digest
>  and/or NTLM authentication. To use this, 'protocol-httpclient' must
>  be present in the value of 'plugin.includes' property.
>  NOTE: For NTLM authentication, do not prefix the username with the
>  domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
>  </description>
> </property>
> <property>
>  <name>http.proxy.password</name>
>  <value>XXXXX</value>
>  <description>Password for proxy. This will be used by
>  'protocol-httpclient', if the proxy server requests basic, digest
>  and/or NTLM authentication. To use this, 'protocol-httpclient' must
>  be present in the value of 'plugin.includes' property.
>  </description>
> </property>
> <property>
>  <name>http.proxy.realm</name>
>  <value>ABC</value><!--MY NETWORK DOMAIN-->
>  <description>Authentication realm for proxy. Do not define a value
>  if realm is not required or authentication should take place for any
>  realm. NTLM does not use the notion of realms. Specify the domain name
>  of NTLM authentication as the value for this property. To use this,
>  'protocol-httpclient' must be present in the value of
>  'plugin.includes' property.
>  </description>
> </property>
> <property>
>  <name>http.agent.host</name>
>  <value>xxx.xxx.xxx.xx</value><!--MY LOCAL PC'S IP-->
>  <description>Name or IP address of the host on which the Nutch crawler
>  would be running. Currently this is used by 'protocol-httpclient'
>  plugin.
>  </description>
> </property>
> </configuration>
>
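(Side note on the proxy properties above: since the fetch log further down shows
HTTP 407, it can help to confirm the proxy credentials outside Nutch first. Below
is a minimal, hypothetical Java probe for that; the host, port, username and
password are just the placeholder values from the config above, and plain
HttpURLConnection generally only answers Basic challenges, so a 407 from this
probe alone does not prove the NTLM settings are wrong.)

import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.Proxy;
import java.net.URL;

// Quick sanity check of the proxy credentials, independent of Nutch.
public class ProxyCheck {
    public static void main(String[] args) throws Exception {
        // Answer Basic challenges (HTTP 407) from the proxy with the same
        // placeholder credentials used in nutch-site.xml above.
        Authenticator.setDefault(new Authenticator() {
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication("ABCUSER", "XXXXX".toCharArray());
            }
        });
        Proxy proxy = new Proxy(Proxy.Type.HTTP,
                new InetSocketAddress("proxy.ABC.COM", 8080));
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://www.yahoo.com/").openConnection(proxy);
        System.out.println("HTTP response code: " + conn.getResponseCode());
        // 200 = the proxy accepted the credentials; 407 = still rejected.
    }
}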
> my crawl-urlfilter.txt is as follows:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +^http://([a-z0-9]*\.)*yahoo.com/
>
> # skip everything else
> -.
>
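(The first-match-wins behaviour described in the comments above is easy to
sanity-check outside Nutch. The sketch below is a hypothetical, simplified
re-implementation using only a subset of the rules, just to show why
http://www.yahoo.com/ is accepted while a URL containing '?' is dropped by the
earlier '-[?*!@=]' rule; it is not how Nutch itself loads the file.)

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Minimal sketch of the filter semantics: each rule is a regex prefixed with
// '+' (accept) or '-' (reject); the first rule found in the URL decides, and
// a URL matching no rule is ignored.
public class UrlFilterSketch {
    // A simplified subset of the rules above, kept in file order.
    private static final Map<String, Boolean> RULES =
            new LinkedHashMap<String, Boolean>();
    static {
        RULES.put("^(file|ftp|mailto):", Boolean.FALSE);               // '-' rule
        RULES.put("\\.(gif|jpg|png|css|zip|exe)$", Boolean.FALSE);     // '-' rule
        RULES.put("[?*!@=]", Boolean.FALSE);                           // '-' rule
        RULES.put("^http://([a-z0-9]*\\.)*yahoo.com/", Boolean.TRUE);  // '+' rule
        RULES.put(".", Boolean.FALSE);                                 // '-' catch-all
    }

    static boolean accept(String url) {
        for (Map.Entry<String, Boolean> rule : RULES.entrySet()) {
            if (Pattern.compile(rule.getKey()).matcher(url).find()) {
                return rule.getValue().booleanValue();   // first match decides
            }
        }
        return false;                                    // no match: ignore the URL
    }

    public static void main(String[] args) {
        System.out.println(accept("http://www.yahoo.com/"));       // true
        System.out.println(accept("http://www.yahoo.com/?p=1"));   // false (query chars)
        System.out.println(accept("ftp://example.com/file.txt"));  // false
    }
}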
>
> my regex-urlfilter.txt is as follows:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> # +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> #+^http://([a-z0-9]*\.)*apache.org/
> +^http://([a-z0-9]*\.)*yahoo.com/
>
> # skip everything else
> -.
>
> Also attached is the console /hadoop.log:
>
> [EMAIL PROTECTED] /cygdrive/d/nutch-0.9-IntranetS-Proxy-C
> $ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> topN = 50
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> tempDir:::/tmp/hadoop-administrator/mapred/temp/inject-temp-1144725146
> Injector: Converting injected urls to crawl db entries.
> map url: http://www.yahoo.com/
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080521130128
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080521130128
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> http.proxy.host = proxy.abc.com
> http.proxy.port = 8080
> http.timeout = 10000
> http.content.limit = 65536
> http.agent = abc/Nutch-0.9 (Acompany)
> protocol.plugin.check.blocking = true
> protocol.plugin.check.robots = true
> fetcher.server.delay = 1000
> http.max.delays = 100
> Configured Client
> fetch of http://www.yahoo.com/ failed with: Http code=407,
> url=http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080521130128]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080521130140
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080521130140
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> http.proxy.host = proxy.abc.com
> http.proxy.port = 8080
> http.timeout = 10000
> http.content.limit = 65536
> http.agent = ABC/Nutch-0.9 (Acompany)
> protocol.plugin.check.blocking = true
> protocol.plugin.check.robots = true
> fetcher.server.delay = 1000
> http.max.delays = 100
> Configured Client
> fetch of http://www.yahoo.com/ failed with: Http code=407,
> url=http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080521130140]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080521130154
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080521130154
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> http.proxy.host = proxy.abc.com
> http.proxy.port = 8080
> http.timeout = 10000
> http.content.limit = 65536
> http.agent = ABC/Nutch-0.9 (Acompany)
> protocol.plugin.check.blocking = true
> protocol.plugin.check.robots = true
> fetcher.server.delay = 1000
> http.max.delays = 100
> Configured Client
> fetch of http://www.yahoo.com/ failed with: Http code=407,
> url=http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080521130154]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: crawl/segments/20080521130128
> LinkDb: adding segment: crawl/segments/20080521130140
> LinkDb: adding segment: crawl/segments/20080521130154
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20080521130128
> Indexer: adding segment: crawl/segments/20080521130140
> Indexer: adding segment: crawl/segments/20080521130154
> Optimizing index.
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Exception in thread "main" java.io.IOException: Job failed!
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:137)
>
> [EMAIL PROTECTED] /cygdrive/d/nutch-0.9-IntranetS-Proxy-C
>
>
>
> Please clarify what the issue with the configuration is, and also guide me if
> any configuration is missing. Your help will be greatly appreciated. Thanks in
> advance.
>
> Regards
> Siva
>
>
>
