Do you mind attaching the configuration files? That way they are more
human readable. The hadoop.log file will be useful too (if it's too big,
please compress it).

On Wed, May 21, 2008 at 1:27 AM, Sivakumar_NCS <[EMAIL PROTECTED]> wrote:

>
> Hi,
>
> I am a newbie to crawling and am exploring the possibility of crawling
> internet websites from my work PC. My work environment uses a proxy to
> access the web.
> So I have configured the proxy information under <NUTCH_HOME>/conf/ by
> overriding nutch-site.xml. Attached is the XML for reference.
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
> <property>
>  <name>http.agent.name</name>
>  <value>ABC</value>
>  <description>ABC</description>
> </property>
> <property>
>  <name>http.agent.description</name>
>  <value>Acompany</value>
>  <description>A company</description>
> </property>
> <property>
>  <name>http.agent.url</name>
>  <value></value>
>  <description></description>
> </property>
> <property>
> <name>http.agent.email</name>
>  <value></value>
>  <description></description>
> </property>
> <property>
>  <name>http.timeout</name>
>  <value>10000</value>
>  <description>The default network timeout, in milliseconds.</description>
> </property>
> <property>
>  <name>http.max.delays</name>
>  <value>100</value>
>  <description>The number of times a thread will delay when trying to
>  fetch a page.  Each time it finds that a host is busy, it will wait
>  fetcher.server.delay.  After http.max.delays attempts, it will give
>  up on the page for now.</description>
> </property>
> <property>
>  <name>plugin.includes</name>
>
>
> <value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>  <description>Regular expression naming plugin directory names to
>  include.  Any plugin not matching this expression is excluded.
>  In any case you need at least include the nutch-extensionpoints plugin. By
>  default Nutch includes crawling just HTML and plain text via HTTP,
>  and basic indexing and search plugins. In order to use HTTPS please enable
>  protocol-httpclient, but be aware of possible intermittent problems with
> the
>  underlying commons-httpclient library.
>  </description>
> </property>
> <property>
>  <name>http.proxy.host</name>
>  <value>proxy.ABC.COM</value><!--MY WORK PROXY-->
>  <description>The proxy hostname.  If empty, no proxy is
> used.</description>
> </property>
> <property>
>  <name>http.proxy.port</name>
>  <value>8080</value>
>  <description>The proxy port.</description>
> </property>
> <property>
>  <name>http.proxy.username</name>
>  <value>ABCUSER</value><!--MY NETWORK USERID-->
>  <description>Username for proxy. This will be used by
>  'protocol-httpclient', if the proxy server requests basic, digest
>  and/or NTLM authentication. To use this, 'protocol-httpclient' must
>  be present in the value of 'plugin.includes' property.
>  NOTE: For NTLM authentication, do not prefix the username with the
>  domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
>  </description>
> </property>
> <property>
>  <name>http.proxy.password</name>
>  <value>XXXXX</value>
>  <description>Password for proxy. This will be used by
>  'protocol-httpclient', if the proxy server requests basic, digest
>  and/or NTLM authentication. To use this, 'protocol-httpclient' must
>  be present in the value of 'plugin.includes' property.
>  </description>
> </property>
> <property>
>  <name>http.proxy.realm</name>
>  <value>ABC</value><!--MY NETWORK DOMAIN-->
>  <description>Authentication realm for proxy. Do not define a value
>  if realm is not required or authentication should take place for any
>  realm. NTLM does not use the notion of realms. Specify the domain name
>  of NTLM authentication as the value for this property. To use this,
>  'protocol-httpclient' must be present in the value of
>  'plugin.includes' property.
>  </description>
> </property>
> <property>
>  <name>http.agent.host</name>
>  <value>xxx.xxx.xxx.xx</value><!--MY LOCAL PC'S IP-->
>  <description>Name or IP address of the host on which the Nutch crawler
>  would be running. Currently this is used by 'protocol-httpclient'
>  plugin.
>  </description>
> </property>
> </configuration>
>
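(Side note on the proxy properties above: since the fetch log further down shows
HTTP 407, it can help to confirm the proxy credentials outside Nutch first. Below
is a minimal, hypothetical Java probe for that; the host, port, username and
password are just the placeholder values from the config above, and plain
HttpURLConnection generally only answers Basic challenges, so a 407 from this
probe alone does not prove the NTLM settings are wrong.)

import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.Proxy;
import java.net.URL;

// Quick sanity check of the proxy credentials, independent of Nutch.
public class ProxyCheck {
    public static void main(String[] args) throws Exception {
        // Answer Basic challenges (HTTP 407) from the proxy with the same
        // placeholder credentials used in nutch-site.xml above.
        Authenticator.setDefault(new Authenticator() {
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication("ABCUSER", "XXXXX".toCharArray());
            }
        });
        Proxy proxy = new Proxy(Proxy.Type.HTTP,
                new InetSocketAddress("proxy.ABC.COM", 8080));
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://www.yahoo.com/").openConnection(proxy);
        System.out.println("HTTP response code: " + conn.getResponseCode());
        // 200 = the proxy accepted the credentials; 407 = still rejected.
    }
}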
> my crawl-urlfilter.txt is as follows:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +^http://([a-z0-9]*\.)*yahoo.com/
>
> # skip everything else
> -.
>
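(The first-match-wins behaviour described in the comments above is easy to
sanity-check outside Nutch. The sketch below is a hypothetical, simplified
re-implementation using only a subset of the rules, just to show why
http://www.yahoo.com/ is accepted while a URL containing '?' is dropped by the
earlier '-[?*!@=]' rule; it is not how Nutch itself loads the file.)

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Minimal sketch of the filter semantics: each rule is a regex prefixed with
// '+' (accept) or '-' (reject); the first rule found in the URL decides, and
// a URL matching no rule is ignored.
public class UrlFilterSketch {
    // A simplified subset of the rules above, kept in file order.
    private static final Map<String, Boolean> RULES =
            new LinkedHashMap<String, Boolean>();
    static {
        RULES.put("^(file|ftp|mailto):", Boolean.FALSE);               // '-' rule
        RULES.put("\\.(gif|jpg|png|css|zip|exe)$", Boolean.FALSE);     // '-' rule
        RULES.put("[?*!@=]", Boolean.FALSE);                           // '-' rule
        RULES.put("^http://([a-z0-9]*\\.)*yahoo.com/", Boolean.TRUE);  // '+' rule
        RULES.put(".", Boolean.FALSE);                                 // '-' catch-all
    }

    static boolean accept(String url) {
        for (Map.Entry<String, Boolean> rule : RULES.entrySet()) {
            if (Pattern.compile(rule.getKey()).matcher(url).find()) {
                return rule.getValue().booleanValue();   // first match decides
            }
        }
        return false;                                    // no match: ignore the URL
    }

    public static void main(String[] args) {
        System.out.println(accept("http://www.yahoo.com/"));       // true
        System.out.println(accept("http://www.yahoo.com/?p=1"));   // false (query chars)
        System.out.println(accept("ftp://example.com/file.txt"));  // false
    }
}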
>
> my regex-urlfilter.txt is as follows:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> # +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> #+^http://([a-z0-9]*\.)*apache.org/
> +^http://([a-z0-9]*\.)*yahoo.com/
>
> # skip everything else
> -.
>
> Also attached is the console /hadoop.log:
>
> [EMAIL PROTECTED] /cygdrive/d/nutch-0.9-IntranetS-Proxy-C
> $ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> topN = 50
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> tempDir:::/tmp/hadoop-administrator/mapred/temp/inject-temp-1144725146
> Injector: Converting injected urls to crawl db entries.
> map url: http://www.yahoo.com/
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080521130128
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080521130128
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> http.proxy.host = proxy.abc.com
> http.proxy.port = 8080
> http.timeout = 10000
> http.content.limit = 65536
> http.agent = abc/Nutch-0.9 (Acompany)
> protocol.plugin.check.blocking = true
> protocol.plugin.check.robots = true
> fetcher.server.delay = 1000
> http.max.delays = 100
> Configured Client
> fetch of http://www.yahoo.com/ failed with: Http code=407,
> url=http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080521130128]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080521130140
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080521130140
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> http.proxy.host = proxy.abc.com
> http.proxy.port = 8080
> http.timeout = 10000
> http.content.limit = 65536
> http.agent = ABC/Nutch-0.9 (Acompany)
> protocol.plugin.check.blocking = true
> protocol.plugin.check.robots = true
> fetcher.server.delay = 1000
> http.max.delays = 100
> Configured Client
> fetch of http://www.yahoo.com/ failed with: Http code=407,
> url=http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080521130140]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080521130154
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080521130154
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> http.proxy.host = proxy.abc.com
> http.proxy.port = 8080
> http.timeout = 10000
> http.content.limit = 65536
> http.agent = ABC/Nutch-0.9 (Acompany)
> protocol.plugin.check.blocking = true
> protocol.plugin.check.robots = true
> fetcher.server.delay = 1000
> http.max.delays = 100
> Configured Client
> fetch of http://www.yahoo.com/ failed with: Http code=407,
> url=http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080521130154]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: crawl/segments/20080521130128
> LinkDb: adding segment: crawl/segments/20080521130140
> LinkDb: adding segment: crawl/segments/20080521130154
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20080521130128
> Indexer: adding segment: crawl/segments/20080521130140
> Indexer: adding segment: crawl/segments/20080521130154
> Optimizing index.
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Exception in thread "main" java.io.IOException: Job failed!
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:137)
>
> [EMAIL PROTECTED] /cygdrive/d/nutch-0.9-IntranetS-Proxy-C
>
>
>
> Please clarify what the issue with the configuration is, and also guide me if
> any configuration is missing. Your help will be greatly appreciated. Thanks in
> advance.
>
> Regards
> Siva
>
>
>
