Hi,

Thanks for the reply. Please find the requested files attached.

Thanks & Regards
Siva
65567233

-----Original Message-----
From: All day coders [mailto:[EMAIL PROTECTED] 
Sent: Saturday, May 24, 2008 11:13 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Nutch Crawling - Failed for internet crawling

Do you mind attaching the configuration files? That way they are more human
readable. The hadoop.log file will be useful too (if it is too big, please
compress it).

On Wed, May 21, 2008 at 1:27 AM, Sivakumar_NCS <[EMAIL PROTECTED]>
wrote:

>
> Hi,
>
> I am new to crawling and am exploring the possibility of crawling internet
> websites from my work PC. My work environment uses a proxy to access the
> web, so I have configured the proxy information under <NUTCH_HOME>/conf/ by
> overriding nutch-site.xml. Attached is the XML for reference.
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
> <property>
>  <name>http.agent.name</name>
>  <value>ABC</value>
>  <description>ABC</description>
> </property>
> <property>
>  <name>http.agent.description</name>
>  <value>Acompany</value>
>  <description>A company</description>
> </property>
> <property>
>  <name>http.agent.url</name>
>  <value></value>
>  <description></description>
> </property>
> <property>
> <name>http.agent.email</name>
>  <value></value>
>  <description></description>
> </property>
> <property>
>  <name>http.timeout</name>
>  <value>10000</value>
>  <description>The default network timeout, in milliseconds.</description>
> </property>
> <property>
>  <name>http.max.delays</name>
>  <value>100</value>
>  <description>The number of times a thread will delay when trying to
>  fetch a page.  Each time it finds that a host is busy, it will wait
>  fetcher.server.delay.  After http.max.delays attempts, it will give
>  up on the page for now.</description>
> </property>
> <property>
>  <name>plugin.includes</name>
>  <value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>  <description>Regular expression naming plugin directory names to
>  include.  Any plugin not matching this expression is excluded.
>  In any case you need at least include the nutch-extensionpoints plugin. By
>  default Nutch includes crawling just HTML and plain text via HTTP,
>  and basic indexing and search plugins. In order to use HTTPS please enable
>  protocol-httpclient, but be aware of possible intermittent problems with the
>  underlying commons-httpclient library.
>  </description>
> </property>
> <property>
>  <name>http.proxy.host</name>
>  <value>proxy.ABC.COM</value><!--MY WORK PROXY-->
>  <description>The proxy hostname.  If empty, no proxy is used.</description>
> </property>
> <property>
>  <name>http.proxy.port</name>
>  <value>8080</value>
>  <description>The proxy port.</description>
> </property>
> <property>
>  <name>http.proxy.username</name>
>  <value>ABCUSER</value><!--MY NETWORK USERID-->
>  <description>Username for proxy. This will be used by
>  'protocol-httpclient', if the proxy server requests basic, digest
>  and/or NTLM authentication. To use this, 'protocol-httpclient' must
>  be present in the value of 'plugin.includes' property.
>  NOTE: For NTLM authentication, do not prefix the username with the
>  domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
>  </description>
> </property>
> <property>
>  <name>http.proxy.password</name>
>  <value>XXXXX</value>
>  <description>Password for proxy. This will be used by
>  'protocol-httpclient', if the proxy server requests basic, digest
>  and/or NTLM authentication. To use this, 'protocol-httpclient' must
>  be present in the value of 'plugin.includes' property.
>  </description>
> </property>
> <property>
>  <name>http.proxy.realm</name>
>  <value>ABC</value><!--MY NETWORK DOMAIN-->
>  <description>Authentication realm for proxy. Do not define a value
>  if realm is not required or authentication should take place for any
>  realm. NTLM does not use the notion of realms. Specify the domain name
>  of NTLM authentication as the value for this property. To use this,
>  'protocol-httpclient' must be present in the value of
>  'plugin.includes' property.
>  </description>
> </property>
> <property>
>  <name>http.agent.host</name>
>  <value>xxx.xxx.xxx.xx</value><!--MY LOCAL PC'S IP-->
>  <description>Name or IP address of the host on which the Nutch crawler
>  would be running. Currently this is used by 'protocol-httpclient'
>  plugin.
>  </description>
> </property>
> </configuration>
>
> my crawl-urlfilter.txt is as follows:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +^http://([a-z0-9]*\.)*yahoo.com/
>
> # skip everything else
> -.
>
>
> my regex-urlfilter.txt is as follows:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> # +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> #+^http://([a-z0-9]*\.)*apache.org/
> +^http://([a-z0-9]*\.)*yahoo.com/
>
> # skip everything else
> -.
>
> Also attached is the console output / hadoop.log:
>
> [EMAIL PROTECTED] /cygdrive/d/nutch-0.9-IntranetS-Proxy-C
> $ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> topN = 50
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> tempDir:::/tmp/hadoop-administrator/mapred/temp/inject-temp-1144725146
> Injector: Converting injected urls to crawl db entries.
> map url: http://www.yahoo.com/
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080521130128
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080521130128
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> http.proxy.host = proxy.abc.com
> http.proxy.port = 8080
> http.timeout = 10000
> http.content.limit = 65536
> http.agent = abc/Nutch-0.9 (Acompany)
> protocol.plugin.check.blocking = true
> protocol.plugin.check.robots = true
> fetcher.server.delay = 1000
> http.max.delays = 100
> Configured Client
> fetch of http://www.yahoo.com/ failed with: Http code=407, url=http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080521130128]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080521130140
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080521130140
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> http.proxy.host = proxy.abc.com
> http.proxy.port = 8080
> http.timeout = 10000
> http.content.limit = 65536
> http.agent = ABC/Nutch-0.9 (Acompany)
> protocol.plugin.check.blocking = true
> protocol.plugin.check.robots = true
> fetcher.server.delay = 1000
> http.max.delays = 100
> Configured Client
> fetch of http://www.yahoo.com/ failed with: Http code=407, url=http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080521130140]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080521130154
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080521130154
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> http.proxy.host = proxy.abc.com
> http.proxy.port = 8080
> http.timeout = 10000
> http.content.limit = 65536
> http.agent = ABC/Nutch-0.9 (Acompany)
> protocol.plugin.check.blocking = true
> protocol.plugin.check.robots = true
> fetcher.server.delay = 1000
> http.max.delays = 100
> Configured Client
> fetch of http://www.yahoo.com/ failed with: Http code=407, url=http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080521130154]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: crawl/segments/20080521130128
> LinkDb: adding segment: crawl/segments/20080521130140
> LinkDb: adding segment: crawl/segments/20080521130154
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20080521130128
> Indexer: adding segment: crawl/segments/20080521130140
> Indexer: adding segment: crawl/segments/20080521130154
> Optimizing index.
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Exception in thread "main" java.io.IOException: Job failed!
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:137)
>
> [EMAIL PROTECTED] /cygdrive/d/nutch-0.9-IntranetS-Proxy-C
>
>
>
> Please clarify what the issue with the configuration is, and let me know if
> any configuration is missing. Your help will be greatly appreciated. Thanks
> in advance.
>
> Regards
> Siva
>
> --
> View this message in context:
> http://www.nabble.com/Nutch-Crawling---Failed-for-internet-crawling-tp17356187p17356187.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>
>
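The repeated "Http code=407" in the quoted log above is the proxy answering "Proxy Authentication Required", i.e. the request is being rejected at the proxy before it ever reaches yahoo.com, and the Dedup IOException at the end is most likely just a knock-on effect of nothing having been fetched or indexed. One quick way to take Nutch out of the picture is to try the same proxy credentials with a plain HTTP client from the same shell. A minimal sketch, using the host, port, user and password from the configuration above purely as placeholders (note that wget only speaks Basic proxy authentication, so an NTLM-only proxy may still answer 407 here even if the credentials are right):

  # hypothetical sanity check from the same Cygwin shell used for the crawl
  export http_proxy=http://proxy.ABC.COM:8080
  wget --proxy-user=ABCUSER --proxy-password='XXXXX' -O /dev/null http://www.yahoo.com/

If this also fails with 407, the problem lies with the credentials, realm or authentication scheme rather than with the Nutch side. The attached configuration and filter files follow below.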
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse 
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops 
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://([a-z0-9]*\.)*yahoo.com/

# skip everything else
-.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>http.agent.name</name>
  <value>ABC</value>
  <description>ABC</description>
</property>
<property>
  <name>http.agent.description</name>
  <value>Acompany</value>
  <description>A company</description>
</property>
<property>
  <name>http.agent.url</name>
  <value></value>
  <description></description>
</property>
<property>
<name>http.agent.email</name>
  <value></value>
  <description></description>
</property>
<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
<property>
  <name>http.max.delays</name>
  <value>100</value>
  <description>The number of times a thread will delay when trying to
  fetch a page.  Each time it finds that a host is busy, it will wait
  fetcher.server.delay.  After http.max.delays attempts, it will give
  up on the page for now.</description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
<property>
  <name>http.proxy.host</name>
  <value>proxy.ABC.COM</value><!--MY WORK PROXY-->
  <description>The proxy hostname.  If empty, no proxy is used.</description>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
  <description>The proxy port.</description>
</property>
<property>
  <name>http.proxy.username</name>
  <value>ABCUSER</value><!--MY NETWORK USERID-->
  <description>Username for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  NOTE: For NTLM authentication, do not prefix the username with the
  domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
  </description>
</property>
<property>
  <name>http.proxy.password</name>
  <value>XXXXX</value>
  <description>Password for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  </description>
</property>
<property>
 <name>http.proxy.realm</name>
  <value>ABC</value><!--MY NETWORK DOMAIN-->
  <description>Authentication realm for proxy. Do not define a value
  if realm is not required or authentication should take place for any
  realm. NTLM does not use the notion of realms. Specify the domain name
  of NTLM authentication as the value for this property. To use this,
  'protocol-httpclient' must be present in the value of
  'plugin.includes' property.
  </description>
</property>
<property>
  <name>http.agent.host</name>
  <value>xxx.xxx.xxx.xx</value><!--MY LOCAL PC'S IP-->
  <description>Name or IP address of the host on which the Nutch crawler
  would be running. Currently this is used by 'protocol-httpclient'
  plugin.
  </description>
</property>
</configuration>
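One detail in the nutch-site.xml above that may be worth a try, offered as a guess rather than a confirmed fix: plugin.includes lists both protocol-http and protocol-httpclient, but (as the property descriptions above note) only protocol-httpclient uses the http.proxy.username / http.proxy.password / http.proxy.realm settings. If the plain protocol-http plugin ends up serving the http URLs, those credentials are never sent and the proxy will keep replying 407. A sketch of the property with protocol-http dropped, so that protocol-httpclient alone handles http and https:

<property>
  <name>plugin.includes</name>
  <!-- assumption: removing protocol-http so that protocol-httpclient, which
       supports proxy authentication, is the only handler for http URLs -->
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>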
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse 
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops 
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
#+^http://([a-z0-9]*\.)*apache.org/
+^http://([a-z0-9]*\.)*yahoo.com/

# skip everything else
-.
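A small side note on the accept rule in both filter files: the dots in "yahoo.com" are not escaped, so in the regular expression each one matches any character. The rule still works here, but escaping the dots keeps it from matching unintended hosts, and the same applies if the apache.org line is ever re-enabled. A sketch of the stricter form (same rule, dots escaped):

# accept hosts in yahoo.com (dots escaped so '.' matches a literal dot)
+^http://([a-z0-9]*\.)*yahoo\.com/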
