On 6/22/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
> Somehow my original post doesn't render properly because of my email editor.
> I'll try using html entities. Here is the list of fetches I get:
> fetching http://www.variety.com/RSS.asp
> fetching http://www.variety.com/boxoffice
> fetching http://www.variety.com/pilotwatch2007
> fetching http://www.variety.com/</div>
> fetching http://www.variety.com/review/VE1117933968
> fetching http://www.variety.com/graphics/marketing/siriussurvey07_6.html
> fetching http://www.variety.com/review/VE1117933972
> fetching http://www.variety.com/article/VR1117967371
> fetching http://www.variety.com/</div></a>
>
> These two are weird:
> fetching http://www.variety.com/</div>
> fetching http://www.variety.com/</div></a>
>
> (note the ampersand lt; should come out as a less than symbol)
>
> Why does nutch try to fetch these?

These 'urls' most likely come from the parse-js plugin. Can you disable it
and see if they disappear? To extract links from js code, parse-js uses a
heuristic that unfortunately may also extract garbage urls. Note that a
well-targeted urlfilter can filter out such urls.
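For example -- just a sketch against Nutch 0.9, and note that the plugin list
below is only an approximation of the default, so copy the real value from
your conf/nutch-default.xml before editing it -- you can disable parse-js by
overriding plugin.includes in conf/nutch-site.xml and dropping the "|js" part
of the parse pattern:

<property>
  <name>plugin.includes</name>
  <!-- default list copied from nutch-default.xml, with parse-(text|html|js)
       changed to parse-(text|html) so the JavaScript parser is never loaded -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

Or, to filter the garbage urls instead, add a rule to crawl-urlfilter.txt
above the "+^http://www.variety.com" accept line (the first matching pattern
wins) that rejects any url containing markup characters:

# skip 'urls' containing html markup characters (garbage extracted from js)
-[<>]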
> ----- Original Message ----
> From: Kai_testing Middleton <[EMAIL PROTECTED]>
> To: nutch user <[EMAIL PROTECTED]>
> Sent: Thursday, June 21, 2007 3:24:05 PM
> Subject: fetching http://www.variety.com/
>
> I'm new to nutch and attempting a few simple tests in preparation for some
> major crawling work.
>
> My current test is to crawl www.variety.com to a depth of 2.
>
> I have set things up as I'm supposed to but I get the following in my crawl
> output:
>
> fetching http://www.variety.com/RSS.asp
> fetching http://www.variety.com/boxoffice
> fetching http://www.variety.com/pilotwatch2007
> fetching http://www.variety.com/
> fetching http://www.variety.com/review/VE1117933968
> fetching http://www.variety.com/graphics/marketing/siriussurvey07_6.html
> fetching http://www.variety.com/review/VE1117933972
> fetching http://www.variety.com/article/VR1117967371
> fetching http://www.variety.com/
>
> Is nutch seriously broken? Why is it trying to fetch those two URLs with the
> embedded html?
>
> Details:
>
> I'm running nutch 0.9 (the stable download from April 2007) on BSD with java
> 1.5.
>
> I invoke nutch as follows:
> nutch crawl urls.txt -dir mydir -depth 2 2>&1 | tee crawl.log
>
> urls.txt contains this:
> http://www.variety.com
>
> crawl-urlfilter.txt contains this:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://www.variety.com
>
> # skip everything else
> -.
>
> I have the following in nutch-site.xml
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
> <configuration>
> <property>
> <name>http.agent.name</name>
> <value>testbed-random</value>
> <description>HTTP 'User-Agent' request header. MUST NOT be empty -
> please set this to a single word uniquely related to your organization.
> NOTE: You should also check other related properties:
>
> http.robots.agents
> http.agent.description
> http.agent.url
> http.agent.email
> http.agent.version
> and set their values appropriately.
> </description>
> </property>
> <property>
> <name>http.agent.description</name>
> <value>crawler v0.9</value>
> <description>Further description of our bot- this text is used in
> the User-Agent header. It appears in parenthesis after the agent name.
> </description>
> </property>
> <property>
> <name>http.agent.url</name>
> http://hopoo.dyndns.org
> <description>A URL to advertise in the User-Agent header. This will
> appear in parenthesis after the agent name. Custom dictates that this
> should be a URL of a page explaining the purpose and behavior of this
> crawler.
> </description>
> </property>
> <property>
> <name>http.agent.email</name>
> <value>kai(underscore)testing(att)yahoo(dotcom)</value>
> <description>An email address to advertise in the HTTP 'From' request
> header and User-Agent header. A good practice is to mangle this
> address (e.g. 'info at example dot com') to avoid spamming.
> </description>
> </property>
> </configuration>
>
> crawl.log follows:
> crawl started in: mydir
> rootUrlDir = urls.txt
> threads = 10
> depth = 2
> Injector: starting
> Injector: crawlDb: mydir/crawldb
> Injector: urlDir: urls.txt
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: mydir/segments/20070621145957
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: mydir/segments/20070621145957
> Fetcher: threads: 10
> fetching http://www.variety.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: mydir/crawldb
> CrawlDb update: segments: [mydir/segments/20070621145957]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: mydir/segments/20070621150010
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: mydir/segments/20070621150010
> Fetcher: threads: 10
> fetching http://www.variety.com/RSS.asp
> fetching http://www.variety.com/boxoffice
> fetching http://www.variety.com/pilotwatch2007
> fetching http://www.variety.com/
> fetching http://www.variety.com/review/VE1117933968
> fetching http://www.variety.com/graphics/marketing/siriussurvey07_6.html
> fetching http://www.variety.com/review/VE1117933972
> fetching http://www.variety.com/article/VR1117967371
> fetching http://www.variety.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: mydir/crawldb
> CrawlDb update: segments: [mydir/segments/20070621150010]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: mydir/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: mydir/segments/20070621145957
> LinkDb: adding segment: mydir/segments/20070621150010
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: mydir/linkdb
> Indexer: adding segment: mydir/segments/20070621145957
> Indexer: adding segment: mydir/segments/20070621150010
> Indexing [http://www.variety.com/] with analyzer [EMAIL PROTECTED] (null)
> Indexing [http://www.variety.com/pilotwatch2007] with analyzer [EMAIL PROTECTED] (null)
> Optimizing index.
> merging segments _ram_0 (1 docs) _ram_1 (1 docs) into _0 (2 docs)
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: mydir/indexes
> Dedup: done
> merging indexes to: mydir/index
> Adding mydir/indexes/part-00000
> done merging
> crawl finished: mydir

--
Doğacan Güney
