On 6/22/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
> Somehow my original post doesn't render properly because of my email editor.
> I'll try using html entities. Here is the list of fetches I get:
> fetching http://www.variety.com/RSS.asp
> fetching http://www.variety.com/boxoffice
> fetching http://www.variety.com/pilotwatch2007
> fetching http://www.variety.com/</div>
> fetching http://www.variety.com/review/VE1117933968
> fetching http://www.variety.com/graphics/marketing/siriussurvey07_6.html
> fetching http://www.variety.com/review/VE1117933972
> fetching http://www.variety.com/article/VR1117967371
> fetching http://www.variety.com/</div></a>
>
> These two are weird:
> fetching http://www.variety.com/</div>
> fetching http://www.variety.com/</div></a>
>
> (note the ampersand lt; should come out as a less than symbol)
>
> Why does nutch try to fetch these?

These 'urls' most likely come from the parse-js plugin. Can you disable it
and see if they disappear? To extract links from js code, parse-js uses a
heuristic that unfortunately may also extract garbage urls. Note that a
well-targeted urlfilter can filter out such urls.
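For example -- just a sketch against Nutch 0.9, and note that the plugin list
below is only an approximation of the default, so copy the real value from
your conf/nutch-default.xml before editing it -- you can disable parse-js by
overriding plugin.includes in conf/nutch-site.xml and dropping the "|js" part
of the parse pattern:

<property>
  <name>plugin.includes</name>
  <!-- default list copied from nutch-default.xml, with parse-(text|html|js)
       changed to parse-(text|html) so the JavaScript parser is never loaded -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

Or, to filter the garbage urls instead, add a rule to crawl-urlfilter.txt
above the "+^http://www.variety.com" accept line (the first matching pattern
wins) that rejects any url containing markup characters:

# skip 'urls' containing html markup characters (garbage extracted from js)
-[<>]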
> ----- Original Message ----
> From: Kai_testing Middleton <[EMAIL PROTECTED]>
> To: nutch user <[EMAIL PROTECTED]>
> Sent: Thursday, June 21, 2007 3:24:05 PM
> Subject: fetching http://www.variety.com/
>
> I'm new to nutch and attempting a few simple tests in preparation for some
> major crawling work.
>
> My current test is to crawl www.variety.com to a depth of 2.
>
> I have set things up as I'm supposed to but I get the following in my crawl
> output:
>
> fetching http://www.variety.com/RSS.asp
> fetching http://www.variety.com/boxoffice
> fetching http://www.variety.com/pilotwatch2007
> fetching http://www.variety.com/
> fetching http://www.variety.com/review/VE1117933968
> fetching http://www.variety.com/graphics/marketing/siriussurvey07_6.html
> fetching http://www.variety.com/review/VE1117933972
> fetching http://www.variety.com/article/VR1117967371
> fetching http://www.variety.com/
>
> Is nutch seriously broken? Why is it trying to fetch those two URLs with the
> embedded html?
>
> Details:
>
> I'm running nutch 0.9 (the stable download from April 2007) on BSD with java
> 1.5.
>
> I invoke nutch as follows:
> nutch crawl urls.txt -dir mydir -depth 2 2>&1 | tee crawl.log
>
> urls.txt contains this:
> http://www.variety.com
>
> crawl-urlfilter.txt contains this:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://www.variety.com
>
> # skip everything else
> -.
>
> I have the following in nutch-site.xml
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
> <configuration>
> <property>
> <name>http.agent.name</name>
> <value>testbed-random</value>
> <description>HTTP 'User-Agent' request header. MUST NOT be empty -
> please set this to a single word uniquely related to your organization.
> NOTE: You should also check other related properties:
>
> http.robots.agents
> http.agent.description
> http.agent.url
> http.agent.email
> http.agent.version
> and set their values appropriately.
> </description>
> </property>
> <property>
> <name>http.agent.description</name>
> <value>crawler v0.9</value>
> <description>Further description of our bot- this text is used in
> the User-Agent header. It appears in parenthesis after the agent name.
> </description>
> </property>
> <property>
> <name>http.agent.url</name>
> http://hopoo.dyndns.org
> <description>A URL to advertise in the User-Agent header. This will
> appear in parenthesis after the agent name. Custom dictates that this
> should be a URL of a page explaining the purpose and behavior of this
> crawler.
> </description>
> </property>
> <property>
> <name>http.agent.email</name>
> <value>kai(underscore)testing(att)yahoo(dotcom)</value>
> <description>An email address to advertise in the HTTP 'From' request
> header and User-Agent header. A good practice is to mangle this
> address (e.g. 'info at example dot com') to avoid spamming.
> </description>
> </property>
> </configuration>
>
> crawl.log follows:
> crawl started in: mydir
> rootUrlDir = urls.txt
> threads = 10
> depth = 2
> Injector: starting
> Injector: crawlDb: mydir/crawldb
> Injector: urlDir: urls.txt
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: mydir/segments/20070621145957
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: mydir/segments/20070621145957
> Fetcher: threads: 10
> fetching http://www.variety.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: mydir/crawldb
> CrawlDb update: segments: [mydir/segments/20070621145957]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: mydir/segments/20070621150010
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: mydir/segments/20070621150010
> Fetcher: threads: 10
> fetching http://www.variety.com/RSS.asp
> fetching http://www.variety.com/boxoffice
> fetching http://www.variety.com/pilotwatch2007
> fetching http://www.variety.com/
> fetching http://www.variety.com/review/VE1117933968
> fetching http://www.variety.com/graphics/marketing/siriussurvey07_6.html
> fetching http://www.variety.com/review/VE1117933972
> fetching http://www.variety.com/article/VR1117967371
> fetching http://www.variety.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: mydir/crawldb
> CrawlDb update: segments: [mydir/segments/20070621150010]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: mydir/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: mydir/segments/20070621145957
> LinkDb: adding segment: mydir/segments/20070621150010
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: mydir/linkdb
> Indexer: adding segment: mydir/segments/20070621145957
> Indexer: adding segment: mydir/segments/20070621150010
> Indexing [http://www.variety.com/] with analyzer [EMAIL PROTECTED] (null)
> Indexing [http://www.variety.com/pilotwatch2007] with analyzer [EMAIL PROTECTED] (null)
> Optimizing index.
> merging segments _ram_0 (1 docs) _ram_1 (1 docs) into _0 (2 docs)
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: mydir/indexes
> Dedup: done
> merging indexes to: mydir/index
> Adding mydir/indexes/part-00000
> done merging
> crawl finished: mydir

--
Doğacan Güney
