TWIMC -

WWW::Scraper is a module for scraping data from various web-based search engines.

This module has lived on CPAN for a couple of years as WWW::Search::Scraper. Like the WWW::Search version, WWW::Scraper does the following:

1. Sends a query to the target search engine.
2. Scans the resulting list pages, extracting data from the HTML and delivering it as discrete fields in multiple response objects.
3. "Backends" customized to each search engine (e.g., Google, NorthernLight) are written in Perl, using whatever modules and methods the backend's author chooses to parse the result-list HTML.

Beyond the WWW::Search version, WWW::Scraper extends these capabilities as follows:

4. "Backends" (herein referred to as "search engine interfaces") may be specified using a number of different methods -
4a. Rules-based parsing (the so-called "Scraper frame"), combining HTML tag-capture with text-capture and matching (see the sketch after this list).
4b. HTML may be converted to XML via "HTML Tidy" (invoked by Scraper) and parsed via XPath-ish formulae.
4c. Rules may be extended by adding custom framing rules.
4d. All the above methods (including Perl) may be applied simultaneously in any single search engine interface.
4e. Sherlock modules are automatically converted to Scraper frames.
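
To make 4a concrete, here is an illustrative "Scraper frame" for a hypothetical job-search engine. The rule names ('HTML', 'COUNT', 'NEXT', 'HIT*', 'A', 'TD') follow the style of the WWW::Search::Scraper examples; the regular expressions and field names are assumptions made for the sketch:

    # Illustrative frame only: capture the hit count, the next-page
    # link, then url/title/location fields from each result row.
    my $scraperFrame =
      [ 'HTML',
        [
          [ 'COUNT', 'Jobs 1 - \d+ of (\d+)' ]  # total-hits pattern (assumed)
         ,[ 'NEXT',  1, 'Next' ]                # "next page" link text (assumed)
         ,[ 'HIT*',                             # repeat over each result row
            [
              [ 'A',  'url', 'title' ]          # anchor tag -> url and title
             ,[ 'TD', 'location' ]              # table cell -> location
            ]
          ]
        ]
      ];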

5. Parsing is extended into the "detail" page(s) associated with each item listed on the search engine's result list.
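
Conceptually, this means following each hit's URL and extracting further fields from the page behind it. A stand-alone sketch of the idea (not WWW::Scraper's actual API; the URL and the "Salary:" pattern are hypothetical):

    use strict;
    use warnings;
    use LWP::Simple;

    # Fetch the detail page behind one hit and scrape an extra field.
    # The URL and the pattern are hypothetical.
    my $result_url  = 'http://jobs.example.com/listing/12345';
    my $detail_html = get($result_url) or die "could not fetch detail page";
    my ($salary)    = $detail_html =~ m{Salary:\s*([^<]+)}i;
    print "salary: $salary\n" if defined $salary;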

6. Canonical Request/Response Model: canonical queries are converted to native queries, and native responses are converted to canonical responses. For instance, "location" is specified by different search engines as "zip=94043", "state=CA&city=Mountain View", or "areacode=650". All of these are specified canonically as "location=US-CA-Mountain View" and translated to the appropriate native field by the search engine interface. Native response fields are similarly translated to the canonical form upon return. (This obviously implies some-to-many and many-to-some translations, which are easily accommodated by Scraper's array-based field values.)
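
As a stand-alone illustration of this translation (not the module's internal code; the engine names and native fields are hypothetical):

    use strict;
    use warnings;

    # Split the canonical location into its parts; the limit of 3 keeps
    # "Mountain View" intact.
    my ($country, $state, $city) = split /-/, 'US-CA-Mountain View', 3;

    # One canonical field fans out to different native fields per engine.
    # Engine names are hypothetical; a zip- or areacode-based engine would
    # need a lookup table keyed on city/state.
    my %native_query_for = (
        EngineA => "state=$state&city=$city",
        EngineB => "areacode=650",   # would come from such a lookup
    );
    print "$_: $native_query_for{$_}\n" for sort keys %native_query_for;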

7. Search engine interfaces will be bundled into categories, based on the Request/Response canon that each uses (e.g., Auction, Finance, Housing, Job). This will make it easier to maintain search engine interfaces separately from maintenance of the core Scraper.
