Hi,

I found this basic sample and I'd like to confirm my understanding of the use cases and best practices (applicability) of HBase... Thanks!
=============


Sample (Ankur Goel, 27-March-08, http://markmail.org/message/kbm3ys2eqnjn3ipe - I can't reply via [email protected] or Nabble):
=============

DESCRIPTION: Used to store seed URLs (both old and newly discovered).
             Initially populated with some seed URLs. The crawl controller
             picks up the seeds from this table that have status=0 (Not Visited)
             or status=2 (Visited, but ready for re-crawl) and feeds these
             seeds in batch to the different crawl engines that it knows about.

SCHEMA:      Column families below

        {"referer_id:", "100"}, // the integer here is Max_Length
        {"url:","1500"},
        {"site:","500"},
        {"last_crawl_date:", "1000"},
        {"next_crawl_date:", "1000"},
        {"create_date:","100"},
        {"status:","100"},
        {"strike:", "100"},
        {"language:","150"},
        {"topic:","500"},
        {"depth:","100000"}
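
For reference, a schema along these lines could be declared in the HBase shell roughly as follows. This is only a sketch: the table name 'seed_urls' is mine, each entry above is treated as a column family, and the Max_Length values are left as application-level conventions rather than HBase settings:

```
# Hypothetical table name; one column family per entry in the schema above.
# (Max_Length is enforced by the application in this sketch, not by HBase.)
create 'seed_urls', 'referer_id', 'url', 'site', 'last_crawl_date',
       'next_crawl_date', 'create_date', 'status', 'strike', 'language',
       'topic', 'depth'
```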


======================
Modified Schema & Analysis (Fuad Efendi):

My understanding is that we need to scan the whole table in order to find records where (for instance) "last_crawl_date" is less than a specific point in time... Additionally, the crawler should be polite, and the list of URLs to fetch should be evenly distributed between domains/hosts/IPs.

A few solutions for finding such "last_crawl_date" records have been discussed in blogs, on the mailing list, etc.:
- use a scanner
- maintain an additional Lucene index
- run a MapReduce job (multithreaded, parallel) outputting the list of URLs
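
For comparison, the scanner option boils down to touching every row, since there is no secondary index on "last_crawl_date". A rough in-memory sketch of that full scan (the table contents and helper names are illustrative, not the real HBase client API):

```python
# Illustrative in-memory stand-in for the seed table: row_id -> columns.
# With a real HBase scanner you would iterate every row in the same way,
# because there is no index on last_crawl_date.
seed_table = {
    "row1": {"url": "www.website1.com", "last_crawl_date": 1000},
    "row2": {"url": "www.website2.com", "last_crawl_date": 5000},
    "row3": {"url": "www.website3.com", "last_crawl_date": 2000},
}

def urls_due_for_recrawl(table, cutoff):
    """Full scan: touch every row, keep those crawled before `cutoff`."""
    return [cols["url"] for cols in table.values()
            if cols["last_crawl_date"] < cutoff]

print(urls_due_for_recrawl(seed_table, 3000))
# ['www.website1.com', 'www.website3.com']
```

The cost is linear in the table size on every polling cycle, which is exactly what the composite-key scheme below tries to avoid.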


My own possible solution; I need your feedback:
====================

Simplified schema with two tables (non-transactional):

1. URL_TO_FETCH
{"internal_link_id" + "last_crawl_date", "1000"} PRIMARY KEY (sorted row_id),
        {"url:","1500"},

2. URL_CONTENT
        {"url:","1500"}  PRIMARY KEY (sorted row_id),
        {"site:","500"},
        ... ... ...,
        {"language:","150"},
        {"topic:","500"},
        {"depth:","100000"}
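
The composite row_id works because HBase stores rows sorted lexicographically by key: as long as each numeric component is zero-padded to a fixed width, string order equals numeric order. A minimal sketch of building such a key (the helper name and widths are mine):

```python
def make_row_key(internal_link_id, last_crawl_date, width=10):
    # Fixed-width, zero-padded decimal fields: lexicographic order of the
    # row keys then matches numeric order of the fields, which is what
    # HBase's sorted row_ids give us for free.
    return f"{internal_link_id:0{width}d}{last_crawl_date:0{width}d}"

keys = [make_row_key(i, d) for i, d in [(1, 5), (0, 7), (1, 3), (0, 2)]]
print(sorted(keys))
# ['00000000000000000002', '00000000000000000007',
#  '00000000010000000003', '00000000010000000005']
```

Which component comes first in the key decides the scan order, so the choice between "id + date" and "date + id" determines whether a scan from the start of the table groups rows by link id or returns the stalest URLs first.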


Table URL_TO_FETCH is initially seeded with root domain names and a "dummy" last_crawl_date (a unique-per-host 'old' timestamp):
00000000000000000001  www.website1.com
00000000000000000002  www.website2.com
00000000000000000003  www.website3.com
00000000000000000004  www.website4.com
...


After successful fetch of initial URLs:
00000000010000000001  www.website1.com/page1
00000000010000000002  www.website2.com/page1
00000000010000000003  www.website3.com/page1
00000000010000000004  www.website4.com/page1
...
00000000020000000001  www.website1.com/page2
00000000020000000002  www.website2.com/page2
00000000020000000003  www.website3.com/page2
00000000020000000004  www.website4.com/page2
...
00000000030000000001  www.website1.com/page3
00000000030000000002  www.website2.com/page3
00000000030000000003  www.website3.com/page3
00000000030000000004  www.website4.com/page3
...
...
...
0000000000xxxxxxxxxx  www.website1.com
0000000000xxxxxxxxxx  www.website2.com
0000000000xxxxxxxxxx  www.website3.com
0000000000xxxxxxxxxx  www.website4.com
...

(xxxxxxxxxx is the current time in milliseconds: the timestamp recorded after a successful fetch)
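
Since HBase has no "update row key" operation, moving a fetched item to the end of the table would be a delete of the old row plus a put under the new timestamp-based key. A small in-memory sketch of that move (the sorted list stands in for HBase's sorted row keys; the names and exact key layout are illustrative):

```python
import bisect

# In-memory stand-in for the sorted URL_TO_FETCH table: a sorted list of
# (row_key, url) pairs. HBase keeps rows sorted by row_id the same way.
table = []

def put(row_key, url):
    bisect.insort(table, (row_key, url))

def delete(row_key, url):
    table.remove((row_key, url))

def mark_fetched(old_key, url, now_ms, key_width=20):
    # "Mutable primary key" = delete old row + put under a new key.
    new_key = f"{now_ms:0{key_width}d}"  # illustrative timestamp-based key
    delete(old_key, url)
    put(new_key, url)

# Seed with dummy old timestamps -> rows sit at the start of the table.
put("00000000000000000001", "www.website1.com")
put("00000000000000000002", "www.website2.com")

mark_fetched("00000000000000000001", "www.website1.com", now_ms=1206576000000)

print(table[0][1])   # next URL due: the still-unfetched www.website2.com
print(table[-1][1])  # the fetched item has moved to the end of the table
```

The point of the sketch is only the ordering behaviour: unfetched rows stay at the head of the scan, fetched rows sink to the tail.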

What we have:
- no additional Lucene index is needed;
- no MapReduce job is needed to populate the list of items to be fetched (the way it's done in Nutch);
- no thousands of per-host scanners;
- a mutable primary key: all new records are inserted at the beginning of the table, and fetched items are moved to the end.

The second (helper) table is indexed by URL:
        {"url:","1500"}  PRIMARY KEY (sorted row_id),
        ...


Am I right? It looks cool that, at extremely low cost, I can maintain a crawl-specific "reordering" via the mutable primary key...

Thanks,
Fuad



