Hi,
I found this basic sample and I'd like to confirm my understanding of the
use cases and best practices (applicability) of HBase... Thanks!
=============
Sample (Ankur Goel, 27-March-08,
http://markmail.org/message/kbm3ys2eqnjn3ipe - I can't reply via
[email protected] or Nabble):
=============
DESCRIPTION: Used to store seed URLs (both old and newly discovered).
             Initially populated with some seed URLs. The crawl controller
             picks up the seeds from this table that have status=0 (Not Visited)
             or status=2 (Visited, but ready for re-crawl) and feeds these
             seeds in batch to different crawl engines that it knows about.
SCHEMA: Column families below
{"referer_id:", "100"}, // Integer here is Max_Length
{"url:","1500"},
{"site:","500"},
{"last_crawl_date:", "1000"},
{"next_crawl_date:", "1000"},
{"create_date:","100"},
{"status:","100"},
{"strike:", "100"},
{"language:","150"},
{"topic:","500"},
{"depth:","100000"}
======================
Modified Schema & Analysis (Fuad Efendi):
My understanding is that we need to scan the whole table in order to find
records where (for instance) "last_crawl_date" is "less than a specific
point in time"... Additionally, the crawler should be polite, and the list
of URLs to fetch should be evenly distributed across domains/hosts/IPs.
A few solutions for finding such records by "last_crawl_date" have been
discussed briefly in blogs, on the mailing list, etc.:
- to have a scanner (a rough sketch of this option follows the list)
- to have an additional Lucene index
- to have a MapReduce job (multithreaded, parallel) outputting the list of URLs
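Here is the rough sketch of the first option (a plain scanner), so you can
tell me whether I have the mechanics right. It assumes last_crawl_date is
stored as a fixed-width string (e.g. zero-padded epoch seconds) so that
byte-order "LESS" matches "older than"; note that it still touches every
row, which is exactly the cost I would like to avoid:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class StaleUrlScan {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "urls"); // table name assumed
    // Example cutoff: re-crawl everything not crawled in the last 24 hours.
    long cutoff = System.currentTimeMillis() / 1000L - 24 * 3600L;
    Scan scan = new Scan();
    // Assumes last_crawl_date is a zero-padded epoch-seconds string,
    // so lexicographic "LESS" means "older than the cutoff".
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("last_crawl_date"), Bytes.toBytes(""),
        CompareOp.LESS, Bytes.toBytes(String.format("%010d", cutoff))));
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      byte[] url = r.getValue(Bytes.toBytes("url"), Bytes.toBytes(""));
      System.out.println(Bytes.toString(url));
    }
    scanner.close();
    table.close();
  }
}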
My own possible solution - I need your feedback:
====================
Simplified schema with two tables (non-transactional); a creation sketch
follows the schema:
1. URL_TO_FETCH
{"internal_link_id" + "last_crawl_date", "1000"} PRIMARY KEY
(sorted row_id),
{"url:","1500"},
2. URL_CONTENT
{"url:","1500"} PRIMARY KEY (sorted row_id),
{"site:","500"},
... ... ...,
{"language:","150"},
{"topic:","500"},
{"depth:","100000"}
Table URL_TO_FETCH is initially seeded with the root domain names and a
"dummy" last_crawl_date (a unique-per-host 'old' timestamp):
00000000000000000001 www.website1.com
00000000000000000002 www.website2.com
00000000000000000003 www.website3.com
00000000000000000004 www.website4.com
...
After successful fetch of initial URLs:
00000000010000000001 www.website1.com/page1
00000000010000000002 www.website2.com/page1
00000000010000000003 www.website3.com/page1
00000000010000000004 www.website4.com/page1
...
00000000020000000001 www.website1.com/page2
00000000020000000002 www.website2.com/page2
00000000020000000003 www.website3.com/page2
00000000020000000004 www.website4.com/page2
...
00000000030000000001 www.website1.com/page3
00000000030000000002 www.website2.com/page3
00000000030000000003 www.website3.com/page3
00000000030000000004 www.website4.com/page3
...
...
...
0000000000xxxxxxxxxx www.website1.com
0000000000xxxxxxxxxx www.website2.com
0000000000xxxxxxxxxx www.website3.com
0000000000xxxxxxxxxx www.website4.com
...
(xxxxxxxxxx is the "current time in milliseconds" - the timestamp assigned
after a successful fetch; the update itself is sketched below)
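The update I have in mind for a successful fetch (the "mutable primary key")
is sketched below: delete the old row and re-insert it under a key whose
timestamp part is the current time, so the row physically moves to the end
of the sorted table. The key layout here (a 10-digit zero-padded timestamp
in seconds - milliseconds would need a wider field - followed by a 10-digit
link id), the helper names and the single "url" family are my assumptions:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class MoveToEndOnFetch {
  // Row key: zero-padded last-crawl time (seconds) + zero-padded link id.
  // Widths and field order are my assumptions, not part of the schema above.
  static byte[] rowKey(long crawlTimeSeconds, long linkId) {
    return Bytes.toBytes(String.format("%010d%010d", crawlTimeSeconds, linkId));
  }

  // After a successful fetch: delete the old row and re-insert it under a key
  // whose timestamp part is "now", so the row moves to the end of URL_TO_FETCH.
  static void markFetched(HTable urlToFetch, long oldCrawlTime, long linkId,
      String url) throws IOException {
    urlToFetch.delete(new Delete(rowKey(oldCrawlTime, linkId)));
    Put put = new Put(rowKey(System.currentTimeMillis() / 1000L, linkId));
    put.add(Bytes.toBytes("url"), Bytes.toBytes(""), Bytes.toBytes(url));
    urlToFetch.put(put);
  }

  public static void main(String[] args) throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "URL_TO_FETCH");
    markFetched(table, 0L, 1L, "www.website1.com");
    table.close();
  }
}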
What we get: we don't need an additional Lucene index; we don't need a
MapReduce job to populate the list of items to be fetched (the way it's
done in Nutch); we don't need thousands of per-host scanners; we have a
mutable primary key; all new records are inserted at the beginning of
the table, and fetched items are moved to the end of the table.
The second (helper) table is indexed by URL:
{"url:","1500"} PRIMARY KEY (sorted row_id),
...
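Looking up everything we know about a page is then a single point Get on
URL_CONTENT, keyed by the URL (a sketch, with the same family/qualifier
assumptions as above):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class UrlContentLookup {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "URL_CONTENT");
    // Point lookup by URL row key.
    Result r = table.get(new Get(Bytes.toBytes("www.website1.com/page1")));
    byte[] language = r.getValue(Bytes.toBytes("language"), Bytes.toBytes(""));
    System.out.println(Bytes.toString(language));
    table.close();
  }
}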
Am I right? It looks cool that, at extremely low cost, I can maintain a
specific "reordering" via the mutable primary key, following the
crawl-specific requirements...
Thanks,
Fuad