Hi,
I found this basic sample and I'd like to confirm my understanding of the
use cases and best practices (applicability) of HBase... Thanks!
=============
Sample (Ankur Goel, 27-March-08,
http://markmail.org/message/kbm3ys2eqnjn3ipe - I can't reply via
[email protected] or Nabble):
=============
DESCRIPTION: Used to store seed URLs (both old and newly discovered).
             Initially populated with some seed URLs. The crawl controller
             picks up the seeds from this table that have status=0 (Not Visited)
             or status=2 (Visited, but ready for re-crawl) and feeds these
             seeds in batch to different crawl engines that it knows about.
SCHEMA: Column families below
{"referer_id:", "100"}, // Integer here is Max_Length
{"url:","1500"},
{"site:","500"},
{"last_crawl_date:", "1000"},
{"next_crawl_date:", "1000"},
{"create_date:","100"},
{"status:","100"},
{"strike:", "100"},
{"language:","150"},
{"topic:","500"},
{"depth:","100000"}
======================
Modified Schema & Analysis (Fuad Efendi):
My understanding is that we need to scan the whole table in order to find
records where (for instance) "last_crawl_date" is "less than a specific
point in time"... Additionally, the crawler should be polite, and the list
of URLs to fetch should be evenly distributed across domains/hosts/IPs.
A few solutions for finding such records by "last_crawl_date" have been
discussed briefly in blogs, on the mailing list, etc.:
- to have a scanner (a rough sketch of this option follows the list)
- to have an additional Lucene index
- to have a MapReduce job (multithreaded, parallel) outputting the list of URLs
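Here is the rough sketch of the first option (a plain scanner), so you can
tell me whether I have the mechanics right. It assumes last_crawl_date is
stored as a fixed-width string (e.g. zero-padded epoch seconds) so that
byte-order "LESS" matches "older than"; note that it still touches every
row, which is exactly the cost I would like to avoid:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class StaleUrlScan {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "urls"); // table name assumed
    // Example cutoff: re-crawl everything not crawled in the last 24 hours.
    long cutoff = System.currentTimeMillis() / 1000L - 24 * 3600L;
    Scan scan = new Scan();
    // Assumes last_crawl_date is a zero-padded epoch-seconds string,
    // so lexicographic "LESS" means "older than the cutoff".
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("last_crawl_date"), Bytes.toBytes(""),
        CompareOp.LESS, Bytes.toBytes(String.format("%010d", cutoff))));
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      byte[] url = r.getValue(Bytes.toBytes("url"), Bytes.toBytes(""));
      System.out.println(Bytes.toString(url));
    }
    scanner.close();
    table.close();
  }
}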
My own possible solution - I need your feedback:
====================
Simplified schema with two tables (non-transactional); a creation sketch
follows the schema:
1. URL_TO_FETCH
{"internal_link_id" + "last_crawl_date", "1000"} PRIMARY KEY
(sorted row_id),
{"url:","1500"},
2. URL_CONTENT
{"url:","1500"} PRIMARY KEY (sorted row_id),
{"site:","500"},
... ... ...,
{"language:","150"},
{"topic:","500"},
{"depth:","100000"}
Table URL_TO_FETCH is initially seeded with the root domain names and a
"dummy" last_crawl_date (a unique-per-host 'old' timestamp):
00000000000000000001 www.website1.com
00000000000000000002 www.website2.com
00000000000000000003 www.website3.com
00000000000000000004 www.website4.com
...
After successful fetch of initial URLs:
00000000010000000001 www.website1.com/page1
00000000010000000002 www.website2.com/page1
00000000010000000003 www.website3.com/page1
00000000010000000004 www.website4.com/page1
...
00000000020000000001 www.website1.com/page2
00000000020000000002 www.website2.com/page2
00000000020000000003 www.website3.com/page2
00000000020000000004 www.website4.com/page2
...
00000000030000000001 www.website1.com/page3
00000000030000000002 www.website2.com/page3
00000000030000000003 www.website3.com/page3
00000000030000000004 www.website4.com/page3
...
...
...
0000000000xxxxxxxxxx www.website1.com
0000000000xxxxxxxxxx www.website2.com
0000000000xxxxxxxxxx www.website3.com
0000000000xxxxxxxxxx www.website4.com
...
(xxxxxxxxxx is the "current time in milliseconds" - the timestamp assigned
after a successful fetch; the update itself is sketched below)
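The update I have in mind for a successful fetch (the "mutable primary key")
is sketched below: delete the old row and re-insert it under a key whose
timestamp part is the current time, so the row physically moves to the end
of the sorted table. The key layout here (a 10-digit zero-padded timestamp
in seconds - milliseconds would need a wider field - followed by a 10-digit
link id), the helper names and the single "url" family are my assumptions:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class MoveToEndOnFetch {
  // Row key: zero-padded last-crawl time (seconds) + zero-padded link id.
  // Widths and field order are my assumptions, not part of the schema above.
  static byte[] rowKey(long crawlTimeSeconds, long linkId) {
    return Bytes.toBytes(String.format("%010d%010d", crawlTimeSeconds, linkId));
  }

  // After a successful fetch: delete the old row and re-insert it under a key
  // whose timestamp part is "now", so the row moves to the end of URL_TO_FETCH.
  static void markFetched(HTable urlToFetch, long oldCrawlTime, long linkId,
      String url) throws IOException {
    urlToFetch.delete(new Delete(rowKey(oldCrawlTime, linkId)));
    Put put = new Put(rowKey(System.currentTimeMillis() / 1000L, linkId));
    put.add(Bytes.toBytes("url"), Bytes.toBytes(""), Bytes.toBytes(url));
    urlToFetch.put(put);
  }

  public static void main(String[] args) throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "URL_TO_FETCH");
    markFetched(table, 0L, 1L, "www.website1.com");
    table.close();
  }
}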
What we get: we don't need an additional Lucene index; we don't need a
MapReduce job to populate the list of items to be fetched (the way it's
done in Nutch); we don't need thousands of per-host scanners; we have a
mutable primary key; all new records are inserted at the beginning of
the table, and fetched items are moved to the end of the table.
The second (helper) table is indexed by URL:
{"url:","1500"} PRIMARY KEY (sorted row_id),
...
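Looking up everything we know about a page is then a single point Get on
URL_CONTENT, keyed by the URL (a sketch, with the same family/qualifier
assumptions as above):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class UrlContentLookup {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "URL_CONTENT");
    // Point lookup by URL row key.
    Result r = table.get(new Get(Bytes.toBytes("www.website1.com/page1")));
    byte[] language = r.getValue(Bytes.toBytes("language"), Bytes.toBytes(""));
    System.out.println(Bytes.toString(language));
    table.close();
  }
}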
Am I right? It looks cool that, at extremely low cost, I can maintain a
specific "reordering" via the mutable primary key, following the
crawl-specific requirements...
Thanks,
Fuad