btw, how do I implement checkAndPut semantics in Phoenix with SQL?
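A minimal sketch of one way to get this put-if-absent behavior in SQL, assuming a Phoenix release that supports atomic upserts (the ON DUPLICATE KEY clause), which may be newer than the versions discussed in this thread. It checks only the table's primary key, so it fits a layout where url is the row key (schema 1 below); the literal values are placeholders:

    -- skip the write if a row with this primary key already exists
    -- (checkAndPut-style put-if-absent); url is assumed to be the PK here
    UPSERT INTO url_db (url, host, status, priority, add_time)
    VALUES ('http://www.google.com/index.html', 'www.google.com', 0, 0, CURRENT_DATE())
    ON DUPLICATE KEY IGNORE;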
On Thu, Jan 16, 2014 at 4:54 PM, Li Li <[email protected]> wrote:
> I have read this. But my question is slightly different from the previous
> one. I have four table operations:
> 1. check whether a url already exists. if it exists, do nothing;
> otherwise insert it. I can implement this with a get followed by a
> checkAndPut in hbase.
> 2. select distinct hosts (maybe in the future group them and count them)
> 3. select urls where host=? and status=? order by priority,add_time limit ?
> 4. update table set status=1 where url=?
>
> If I follow your advice, I should create the table and index like this:
>
> CREATE TABLE url_db (
>     status TINYINT,
>     priority INTEGER NOT NULL,
>     added_time DATE,
>     host VARCHAR NOT NULL,
>     url VARCHAR NOT NULL,
>     CONSTRAINT pk PRIMARY KEY (host, status, priority, added_time, url));
>
> CREATE INDEX my_index ON url_db(url);
>
> performance analysis (a SQL sketch of the four operations follows just
> below this message):
> 1. checking whether a url exists uses the secondary index, so it's fast
> 2. selecting hosts is fast via a rowkey range scan?
> 3. selecting urls by host and status, ordered by priority and
> add_time, is fast via a carefully designed rowkey range scan
> 4. updating status goes through the secondary index
>
> my question is: what exactly will phoenix do?
> my guess is that updating the status will first look up the rowkey in the
> secondary index table to find the original rowkey in url_db, then do a
> Delete and a Put on url_db and update the secondary index with a Put.
> am I right?
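A rough sketch of the four operations above as Phoenix SQL against the url_db table just quoted. The literal values are placeholders, and since Phoenix has no UPDATE statement, the status change is written as an UPSERT plus a DELETE of the old row (the old status value 0 is assumed to be known by the caller):

    -- 1. check whether the url already exists (served by my_index on url)
    SELECT url FROM url_db
    WHERE url = 'http://www.google.com/index.html';

    -- 2. distinct hosts; host is the leading rowkey column
    SELECT DISTINCT host FROM url_db;

    -- 3. top-priority urls for one host and status; with the rowkey order
    --    (host, status, priority, added_time, url) this should be a range
    --    scan that already returns rows in the requested order
    SELECT url FROM url_db
    WHERE host = 'www.google.com' AND status = 0
    ORDER BY priority, added_time
    LIMIT 100;

    -- 4. "update status": status is part of the rowkey, so changing it means
    --    writing a new row and deleting the old one (two statements)
    UPSERT INTO url_db (host, status, priority, added_time, url)
    SELECT host, 1, priority, added_time, url
    FROM url_db
    WHERE url = 'http://www.google.com/index.html' AND status = 0;

    DELETE FROM url_db
    WHERE url = 'http://www.google.com/index.html' AND status = 0;

This also illustrates the question above: because status sits in the rowkey (and, via the appended data-table key, in the index rows of my_index), a status change amounts to a delete plus an insert in both the data table and the index.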
> On Thu, Jan 16, 2014 at 4:19 PM, James Taylor <[email protected]> wrote:
>> Hi,
>> That's some really good information about your use case. Here's an idea I
>> had for your similar question on the HBase mailing list:
>> http://mail-archives.apache.org/mod_mbox/hbase-user/201401.mbox/%3CCAG_TOPDpp%2BTWm0wZ3es6dPFwtn5x1grzrrXA-YsSHdXzRnVf%2BQ%40mail.gmail.com%3E
>>
>> Thanks,
>> James
>>
>>
>> On Wed, Jan 15, 2014 at 3:37 AM, Li Li <[email protected]> wrote:
>>>
>>> 1. goal
>>>     I want to implement a distributed vertical (topical) crawler. it
>>> will only store webpages about a certain topic; I will have a classifier
>>> to do this.
>>>     I estimate that the number of webpages to be stored is in the
>>> tens of millions (maybe hundreds of millions as time goes on).
>>>     as a vertical crawler, it should crawl the pages most likely
>>> related to my target topics, so I need a frontier that can dispatch
>>> tasks by priority.
>>>     for now the priority scheme is simple, but we hope it can handle
>>> complicated priority algorithms later.
>>> 1.1 host priority
>>>     we should crawl many hosts at the same time rather than a single
>>> host. initially, all hosts should be crawled equally, but over time we
>>> can calculate the priority of a host dynamically.
>>>     e.g. we can control the crawl speed for a certain host based on its
>>> crawl history (some sites will ban our crawler if we use too many
>>> concurrent threads against them), or we can adjust the priority of a
>>> host by whether it is relevant to our topic (we can calculate the
>>> relevance of crawled pages).
>>> 1.2 enqueue time
>>>     webpages enqueued first should get higher priority
>>> 1.3 depth
>>>     webpages with a small depth should get higher priority (something
>>> like a BFS traversal)
>>> 1.4 other page priorities
>>>     e.g. page rank, list page/detail page ...
>>>
>>> 2. architecture
>>>     see picture: http://www.flickr.com/photos/114261973@N07/
>>> 2.1 Seed Discover
>>>     use google or other websites to find some seed urls
>>> 2.2 Url DB
>>>     a distributed DB to store all metadata about urls (that's the
>>> most hbase-related part)
>>> 2.3 Task Scheduler
>>>     as described before, the task scheduler selects the top N priority
>>> webpages and dispatches them to the fetcher cluster
>>> 2.4 Message Queues
>>>     we use ActiveMQ to decouple the different modules and also for load
>>> balancing
>>> 2.5 Fetchers
>>>     download webpages
>>> 2.6 WebPageDB
>>>     stores crawled webpages and the metadata extracted from them (such
>>> as title, content, pub_time, author, etc.). we consider using hbase for
>>> this too.
>>> 2.7 Extractors
>>>     use the classifier to judge whether a page is related to our
>>> topics, extract metadata from it, and store the result in WebPageDB
>>>
>>>
>>> 3. main challenges
>>> 3.1 Url DB
>>>     as described before, this store (maybe hbase) should support
>>> sophisticated priority algorithms. we also use it to avoid crawling a
>>> webpage more than once.
>>> 3.2 task scheduler
>>>     how to achieve the goals above
>>>
>>> 4. current solution
>>> 4.1 use hbase (maybe together with phoenix) to store urls (we have not
>>> done the schema design yet, hoping to get some advice here)
>>> 4.2 scheduler algorithm
>>>     int batchSize=10000;
>>>     // dispatch batchSize tasks across hosts by host priority
>>>     Map<String,Integer> hostCount=...
>>>     // select the top-priority urls from each host
>>>     List<String> toBeCrawledUrls=new ArrayList<String>(batchSize);
>>>     for(Entry<String,Integer> entry:hostCount.entrySet()){
>>>         // select the top N priority urls for a given host
>>>         List<String> urls=selectTopNUrlsFromHost(
>>>                 entry.getKey(), entry.getValue());
>>>         toBeCrawledUrls.addAll(urls);
>>>     }
>>>     // dispatch these urls to the message queue
>>>     // monitor the message queue status
>>>     // when the queue is fully (or 3/4) consumed, go back to the top and
>>>     // dispatch another batch of urls
>>>
>>> 5. table columns
>>> 5.1 url varchar, e.g. http://www.google.com/index.html
>>> 5.2 status tinyint 0: not_crawled 1: crawling 2: success 3: fail
>>> 5.3 host varchar, e.g. www.google.com
>>> 5.4 depth tinyint
>>> 5.5 crawl_time date (what's the difference between date, time and timestamp?)
>>> 5.6 add_time date
>>> 5.7 priority int
>>>
>>> 6. table schema 1
>>>     primary key: url
>>>     index (host, status) including (priority, add_time, url)
>>>     (a Phoenix DDL sketch for this layout appears below, after section 7)
>>> 6.1 host
>>>     select distinct host from url_db;
>>> 6.2 dedup
>>>     select url from url_db where url='http://www.google.com/index.html'
>>>     btw, how can I use upsert to avoid updating the previous version?
>>> something like the mysql statement insert into ... on duplicate key ignore
>>> 6.3 select
>>>     select url from url_db where (host,
>>> status)=('www.google.com',0) order by priority,add_time limit 100;
>>> 6.4 update status
>>>     update url_db set status=1 where
>>> url='http://www.google.com/index.html'
>>>     performance analysis:
>>>     1. select host can use a rowkey scan?
>>>     2. dedup by url is very quick because it's the rowkey
>>>     3. select is quick because of the secondary index
>>>     4. update will not remove a row (compared with schema 2) but
>>> will Put new values in the main table
>>>     5. but I guess the update will still cause a row deletion and
>>> insertion (in the index table)
>>>
>>> 7. table schema 2
>>>     primary key: url
>>>     index(host) column_family(status) including (priority, add_time,
>>> url)
>>>     is this valid in phoenix? I mean, create an index on host and
>>> create a column family by status?
>>>
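For table schema 1 above, a possible Phoenix DDL sketch; the column types follow section 5, and the index name idx_host_status is made up here. The INCLUDE clause adds covered columns to a secondary index, and the data table's primary key (url) is carried in every index row automatically, so it does not need to be listed:

    -- schema 1: url is the row key, everything else is a regular column
    CREATE TABLE url_db (
        url        VARCHAR NOT NULL,
        status     TINYINT,
        host       VARCHAR,
        depth      TINYINT,
        crawl_time DATE,
        add_time   DATE,
        priority   INTEGER,
        CONSTRAINT pk PRIMARY KEY (url));

    -- covered index for: select url ... where host = ? and status = ?
    -- order by priority, add_time
    -- priority and add_time are only covered, not indexed, so the ORDER BY
    -- still needs a sort; indexing (host, status, priority, add_time)
    -- instead would bake the sort order into the index rowkey
    CREATE INDEX idx_host_status ON url_db (host, status)
        INCLUDE (priority, add_time);

With this layout, dedup by url is a direct rowkey lookup, and the host/status query can be answered entirely from the covered index.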
>>> 8. table schema 3
>>>     primary key: (host, status, priority, add_time)
>>>     index(url)
>>>     all sql statements are the same as above.
>>>
>>>     performance analysis:
>>>     1. select host can use a rowkey scan?
>>>     2. select is fast because of the rowkey
>>>     3. dedup goes through the secondary index
>>>     4. update will cause a deletion and insertion in the main table
>>>     5. I guess update will also cause an index table update
>>>
>>> 9. schema 4
>>>     use a dedicated host table to store only hosts; the host count is
>>> far smaller than the url count.
>>>     but using another table requires client code to keep the two
>>> consistent (a sketch of such a host table appears below, after this
>>> message). also, in the future we may need queries like:
>>>     select host,count(host) as cc from url_db group by host order by cc
>>> desc;
>>>
>>> hope you can give me some advice, thanks.
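For schema 4, a rough sketch of what a dedicated host table could look like in Phoenix SQL; the table and column names (host_stats, url_count) are invented for illustration, and keeping it consistent with url_db would still be the client code's job, as noted above:

    -- hypothetical side table with one row per host
    CREATE TABLE host_stats (
        host      VARCHAR NOT NULL,
        url_count BIGINT,
        priority  INTEGER,
        CONSTRAINT pk PRIMARY KEY (host));

    -- periodically rebuild the counts from url_db; UPSERT SELECT
    -- overwrites existing rows that have the same host key
    UPSERT INTO host_stats (host, url_count)
    SELECT host, COUNT(*) FROM url_db GROUP BY host;

    -- the ad-hoc aggregation from section 9, run directly against url_db
    SELECT host, COUNT(host) AS cc
    FROM url_db
    GROUP BY host
    ORDER BY cc DESC;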
