row-key = url (stored with the host reversed, e.g. com.xinhuanet.www:http/)
column=f:fi : fetchInterval (the delay between re-fetches of a page)
column=f:ts : fetchTime (indicates when the url will be eligible for fetching)
column=mk:_injmrk_ : markers
column=mk:dist
column=mtdt:_csh_ : metadata
column=s:s : status (whether the url is fetched, unfetched, newly injected, gone, redirected, etc.)
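
The values in the scan output look opaque because the hbase shell prints raw bytes, showing printable ones as ASCII (so ' is 0x27, ? is 0x3F and < is 0x3C). As a rough sketch, assuming the cells hold big-endian binary numbers (a 4-byte int for f:fi, an 8-byte long for f:ts, a 4-byte float for the other two), which is an assumption about the serialization and not something I checked in the mapping file, they can be decoded like this. The variable names are made up for the example.

```python
import struct
from datetime import datetime, timezone

# Raw cell values copied from the scan output in the quoted message.
fi_raw = b"\x00'\x8d\x00"                # column f:fi
ts_raw = b"\x00\x00\x01?<\x87\xba\x0a"   # column f:ts
csh_raw = b"?\x80\x00\x00"               # column mtdt:_csh_
s_raw = b"?\x80\x00\x00"                 # column s:s (same four bytes)

# Assumption: big-endian encodings (">" in struct notation).
fetch_interval = struct.unpack(">i", fi_raw)[0]   # 4-byte signed int
fetch_time_ms = struct.unpack(">q", ts_raw)[0]    # 8-byte signed long
csh_value = struct.unpack(">f", csh_raw)[0]       # 4-byte IEEE 754 float
s_value = struct.unpack(">f", s_raw)[0]           # 4-byte IEEE 754 float

# Assumption: fetchTime is milliseconds since the Unix epoch.
fetch_time = datetime.fromtimestamp(fetch_time_ms / 1000, tz=timezone.utc)

print("fetchInterval:", fetch_interval, "=", fetch_interval // 86400, "days if in seconds")
print("fetchTime:", fetch_time_ms, "->", fetch_time.isoformat())
print("mtdt:_csh_ as float:", csh_value)
print("s:s as float:", s_value)
```

Read this way, f:fi is 2592000 (30 days if the unit is seconds) and f:ts is 1371110095370, a few seconds before the cell timestamp 1371110099941, which fits an injected url that is immediately due for fetching. The bytes in mtdt:_csh_ and s:s are identical and give 1.0 when read as a float; how Nutch itself interprets each column depends on the Gora mapping in use.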

On Thu, Jun 13, 2013 at 6:40 AM, RS <tinyshr...@163.com> wrote:

> I do not know what is stored in HBase after injecting a website.
> When I run scan 'webpage' in the hbase shell, I get:
>
> hbase(main):028:0> scan '1_webpage'
> ROW                      COLUMN+CELL
>  com.xinhuanet.www:http/ column=f:fi, timestamp=1371110099941, value=\x00'\x8D\x00
>  com.xinhuanet.www:http/ column=f:ts, timestamp=1371110099941, value=\x00\x00\x01?<\x87\xBA\x0A
>  com.xinhuanet.www:http/ column=mk:_injmrk_, timestamp=1371110099941, value=y
>  com.xinhuanet.www:http/ column=mk:dist, timestamp=1371110099941, value=0
>  com.xinhuanet.www:http/ column=mtdt:_csh_, timestamp=1371110099941, value=?\x80\x00\x00
>  com.xinhuanet.www:http/ column=s:s, timestamp=1371110099941, value=?\x80\x00\x00
> 1 row(s) in 0.0300 seconds
>
> So, are only 6 columns set in HBase? And what is the real data stored in them?
> I found that there is a WebPage class in the source code. I could not
> understand all of it, but I think there should be 24 fields in HBase for
> each website.
>
> public static final String[] _ALL_FIELDS =
> {"baseUrl","status","fetchTime","prevFetchTime","fetchInterval","retriesSinceFetch","modifiedTime","prevModifiedTime","protocolStatus","content","contentType","prevSignature","signature","title","text","parseStatus","score","reprUrl","headers","outlinks","inlinks","markers","metadata","batchId",};
>
> Thanks
> HeChuan