row-key = url (with the host part reversed, e.g. com.xinhuanet.www:http/)

column=f:fi          : fetchInterval (the delay between re-fetches of a page)
column=f:ts          : fetchTime (when the url becomes eligible for fetching)
column=mk:_injmrk_   : markers
column=mk:dist
column=mtdt:_csh_    : metadata
column=s:s           : status (whether the url is fetched, unfetched, newly
                       injected, gone, redirected, etc.)
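
To see what the raw bytes in your scan mean, here is a rough Python sketch that decodes the f:fi and f:ts values from your output. It assumes the values are stored as big-endian integers (4 bytes for fetchInterval in seconds, 8 bytes for fetchTime in milliseconds since the epoch); that encoding is my assumption, not something confirmed above, and the helper name unescape_shell_value is made up for this example.

```python
import re
import struct
from datetime import datetime, timezone


def unescape_shell_value(text: str) -> bytes:
    """Convert an hbase-shell style value (printable ASCII plus \\xNN
    escapes) back into raw bytes. Hypothetical helper for this example."""
    out = bytearray()
    i = 0
    while i < len(text):
        m = re.match(r"\\x([0-9A-Fa-f]{2})", text[i:])
        if m:
            out.append(int(m.group(1), 16))  # escaped byte such as \x8D
            i += 4
        else:
            out.append(ord(text[i]))  # printable ASCII byte such as ' or ?
            i += 1
    return bytes(out)


# Values copied from the scan output in the quoted message.
fetch_interval_raw = unescape_shell_value(r"\x00'\x8D\x00")
fetch_time_raw = unescape_shell_value(r"\x00\x00\x01?<\x87\xBA\x0A")

# ASSUMPTION: big-endian signed int (seconds) and signed long (milliseconds).
fetch_interval = struct.unpack(">i", fetch_interval_raw)[0]
fetch_time_ms = struct.unpack(">q", fetch_time_raw)[0]
fetch_time = datetime.fromtimestamp(fetch_time_ms / 1000, tz=timezone.utc)

print(fetch_interval, "seconds =", fetch_interval // 86400, "days")
print(fetch_time_ms, "ms ->", fetch_time.date().isoformat())
```

Under that assumption f:fi comes out as 2592000 seconds (30 days), and f:ts as a time on 2013-06-13, a few seconds before the cell timestamp 1371110099941 shown in the scan.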



On Thu, Jun 13, 2013 at 6:40 AM, RS <tinyshr...@163.com> wrote:

> I do not know what is stored in HBase after injecting a website.
> When I run  $ scan 'webpage'  in the hbase shell, I get:
> hbase(main):028:0> scan '1_webpage'
> ROW                                  COLUMN+CELL
>  com.xinhuanet.www:http/             column=f:fi, timestamp=1371110099941,
> value=\x00'\x8D\x00
>  com.xinhuanet.www:http/             column=f:ts, timestamp=1371110099941,
> value=\x00\x00\x01?<\x87\xBA\x0A
>  com.xinhuanet.www:http/             column=mk:_injmrk_,
> timestamp=1371110099941, value=y
>  com.xinhuanet.www:http/             column=mk:dist,
> timestamp=1371110099941, value=0
>  com.xinhuanet.www:http/             column=mtdt:_csh_,
> timestamp=1371110099941, value=?\x80\x00\x00
>  com.xinhuanet.www:http/             column=s:s, timestamp=1371110099941,
> value=?\x80\x00\x00
> 1 row(s) in 0.0300 seconds
>
>
> So, are only 6 columns set in HBase? And what is the real data
> stored in them?
> I found that the source code has a WebPage class. I could not
> understand all of it, but I think there should be 24 fields in HBase for each
> website.
>   public static final String[] _ALL_FIELDS =
> {"baseUrl","status","fetchTime","prevFetchTime","fetchInterval","retriesSinceFetch","modifiedTime","prevModifiedTime","protocolStatus","content","contentType","prevSignature","signature","title","text","parseStatus","score","reprUrl","headers","outlinks","inlinks","markers","metadata","batchId",};
>
>
> Thanks
> HeChuan
>
>
