Re: use hbase as distributed crawl's scheduler

James Taylor Thu, 02 Jan 2014 22:24:26 -0800

Otis,
I didn't realize Nutch uses HBase underneath. Might be interesting if you
serialized data in a Phoenix-compliant manner, as you could run SQL queries
directly on top of it.


Thanks,
James


On Thu, Jan 2, 2014 at 10:17 PM, Otis Gospodnetic <
[email protected]> wrote:

> Hi,
>
> Have a look at http://nutch.apache.org .  Version 2.x uses HBase under the
> hood.
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Fri, Jan 3, 2014 at 1:12 AM, Li Li <[email protected]> wrote:
>
> > hi all,
> >      I want to use HBase to store all URLs (crawled or not crawled).
> > Each URL will have a column named priority, which represents the
> > priority of the URL. I want to get the top N URLs ordered by priority (if
> > the priority is the same, the URL whose timestamp is earlier is preferred).
> >      With something like MySQL, my client application might look like:
> >      while true:
> >          select url from url_db where status='not_crawled'
> >              order by priority, addedTime limit 1000;
> >          do something with these urls;
> >          extract more urls and insert them into url_db;
> >      How should I design the HBase schema for this application? Is HBase
> > suitable for me?
> >      I found that in this article,
> > http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/ ,
> > they use Redis to store URLs. I think HBase originated from
> > Bigtable, and Google uses Bigtable to store web pages, so for a huge number
> > of URLs I prefer a distributed system like HBase.
> >
>
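
On the schema question above, one common approach is to encode the sort
order into the row key, since HBase keeps rows sorted by row-key bytes and
a scan returns them in that order. A minimal sketch in Python, assuming a
smaller priority number means "crawl first" (as in the SQL above) and a
hypothetical one-byte status prefix. It only builds and sorts keys locally
and does not talk to HBase; all names here are illustrative:

```python
import struct

# Hypothetical status prefix for URLs that still need crawling (assumption).
NOT_CRAWLED = b"0"

# ">IQ" = big-endian unsigned 32-bit priority + unsigned 64-bit timestamp.
# Big-endian unsigned integers compare the same way as raw bytes do.
_HEADER = struct.Struct(">IQ")


def make_row_key(priority, added_ms, url):
    """Build a key that sorts by (status, priority, addedTime, url)."""
    return NOT_CRAWLED + _HEADER.pack(priority, added_ms) + url.encode("utf-8")


def url_of(key):
    """Recover the URL from a key built by make_row_key."""
    return key[len(NOT_CRAWLED) + _HEADER.size:].decode("utf-8")


def top_n(keys, n):
    """Stand-in for a limited scan: bytes sort lexicographically."""
    return sorted(keys)[:n]


if __name__ == "__main__":
    keys = [
        make_row_key(2, 100, "http://a.example/"),
        make_row_key(1, 200, "http://b.example/"),
        make_row_key(1, 150, "http://c.example/"),
    ]
    for key in top_n(keys, 2):
        print(url_of(key))
```

With keys laid out like this, a prefix scan over the not-crawled prefix,
limited to 1000 rows, would play the role of the SQL query. How a URL
leaves that prefix after crawling (delete the row, or rewrite it under
another prefix) is left open here.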
