Thanks for the explanation, Stack. Using my threaded client
I got a throughput of 6000 inserts/sec. Let me use and modify
the code you posted on the wiki to see if I can get better
throughput.
I'll write to the list again once I have some performance data.
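For reference, the rough shape of the client is below. This is a
simplified sketch against the 0.2-era client API, not my actual code;
the thread count, table name, and column names are placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

// Sketch of a threaded insert client. Names and sizes are
// illustrative only.
public class ThreadedInserter {
  public static void main(String[] args) throws Exception {
    final int numThreads = 10;           // placeholder; tune per cluster
    final int insertsPerThread = 10000;  // placeholder
    Thread[] workers = new Thread[numThreads];
    for (int t = 0; t < numThreads; t++) {
      final int id = t;
      workers[t] = new Thread(new Runnable() {
        public void run() {
          try {
            // Each thread opens its own HTable; HTable instances
            // are not safe to share across threads.
            HTable table = new HTable(new HBaseConfiguration(), "X");
            for (int i = 0; i < insertsPerThread; i++) {
              BatchUpdate update = new BatchUpdate("row-" + id + "-" + i);
              update.put("data:value", ("value-" + i).getBytes());
              table.commit(update);
            }
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
      workers[t].start();
    }
    for (Thread w : workers) {
      w.join();
    }
  }
}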
-Ankur
-----Original Message-----
Hi
I am using HBase 0.2 devel from trunk and Hadoop 0.17, and I lost all
tables after restarting HBase.
I do:
1) start hadoop dfs
2) start hbase
3) create table X
4) make inserts into table X
5) select from X - the data inserted in 4) is there (everything is ok)
6) stop hbase
7) stop hadoop dfs
Looks like you are crawling the web. What crawler are you using?
Could you write directly into HBase from the crawler?
St.Ack
Goel, Ankur wrote:
Thanks for the explanation, Stack. Using my threaded client
I got a throughput of 6000 inserts/sec.
The pattern of events as you list them is the correct way to bring up
and down an HBase cluster.
Is this being run on a single node or on multiple machines? What command
are you using to start HBase? (bin/start-hbase.sh is what I use.) Is
there anything interesting in the HBase logs for either the master or
the region servers?
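One thing worth ruling out (only a guess, since your configs are not
shown): hbase.rootdir resolving to the local filesystem. By default it
can land under /tmp, where data will not survive cleanup or restarts.
A quick sketch to see what your client configuration resolves it to,
assuming the 0.2-era API:

import org.apache.hadoop.hbase.HBaseConfiguration;

// Sketch: print where the client-side config thinks the HBase root is.
// If this resolves to somewhere under the local /tmp rather than your
// DFS, tables written there can disappear across restarts.
public class PrintRootDir {
  public static void main(String[] args) {
    HBaseConfiguration conf = new HBaseConfiguration();
    System.out.println("hbase.rootdir = " + conf.get("hbase.rootdir"));
  }
}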
I am indeed crawling the web, but only the sites
that are present in my seedlist. The crawler used
here is heritrix 2.0 -
http://webteam.archive.org/confluence/display/Heritrix/2.0.0.
I developed a Heritrix-specific HBase writer that can be integrated with
Heritrix to write the crawled content directly into HBase.
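The writer itself isn't attached here, but the HBase-facing half looks
roughly like the following. This is a simplified sketch against the
0.2-era client API; the class name and column names are placeholders,
and the Heritrix hook-up is omitted:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

// Sketch: the HBase half of a crawler writer. The crawler would call
// write(...) once per fetched URI; column names are illustrative.
public class HBaseContentWriter {
  private final HTable table;

  public HBaseContentWriter(String tableName) throws Exception {
    table = new HTable(new HBaseConfiguration(), tableName);
  }

  // Store one fetched page, keyed by its URL.
  public void write(String url, byte[] content) throws Exception {
    BatchUpdate update = new BatchUpdate(url);
    update.put("content:raw", content);
    table.commit(update);
  }
}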
I have some familiarity with that crawler.
Tell us more about your writer. Is it proprietary? If not, can we get
it into a place where others could use it if they want?
Thanks,
St.Ack
Goel, Ankur wrote:
I am indeed crawling the web, but only the sites
that are present in my seedlist.
Hi Bryan,
Here is the sample schema I have (it looks more like an RDBMS schema, I
know).
TABLE: seed_list
DESCRIPTION: Used to store seed URLs (both old and newly discovered).
Initially populated with some seed URLs. The crawl controller
picks up the seeds from
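Creating such a table with the 0.2-era admin API looks roughly like
this (a sketch; the column family name here is just a placeholder):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Sketch: create the seed_list table with one column family.
// The family name "info:" is illustrative, not the real schema.
public class CreateSeedList {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("seed_list");
    desc.addFamily(new HColumnDescriptor("info:"));
    admin.createTable(desc);
  }
}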