Secondary index

2011-03-19 Thread Wade Arnold
This is really two questions for one solution; I want to put a blog post together on this if it turns out right. I am playing with HTable.batch for multi-gets to see if I can remove my external HBase indexes. This is what I am trying to do. #1 What is the best model for a column family that is just used as a
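For reference, a minimal sketch of a multi-get via HTable.batch in the 0.90-era client API; the table name and row keys below are made up for illustration, not taken from the thread:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Row;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MultiGetSketch {
      public static void main(String[] args) throws Exception {
        // "index_table" is a hypothetical secondary-index table.
        HTable table = new HTable(HBaseConfiguration.create(), "index_table");
        List<Row> actions = new ArrayList<Row>();
        for (String key : new String[] { "row1", "row2", "row3" }) {
          actions.add(new Get(Bytes.toBytes(key)));
        }
        Object[] results = new Object[actions.size()];
        table.batch(actions, results);   // fetches all requested rows in batched RPCs
        for (Object r : results) {
          System.out.println((Result) r);
        }
        table.close();
      }
    }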

Re: hbase insertion optimisation:

2011-03-19 Thread Ted Yu
The timestamp is in every key-value pair. Take a look at this method in Scan: public Scan setTimeRange(long minStamp, long maxStamp) Cheers
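A minimal sketch of such a time-range scan; the table name, family and timestamps are illustrative, not from the thread:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimeRangeScanSketch {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "events");
        long dayStart = 1300492800000L;              // 2011-03-19 00:00 UTC
        long dayEnd = dayStart + 24L * 3600 * 1000;  // next midnight
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("d"));
        scan.setTimeRange(dayStart, dayEnd);         // [minStamp, maxStamp)
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
          // process cells written during that day
        }
        scanner.close();
        table.close();
      }
    }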

Re: hbase insertion optimisation:

2011-03-19 Thread Oleg Ruchovets
Good point, let me explain the process. We chose the keys _ because after insertion we run scans and want to analyse the data related to a specific date. Can you provide more details on using hashing, and how can I scan HBase data for a specific date using it? Oleg.
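One common way to combine hashing with per-date scans is to prefix each key with a small hash bucket and then scan every bucket for the wanted date. The sketch below is an illustration under assumed key layout, bucket count and separators, not code from this thread:

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedDateKeySketch {
      static final int BUCKETS = 16;

      // e.g. "07|20110319|recordId" -- writes are spread over 16 key ranges
      static byte[] rowKey(String date, String recordId) {
        int bucket = (recordId.hashCode() & 0x7fffffff) % BUCKETS;
        return Bytes.toBytes(String.format("%02d|%s|%s", bucket, date, recordId));
      }

      // To read one date back, scan that date's slice of every bucket.
      static void scanDate(HTable table, String date) throws Exception {
        for (int b = 0; b < BUCKETS; b++) {
          Scan scan = new Scan(
              Bytes.toBytes(String.format("%02d|%s", b, date)),
              Bytes.toBytes(String.format("%02d|%s~", b, date)));  // '~' sorts after '|'
          ResultScanner scanner = table.getScanner(scan);
          for (Result r : scanner) {
            // analyse rows belonging to this date
          }
          scanner.close();
        }
      }
    }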

Re: Newbie question concerning schema design

2011-03-19 Thread Niels Nuyttens
Thank you both for your replies. I took a look at the information you pointed me to, and it has already helped me quite a lot. For now, I still have these questions: How do I deal with 'nested' one-to-one relationships? I'm talking about the following case: a patient has many episodes

Re: hbase insertion optimisation:

2011-03-19 Thread Ted Yu
I guess you chose the date prefix for query considerations. You should introduce hashing so that the row keys are not clustered together.

hbase insertion optimisation:

2011-03-19 Thread Oleg Ruchovets
We want to insert into HBase on a daily basis (HBase 0.90.1, Hadoop append). Currently we have ~10 million records per day. We use map/reduce to prepare the data and write it to HBase in chunks (5000 puts per chunk). The whole process takes 1h 20 minutes. Some tests verified that wri
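A minimal sketch (an assumption about the usual client-side tuning, not the poster's code) of batching puts with auto-flush off and a larger write buffer in the 0.90 API; table, family and qualifier names are illustrative:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedPutSketch {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "daily_data");
        table.setAutoFlush(false);                  // buffer puts on the client
        table.setWriteBufferSize(8 * 1024 * 1024);  // flush roughly every 8 MB
        List<Put> chunk = new ArrayList<Put>(5000);
        for (int i = 0; i < 10000; i++) {           // stand-in for the real record stream
          Put p = new Put(Bytes.toBytes("row-" + i));
          p.add(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes("value-" + i));
          chunk.add(p);
          if (chunk.size() == 5000) {
            table.put(chunk);
            chunk.clear();
          }
        }
        if (!chunk.isEmpty()) table.put(chunk);
        table.flushCommits();                       // push any remaining buffered puts
        table.close();
      }
    }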

Re: hbase heap size

2011-03-19 Thread Oleg Ruchovets
Thank you St.Ack. The question is about setting the heap size for HBase. As I understand it, there are 3 processes: HBase Master, HBase RegionServer, and ZooKeeper. What heap size should I set for these processes? I don't remember where I saw 4000m recommended, but does it mean that all

Re: hbase heap size

2011-03-19 Thread Stack
See this section in your hbase-env.sh:

    # export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
    # export HBASE_MASTER_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10101 -javaagent:lib/HelloWorldAgent.jar"
    # export HBASE_REGIO
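The same per-daemon _OPTS hooks can carry heap flags, which is one way to give each process its own -Xmx instead of the single HBASE_HEAPSIZE default. The values below are only a sketch, assuming the per-daemon -Xmx (appended later on the command line) takes precedence over the default:

    export HBASE_HEAPSIZE=4000
    export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -Xmx2000m"
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xmx4000m"
    export HBASE_ZOOKEEPER_OPTS="$HBASE_ZOOKEEPER_OPTS -Xmx1000m"

HBASE_ZOOKEEPER_OPTS only applies to the ZooKeeper that HBase itself manages; a standalone ZooKeeper is sized through its own configuration.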

hbase heap size

2011-03-19 Thread Oleg Ruchovets
Hi, we started our tests on a cluster (HBase 0.90.1, Hadoop append). I set HBASE_HEAPSIZE to 4000m in hbase-env.sh and got 3 processes, each with a 4000m heap. My questions are: 1) What is the way to set the heap size separately for these processes? In case I want to give ZooKeeper less h

Re: Newbie question concerning schema design

2011-03-19 Thread Stack
There is also this small section in our book: http://hbase.apache.org/book/schema.html It refers to a useful paper by Ian Varley on modelling in non-RDBMS databases. Sounds like it would be good preparatory reading for the project you've just started. St.Ack

Re: File formats in Hadoop

2011-03-19 Thread Weishung Chung
Thank you for the info. HFile looks interesting; I can't wait to dig into the code and get a better understanding of HFile!

Re: Newbie question concerning schema design

2011-03-19 Thread Ted Yu
See: http://search-hadoop.com/m/zbKmE14o0Js/wide+tall+hbase+table&subj=Re+Parent+child+relation+go+vertical+horizontal+or+many+tables+ You can also search for related discussion on tall vs. wide tables.

Re: File formats in Hadoop

2011-03-19 Thread Harsh J
Hello, On Sat, Mar 19, 2011 at 9:31 PM, Weishung Chung wrote: > Is all data written through Hadoop, including that from HBase, saved in the > above formats? It seems like SequenceFile is in key-value-pair format. HBase provides its own format called HFile. See http://hbase.apache.org/apidocs/org/

schema WAS: Stargate and Hbase

2011-03-19 Thread Ted Yu
sreejith: I'll leave your second question to other experts. Let me try to answer the schema question. You didn't mention how the URLs and keywords scale (there are ~1 trillion URLs in the world), so I base my suggestion on what you outlined. First, you need to use a hash/index to represent each URL. You can then
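A minimal sketch (my own illustration, not from the reply) of representing each URL by a fixed-length hash suitable as a row key, here MD5 via java.security:

    import java.security.MessageDigest;

    public class UrlKeySketch {
      // 32-char hex digest: uniform length, spread evenly over the key space.
      static String urlRowKey(String url) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(url.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder(32);
        for (byte b : digest) {
          hex.append(String.format("%02x", b));
        }
        return hex.toString();
      }

      public static void main(String[] args) throws Exception {
        System.out.println(urlRowKey("http://example.org/page"));
      }
    }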

File formats in Hadoop

2011-03-19 Thread Weishung Chung
I am browsing through the hadoop.io package and was wondering what other file formats are available in Hadoop besides SequenceFile and TFile. Is all data written through Hadoop, including that from HBase, saved in the above formats? It seems like SequenceFile is in key-value-pair format. Thank y
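To make the key-value-pair layout concrete, here is a minimal sketch of writing and reading a SequenceFile with the classic API of that era; the path and the data are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/example.seq");

        // Write a few key-value pairs.
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
        writer.append(new Text("alpha"), new IntWritable(1));
        writer.append(new Text("beta"), new IntWritable(2));
        writer.close();

        // Read them back in the same key-value order.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        Text key = new Text();
        IntWritable val = new IntWritable();
        while (reader.next(key, val)) {
          System.out.println(key + " -> " + val);
        }
        reader.close();
      }
    }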

Newbie question concerning schema design

2011-03-19 Thread Niels Nuyttens
Hi all, I need a database that scales to large datasets and high throughput, and HBase seemed like the way to go. However, while designing my database schema I started to doubt my choice, due to the conversion of the current relational schema to a NoSQL variant. I can't get my head around the efficient

Re: Bulk Load question.

2011-03-19 Thread Harsh J
Have you tried the mix of importtsv + completebulkload? Would that work for you?
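A sketch of the suggested two-step flow, assuming tab-separated input already in HDFS; the jar name, table name, column mapping and paths are placeholders:

    # 1. Parse the TSV input and write HFiles to a staging directory
    hadoop jar hbase-0.90.1.jar importtsv \
      -Dimporttsv.columns=HBASE_ROW_KEY,f:col1,f:col2 \
      -Dimporttsv.bulk.output=/tmp/bulk_out mytable /user/vivek/input

    # 2. Move the generated HFiles into the live table
    hadoop jar hbase-0.90.1.jar completebulkload /tmp/bulk_out mytable

Without -Dimporttsv.bulk.output, importtsv writes through the normal put path instead of producing HFiles for bulk loading.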

Bulk Load question.

2011-03-19 Thread Vivek Krishna
I have around 20 GB of data to be dumped into an HBase table. Initially, I had a simple Java program to put the values in batches of (5000-1) records. I tried concurrent inserts, and each insert took about 15 seconds to write, which is very slow and was taking ages. The next approach was to use i