Re: Creating HBase table with presplits

2016-12-13 Thread Sachin Jain
Thanks Saad!! This is very similar to what I had planned to implement, i.e., to map your unknown keyspace to a known keyspace by using a hash algorithm like MD5. Then split the table. Thanks once again!! On Fri, Dec 2, 2016 at 7:18 PM, Saad Mufti wrote: > Forgot to mention

Re: Creating HBase table with presplits

2016-12-02 Thread Saad Mufti
Forgot to mention: in the above example you would presplit into 1024 regions, starting from "" to "1023" (start keys). Cheers. Saad On Fri, Dec 2, 2016 at 8:47 AM, Saad Mufti wrote: > One way to do this without knowing your data (still need some idea of size > of

Re: Creating HBase table with presplits

2016-12-02 Thread Saad Mufti
One way to do this without knowing your data (still need some idea of size of keyspace) is to prepend a fixed numeric prefix from a suitable range based on a good hash like MD5. For example, let us say you can predict your data will fit in about 1024 regions. You can decide to prepend a prefix
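A minimal sketch of the hash-prefix idea described in this message, in Python for illustration only. The 1024-region count and the "" to "1023" start keys come from the thread; the zero-padded four-digit prefix, the modulo step, and the helper names salted_key and split_keys are assumptions made for this example, not something the thread or HBase prescribes.

```python
import hashlib
from typing import List

NUM_REGIONS = 1024   # region count used in the thread's example
PREFIX_WIDTH = 4     # assumption: zero-pad so prefixes sort lexicographically


def salted_key(row_key: str, num_regions: int = NUM_REGIONS) -> str:
    """Prepend a fixed-width numeric prefix derived from the MD5 of the key."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    # Assumption: reduce the 128-bit digest to a bucket with a simple modulo.
    bucket = int.from_bytes(digest, "big") % num_regions
    return f"{bucket:0{PREFIX_WIDTH}d}{row_key}"


def split_keys(num_regions: int = NUM_REGIONS) -> List[str]:
    """Return the num_regions - 1 split points between prefix buckets.

    The first region implicitly starts at the empty key "", so the
    resulting start keys run from "" through "1023" for 1024 regions.
    """
    return [f"{b:0{PREFIX_WIDTH}d}" for b in range(1, num_regions)]
```

The split points would then be passed to whatever table-creation call you use. Note the trade-off of this design: a reader must recompute the prefix from the original key for point lookups, and a range scan over the original key order has to fan out across all prefixes.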

Re: Creating HBase table with presplits

2016-11-29 Thread Sachin Jain
Thanks Dave for your suggestions! Will let you know if I find some approach to tackle this situation. Regards On Mon, Nov 28, 2016 at 9:05 PM, Dave Latham wrote: > If you truly have no way to predict anything about the distribution of your > data across the row key space,

Re: Creating HBase table with presplits

2016-11-28 Thread Dave Latham
If you truly have no way to predict anything about the distribution of your data across the row key space, then you are correct that there is no way to presplit your regions in an effective way. Either you need to make some starting guess, such as a small number of uniform splits, or wait until
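To illustrate the "small number of uniform splits" starting guess mentioned above, here is a rough Python sketch. It assumes row keys are compared as raw bytes and that the leading key_bytes bytes are roughly uniformly distributed. The function name uniform_split_points and its parameters are hypothetical and are not part of HBase.

```python
from typing import List


def uniform_split_points(num_regions: int, key_bytes: int = 4) -> List[bytes]:
    """Evenly spaced split points over a fixed-width, big-endian byte key space."""
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    space = 256 ** key_bytes  # number of distinct prefixes of this width
    return [
        (space * i // num_regions).to_bytes(key_bytes, "big")
        for i in range(1, num_regions)
    ]
```

For example, uniform_split_points(4, key_bytes=1) gives three split points at byte values 64, 128, and 192, which yields four regions. If the real keys turn out to be skewed, such a guess can leave some regions nearly empty.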

Creating HBase table with presplits

2016-11-28 Thread Sachin Jain
Hi, I was going through an article on pre-splitting a table [0], and it mentions that it is generally best practice to presplit your table. But don't we need to know the data in advance in order to presplit it? Question: What should be the best practice when we don't know what data is going to be