[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-08-01 Thread Ian Michael Gumby (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650544#comment-14650544
 ] 

Ian Michael Gumby commented on HBASE-12853:
---

Wow, 

Rather than try to stay focused on the issue of the Jira, you talk about 
contributing to open source. 

I can tell you the answer, I can even explain it to you, but you still wouldn't 
get it. 



> distributed write pattern to replace ad hoc 'salting'
> -
>
> Key: HBASE-12853
> URL: https://issues.apache.org/jira/browse/HBASE-12853
> Project: HBase
>  Issue Type: New Feature
>Reporter: Michael Segel 
> Fix For: 2.0.0
>
>
> In reviewing HBASE-11682 (Description of Hot Spotting), one of the issues is 
> that while 'salting' alleviated  regional hot spotting, it increased the 
> complexity required to utilize the data.  
> Through the use of coprocessors, it should be possible to offer a method 
> which distributes the data on write across the cluster and then manages 
> reading the data returning a sort ordered result set, abstracting the 
> underlying process. 
> On table creation, a flag is set to indicate that this is a parallel table. 
> On insert into the table, if the flag is set to true then a prefix is added 
> to the key.  e.g. - or  server # is an integer between 1 and the number of
> region servers defined.  
> On read (scan) for each region server defined, a separate scan is created 
> adding the prefix. Since each scan will be in sort order, it's possible to 
> strip the prefix and return the lowest value key from each of the subsets. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-07-31 Thread Michael Segel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649476#comment-14649476
 ] 

Michael Segel  commented on HBASE-12853:



Andrew, 

As you point out, it was a trivial solution and that was the point I was trying 
to make, that you took the time to work on it. 


As I've said repeatedly, I can't provide patches because the risks outweigh the 
benefits. (Let's leave it at that.) 


I guess at the time I wrote this enhancement request, I could have raised this 
issue with a certain vendor's support team, then suggested that a certain 
person call a certain person to ask that this get done... but that would have 
been a waste of calling in a favor. 

Again, either the committers or the community see the benefits and merits in 
doing this... or you don't. It was a five-minute thought that solved a problem 
and wasn't worth the effort of diagramming out on a whiteboard. 




[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-07-31 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649455#comment-14649455
 ] 

Lars Hofhansl commented on HBASE-12853:
---

Most committers have well paying jobs and won't risk leaving them either. The 
employer also would be exposed to the very same risk (amplified, because 
there's more money to make).
I have personally had many discussions with our legal team(s) about this. So I 
do know what I am talking about. 

Most people fail to calculate the cost of legal risk and assume it to be 
infinite.

I get consulting gigs offered all the time _because_ I commit to open source 
(since I am employed I cannot accept such gigs, but that's not the point here). 
It's all about how you set it up with your customers. 

Sorry you feel this way. Contributing is what makes open source work. If 
everybody thought like you there would be no open source.

In any case this is not the right place to discuss this.




[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-07-31 Thread Michael Segel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649304#comment-14649304
 ] 

Michael Segel  commented on HBASE-12853:


Lars, 

If I respond, I'll be called argumentative. If I don't respond, it will leave 
readers with the incorrect perception. 

Again, Apache does not indemnify the contributor, leaving you with risk.  You 
need to balance that risk against the benefits of contributing. 

It's a lot simpler to say "Apache won't indemnify me..." than to continually 
have to write out long paragraphs as to why and what that really means. 
Either you get it or you don't.  Most of the committers here don't run their 
own shop or have to deal with the business side of software. 






[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-07-30 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648174#comment-14648174
 ] 

Andrew Purtell commented on HBASE-12853:


bq.  Either you find value to the suggestion or not. That is your call. But 
please note that Andrew P. worked on 
https://issues.apache.org/jira/browse/HBASE-13044. (Also relatively trivial)

Not sure I understand the relevance. For the record, I filed that issue after a 
brief encounter with Jim Scott of MapR over on the OpenTSDB list. He spoke of 
customers implementing coprocessors that exist solely to prevent loading of any 
other coprocessors, so I thought we could do something simple to make that 
unnecessary and volunteered time to do it. Strictly speaking, I didn't have to 
but the conversation was respectful and interesting and I felt like 
volunteering some of my evening that evening rather than spend it with family.

The committer role at Apache is not about requiring individuals to implement 
unfunded mandates from random folks. On the other hand, we are expected to try 
and assess all contributions in the form of a patch in the most impartial 
manner possible. If for whatever reason you are not in a position to provide a 
patch, that's fine, but understand you are speaking to a community of 
volunteers who have work and personal lives and are already being super 
generous just for showing up here from time to time. You'll have to find a way 
to convince them they should volunteer their time to help you. Sometimes under 
the best of circumstances that just won't happen. An abrasive communication 
style - for example, repeated comments about "lack[ing] the patience to suffer 
fools" - dooms you to failure out of the gate. Don't be surprised at your lack 
of results.



[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-07-30 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647980#comment-14647980
 ] 

Lars Hofhansl commented on HBASE-12853:
---

bq. As I have stated repeatedly, I am unable to contribute to certain Apache 
projects unless Apache is willing to indemnify me. (Which they are not.) 

Don't be ridiculous. It is always your task to clear with all possible IP 
owners before you contribute anything under any license.
If you have something to contribute show us the code or even just a spec, 
otherwise it's just useless noise; if not just leave it instead of now blaming 
the committers with specious excuses why you can't do it.

I'm going to close this.



[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-07-29 Thread Michael Segel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646617#comment-14646617
 ] 

Michael Segel  commented on HBASE-12853:


@Anoop,

Yes, that is correct. 
It was my misunderstanding of the client/server break. 
(I program to the APIs and don't look at the source code.) 

I believe I did mention this after your last post correcting my mistake.

Again, this is pretty simple... you're overloading the scan() so that it first 
does a check to see if the underlying table is bucketed or not.  A simple way 
to do this is to check the number of buckets. If it's 0, then it's not bucketed 
and you just run the scan like normal.  If it is a positive integer, you would 
then parallelize the scan.
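
A minimal Python sketch of that dispatch, for illustration only. Nothing here 
is HBase client API: plain_scan, scan, and the dict-based table with its 
num_buckets field are hypothetical stand-ins, the rows are an in-memory sorted 
list, and a one-byte bucket prefix on the stored keys is an assumption. 

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def plain_scan(rows, start, stop):
    """Stand-in for an ordinary range scan over sorted (key, value) rows."""
    return [(k, v) for k, v in rows if start <= k < stop]


def scan(table, start, stop):
    """Run a normal scan, or one scan per bucket if the table is bucketed."""
    n = table["num_buckets"]
    if n == 0:
        # Not bucketed: just run the scan like normal.
        return plain_scan(table["rows"], start, stop)

    def scan_bucket(bucket):
        # Assumed key layout: one bucket byte followed by the original key.
        prefix = bytes([bucket])
        hits = plain_scan(table["rows"], prefix + start, prefix + stop)
        return [(k[1:], v) for k, v in hits]  # strip the bucket byte

    # One worker per bucket; wait for every per-bucket result set.
    with ThreadPoolExecutor(max_workers=n) as pool:
        per_bucket = list(pool.map(scan_bucket, range(n)))

    # Each per-bucket list is already sorted, so a k-way merge keeps order.
    return list(heapq.merge(*per_bucket))
```

The caller uses the same scan(table, start, stop) call for either kind of 
table, which is the abstraction being asked for. 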

You would then need to wait until all of the result sets return before you can 
funnel the data in to a single result set to be returned to the user. 

Of course I'm assuming that each result set will start to send back results 
prior to completion of the ensuing scan. 
Note too that these will be range scans. 

One other side effect is that if the scan is a full table scan... things will 
get a bit messy. (Well, maybe not... )



[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-07-29 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646411#comment-14646411
 ] 

Anoop Sam John commented on HBASE-12853:


As per the discussion in the Jira comments, we cannot do this as a server-side 
feature. This will be a client-side thing.  Whether the priority is marked 
minor or major is not the main thing, IMHO.  What matters is a small doc about 
the approach, and a patch. Many of us will be happy to review that when it 
comes. As long as a feature adds value for the team, we are all open to it.   
Are you going to work on this?   If not, there is no point in keeping it open.  
We can see whether anyone else is willing to take this up. If no one is, better 
to close it as Later/Won't Implement.  



[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-07-29 Thread Michael Segel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646347#comment-14646347
 ] 

Michael Segel  commented on HBASE-12853:


@Sean, 

As I have said before... Apache doesn't indemnify committers (actually it's the 
reverse) and there is no upside for me to offset the risk. 

In a nutshell it would be pointless in having a discussion on why I used the 
term trivial and why I rated this as a low priority. 

BTW, there are 11 watchers... why don't you ask those watchers who are also 
committers and leaders of the HBase project, why they didn't raise the 
priority? 

I don't wish to seem rude, but if you're going to lecture someone, you had 
better realize that some will ignore you, others will mock you... 

To your point, this was the first JIRA that I raised.  I assumed that those who 
volunteer their time would also take the time to assess the value of the 
suggestion.  Clearly not.  That was my mistake. 

To be honest, I lack the patience to suffer fools...  






[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-07-20 Thread Michael Segel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633793#comment-14633793
 ] 

Michael Segel  commented on HBASE-12853:


Nothing new? 

Seriously?

This is a trivial feature.


> distributed write pattern to replace ad hoc 'salting'
> -
>
> Key: HBASE-12853
> URL: https://issues.apache.org/jira/browse/HBASE-12853
> Project: HBase
>  Issue Type: New Feature
>Reporter: Michael Segel 
>Priority: Minor


[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-04-02 Thread Michael Segel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392726#comment-14392726
 ] 

Michael Segel  commented on HBASE-12853:


Sorry, I thought that the HM had the META table cached in memory. Didn't think 
that the META was too large.

Ok, so then it looks like what I want to do is all client side. 

The design is pretty straightforward. 

The number of buckets is fixed at the time of table creation. 
The row key is a composite key of  bucket_id | rowkey,  and the bucket_id is 
derived by taking the first byte of the row key modulo N, N being the number 
of buckets (giving you a maximum of 0xFF (255) buckets). Then when you want to 
fetch a single row given the rowkey, you can find the bucket and fetch the 
single row. If you need to do a scan, given the start row, you can then create 
N parallel threads and within each thread, start the scan by prepending the 
bucket_id |  to the start rowkey.
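
As a rough Python sketch of that key scheme: the function names are made up 
for illustration, the "|" is read as plain byte concatenation, and the bucket 
id is assumed to be stored as a single leading byte. None of this is HBase 
API. 

```python
def bucket_id(rowkey: bytes, num_buckets: int) -> int:
    """First byte of the row key modulo the number of buckets."""
    if not rowkey:
        raise ValueError("row key must not be empty")
    # Upper limit follows the 255-bucket maximum stated in the comment.
    if not 1 <= num_buckets <= 255:
        raise ValueError("num_buckets must be between 1 and 255")
    return rowkey[0] % num_buckets


def bucketed_key(rowkey: bytes, num_buckets: int) -> bytes:
    """Composite key: one bucket byte followed by the original row key."""
    return bytes([bucket_id(rowkey, num_buckets)]) + rowkey


def scan_start_keys(start_row: bytes, num_buckets: int) -> list:
    """One start key per bucket, for the N parallel scans."""
    return [bytes([b]) + start_row for b in range(num_buckets)]
```

A single-row fetch recomputes bucketed_key(rowkey, N) and reads that one key; 
a scan has to use every entry of scan_start_keys(start_row, N), because rows 
that follow the start row can land in any bucket. 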

When returning the result set, you can then strip off the bucket_id | and take 
MIN(value(n)), where value(n) is the next row from each scanner, popping it off 
that scanner's stack. This will give you a single result set that is guaranteed 
to still be in sort order. 
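
The MIN-across-scanners step is a k-way merge. A sketch in Python, assuming 
each per-bucket scanner is an iterable of (key, value) pairs already in key 
order and that the bucket prefix is one byte; merged_scan is a hypothetical 
name, not an HBase call. 

```python
import heapq
from typing import Iterable, Iterator, Tuple

Row = Tuple[bytes, bytes]


def merged_scan(scanners: Iterable[Iterable[Row]],
                prefix_len: int = 1) -> Iterator[Row]:
    """Yield rows from all bucket scanners in order of the unprefixed key."""
    heap = []
    for idx, scanner in enumerate(scanners):
        it = iter(scanner)
        first = next(it, None)
        if first is not None:
            # idx breaks ties so values and iterators are never compared.
            heap.append((first[0][prefix_len:], idx, first[1], it))
    heapq.heapify(heap)

    while heap:
        # Smallest "next row" across all scanners.
        key, idx, value, it = heapq.heappop(heap)
        yield key, value
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt[0][prefix_len:], idx, nxt[1], it))
```

Because it is a generator, rows can be handed to the caller as soon as every 
scanner has produced its first row, rather than after all scans complete. 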

It's all client side and it abstracts the bucketing from the user/client code so 
that the same code will run against either table without any changes.





[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-04-02 Thread Michael Segel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392727#comment-14392727
 ] 

Michael Segel  commented on HBASE-12853:


Sorry, the value '(' n ')' gets translated into a downvote (n).




[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-03-27 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385112#comment-14385112
 ] 

Anoop Sam John commented on HBASE-12853:


Just one correction..  The client side has to contact the META (single) region 
to determine the regions and their locations for the scan. So not the HM.  (If 
the HM is acting as another RS and holding the META region, then yes, it goes 
to the HM.)  So where the META region sits matters. Hope I am making it clear 
now.
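
A toy Python model of that lookup, to make the point concrete. It is not HBase 
client code: meta here is just a hypothetical list of (region start key, 
server) pairs sorted by start key, with the first region starting at the empty 
key, and both function names are invented. 

```python
import bisect


def locate_region(meta, rowkey):
    """Return the (start_key, server) entry whose region holds rowkey."""
    starts = [start for start, _ in meta]
    # Last region whose start key is <= rowkey.
    return meta[bisect.bisect_right(starts, rowkey) - 1]


def regions_for_scan(meta, start, stop):
    """Return every region entry overlapping the range [start, stop)."""
    starts = [start for start, _ in meta]
    lo = bisect.bisect_right(starts, start) - 1
    hi = bisect.bisect_left(starts, stop)  # stop is exclusive
    return meta[lo:hi]
```

With the locations in hand, the client opens its scanners against those 
region servers directly; the master is not on the read path. 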



[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-03-27 Thread Michael Segel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384111#comment-14384111
 ] 

Michael Segel  commented on HBASE-12853:


To add to my comment in response to Anoop, 

I wanted to abstract the concept of a scan from the application.  Normally 
you'd do this on the server side of the client/server split, however... w.r.t 
HBase, this becomes a bit more difficult. 

Ok... 
So if I understand this: 
The client passes a scanner object to an instance of the table in the 
table.scan(scanner) call. 
So still on the client side, the client's table object will then connect with 
the HMaster and determine which region and region server is required to start 
the table scan, and then the client connects directly to the region and starts 
the scan? 

What do you call the scanner object that's running on the region? 



[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-03-27 Thread Michael Segel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384102#comment-14384102
 ] 

Michael Segel  commented on HBASE-12853:


Ok, 
So then when a scanner object is passed from the client to the server, will 
the client ask the HMaster for the region(s) that satisfy the scan, or just 
the first region? 

This would imply that when running an m/r job, the m/r program will ask the 
HMaster for the regions, create a split for each region in the list, and 
then each mapper task will initiate its own scan over a specific region? 

Ok... on one level that makes sense for m/r, because you wouldn't want 1000 
mappers trying to coordinate queries with the HMaster at the same time; it 
could become a bottleneck. 

On the other side, if you're using HBase as a database outside of Map/Reduce, 
you'd want to have a query engine that would abstract the underlying workings 
of a scan from the client. 





[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-03-16 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363448#comment-14363448
 ] 

Anoop Sam John commented on HBASE-12853:


Please note that for reads, the HBase Master won't do any coordination. The 
client talks directly with the RS where its intended region resides.  
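
As a rough illustration of that read path: the client keeps a cached map of
region start keys to region servers (in a real cluster this comes from the
hbase:meta table) and picks the server itself. The sketch below is a simplified
Python model with a hard-coded, hypothetical region map; it is not HBase client
code.

```python
import bisect

# Hypothetical cached region map: (start_key, region_server), sorted by start key.
REGIONS = [("", "rs1"), ("g", "rs2"), ("p", "rs3")]
START_KEYS = [start for start, _ in REGIONS]

def locate(row_key):
    """Return the region server hosting row_key, using only the cached map."""
    # The hosting region is the last one whose start key is <= row_key.
    idx = bisect.bisect_right(START_KEYS, row_key) - 1
    return REGIONS[idx][1]

print(locate("apple"))  # -> rs1
print(locate("g"))      # -> rs2
print(locate("zebra"))  # -> rs3
```

No master lookup appears anywhere in `locate`; the request goes straight to
whichever server the cached map names.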



[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-02-20 Thread Michael Segel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329576#comment-14329576
 ] 

Michael Segel  commented on HBASE-12853:


I should also add that this is one area where one must take care in the 
design, because if it is not done properly or cleanly, it will kill performance. 




[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-02-20 Thread Michael Segel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329574#comment-14329574
 ] 

Michael Segel  commented on HBASE-12853:


The design seems straightforward, at least as a starting point. (YMMV) 

The client will create a reference to a table and then instantiate a scanner 
object along with any associated filters. 
The client then passes this object to the server, expecting a result set to be 
returned. 

On the server side, it seems that the HBase Master (active) gets the scan 
request and then starts to do the heavy lifting. 

By providing more intelligence to this process, it's possible to do more than 
just allow bucketed tables to abstract the buckets and act as if they were 
regular tables. 
The key question is how best to redesign this initial entry point to allow for 
such extensibility.



[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-01-26 Thread Michael Segel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291765#comment-14291765
 ] 

Michael Segel  commented on HBASE-12853:


Before we go into a design, I need to get a bit more information. 
As a practice, I don't review HBase source code and work from the exposed APIs. 
Of course looking at the HBase API these days is a bit of a CF since most of 
the APIs are deprecated, referring to other deprecated classes / interfaces etc 
... not to mention there are a couple of different releases...

So we start with a Connection instance, from which we get an instance of class 
Table for the given table. 
Ignoring put() for a moment, we have get() and getScanner() methods. 

What happens on the server side of the connection when the client calls 
getScanner() or get() ? 

Part of the issue is that a simple scanner won't work right unless you end up 
preprocessing it and treating it as a scanner but with a default (blank) set of 
filters. 

So while I can walk you through the logic and give you a resulting diagram, I 
need a committer who's familiar with the server side workings.  Then it should 
be a pretty straightforward thing to implement. 

-Mike




[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-01-19 Thread Michael Segel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282337#comment-14282337
 ] 

Michael Segel  commented on HBASE-12853:


Sure... 

Just a couple of things... 
1) I would like to make sure that the split between client/server in HBase 
works the way I think it does. 

2) I need to get some free time. (Day Job, conference talks, R&D, ...) 

This is one issue that is specific to HBase and doesn't conflict with any prior 
work I may have done. 




[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-01-16 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281074#comment-14281074
 ] 

Lars Hofhansl commented on HBASE-12853:
---

Let's see a design :)



[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-01-16 Thread Michael Segel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280461#comment-14280461
 ] 

Michael Segel  commented on HBASE-12853:


"An implemented one is OneBytePrefixKeySalter, where the prefix is 
hash(RowKey)%buckets" 

That's fine. But now if I have another client, I have to know that the table is 
bucketed. (Yes, I am refusing to use the term salt when talking about this... 
:-) )

And not only do you need to know that the table is bucketed, you need to know 
the number of buckets.  You are also assuming that the individual is using a 
Java application to query the data, and that they've got the Intel library. 
What happens if they are not? 

If it's done server side, all of that goes away.
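
To make the quoted scheme concrete, here is a small Python sketch of a one-byte
prefix computed as hash(RowKey) % buckets. The choice of CRC32 as the hash and
the function name are assumptions made for illustration; the Intel library may
hash differently.

```python
import zlib

def bucketed_key(row_key: bytes, buckets: int) -> bytes:
    """Prepend a one-byte bucket id computed as hash(row_key) % buckets."""
    if not 1 <= buckets <= 256:
        raise ValueError("a one-byte prefix supports 1..256 buckets")
    # CRC32 is an assumed stand-in for the hash; it is stable across processes.
    bucket = zlib.crc32(row_key) % buckets
    return bytes([bucket]) + row_key

stored = bucketed_key(b"row-0001", 16)
print(stored[0], stored[1:])
```

Any client that wants to fetch `row-0001` directly has to call the same
function with the same `buckets` value, which is exactly the knowledge the
comment argues should live on the server side.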



[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-01-16 Thread Michael Segel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280440#comment-14280440
 ] 

Michael Segel  commented on HBASE-12853:



First, let's get away from using the term salted. 
Salts do have a specific meaning and it's associated with cryptography. While 
we're clearly not talking about cryptography, it implies that the prefix is 
orthogonal to the data set and the number of salted values is bound by the 
width of the prefix. 

Using the term bucketing the table would be more appropriate because in this 
example, you're assigning a prefix from a round robin approach. 

I have to apologize, I don't play with HBase that much these days... my work is 
client driven.
With respect to client/server, it seems that the delineation between client and 
server is a bit different from what I would expect from other databases.   In 
HBase, the client creates a scan, and then the HMaster will manage the scan and 
return a pointer to the result set? 

With respect to the client side code... you're missing the point. You want to 
abstract the bucketing from the client, so that the same scan will run against 
a bucketed table and an un-bucketed table. The only exposed difference is that 
the metadata for the table will specify the number of buckets, which defaults 
to 1 (no bucketing). 
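
A toy model of that abstraction, written in Python purely for illustration:
the bucket count is a property of the table (defaulting to 1), and `scan()`
returns rows in row-key order either way. The class and its methods are
hypothetical, the table is an in-memory dict, and a hash is used to pick the
bucket where the comment above mentions round robin.

```python
import heapq
import zlib

class BucketedTable:
    """Toy in-memory table; `buckets` stands in for the table metadata field."""

    def __init__(self, buckets=1):
        self.buckets = buckets
        self._rows = {}  # (bucket, row_key) -> value

    def _bucket(self, row_key):
        # Assumed bucket assignment: CRC32 of the key modulo the bucket count.
        return zlib.crc32(row_key.encode()) % self.buckets

    def put(self, row_key, value):
        self._rows[(self._bucket(row_key), row_key)] = value

    def scan(self, start="", stop=None):
        """Return (row_key, value) pairs in row-key order, hiding the buckets."""
        per_bucket = []
        for b in range(self.buckets):
            keys = sorted(k for (bk, k) in self._rows
                          if bk == b and k >= start and (stop is None or k < stop))
            per_bucket.append([(k, self._rows[(b, k)]) for k in keys])
        # Each per-bucket list is sorted, so a k-way merge restores global order.
        return list(heapq.merge(*per_bucket))

plain, bucketed = BucketedTable(), BucketedTable(buckets=4)
for i in range(20):
    plain.put(f"row{i:03d}", i)
    bucketed.put(f"row{i:03d}", i)
print(plain.scan("row005", "row010") == bucketed.scan("row005", "row010"))  # -> True
```

The caller never sees a prefix; only the constructor argument differs between
the two tables.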





[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-01-14 Thread Liu Shaohui (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278320#comment-14278320
 ] 

Liu Shaohui commented on HBASE-12853:
-

The Intel Hadoop team has open-sourced a salted table implementation at 
https://github.com/intel-hadoop/SaltedHTable.
It is also a client-side library and the code is very clean.

Personally, I think a client-side library is simple enough for salted tables.



[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-01-14 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277488#comment-14277488
 ] 

Lars Hofhansl commented on HBASE-12853:
---

The coprocessors are per region, and you want the "salting" for spreading 
across regions. So you mean to have some region server contact other region 
servers in order to execute a portion of a scan there?

Phoenix does the parallelization on the client and then farms out the work to 
the various region servers, which then execute the requests with the help of 
per region coprocessors.

It would be nice to completely hide this. We might have to invent something 
new for that.




[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-01-14 Thread Michael Segel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277413#comment-14277413
 ] 

Michael Segel  commented on HBASE-12853:


Lars, 
No, it will all be server side. 
That's the beauty of it. The client won't know anything about the underlying 
differences. 

Today, you can easily do this client side and then you have the responsibility 
for managing the N scanners and merging the result set(s). The idea is to do 
this server side so that clients won't need to know any of the details. 

Again, Phoenix implies that it does something like this. However, having a 
tighter coupling to HBase would mean that there are no client side changes.  
Clients would have one API to get data from a regular table or one that used 
buckets. The only difference would be in the table definition and parameters 
for the table. 

Does that make sense? 




[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-01-14 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277341#comment-14277341
 ] 

Lars Hofhansl commented on HBASE-12853:
---

Thanks [~msegel]. It would mostly be client side code, right? I.e. prefixing 
keys before issuing the writes and performing the right fanning out upon 
scanning. I don't think that would need any server-side logic (a.k.a. 
coprocessors), but I might be wrong.



[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-01-14 Thread Michael Segel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277024#comment-14277024
 ] 

Michael Segel  commented on HBASE-12853:


Note that some of this may actually be in Phoenix so it could be redundant...
http://phoenix.apache.org/salted.html
implies some of this... but does not go into detail...




[jira] [Commented] (HBASE-12853) distributed write pattern to replace ad hoc 'salting'

2015-01-14 Thread Michael Segel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276897#comment-14276897
 ] 

Michael Segel  commented on HBASE-12853:


On second thought, rather than try to automate the number of regions to be used 
in the prefix, it may just be easier to define a parameter that contains the 
number of parallel buckets. (Apologies for using very loose terminology.) We 
could say buckets or parallelization factor.  

We may have 100 RS but only want to use a parallel factor of 10, which could be 
enough to alleviate the hot spotting. It also makes it easier if the size of 
the cluster is relatively dynamic with the adding and subtracting of RS. 

Also, apologies if this concept has already been raised. 
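
One way to see why a fixed parameter is easier than tying the prefix to the RS
count: if the prefix is derived from the key and the divisor changes, the
prefix computed for an existing key can change too, so reads would look in the
wrong bucket. A small Python sketch, assuming a hash-based prefix (CRC32 here,
chosen only for illustration) and made-up key names:

```python
import zlib

def bucket_for(row_key: bytes, buckets: int) -> int:
    # Assumed prefix rule: CRC32 of the key modulo the bucket count.
    return zlib.crc32(row_key) % buckets

def remapped(keys, old_buckets, new_buckets):
    """Count keys whose computed bucket differs between the two bucket counts."""
    return sum(bucket_for(k, old_buckets) != bucket_for(k, new_buckets)
               for k in keys)

keys = [f"event-{i:05d}".encode() for i in range(1000)]

# Prefix tied to the RS count: growing the cluster from 100 to 101 servers
# changes the divisor, so previously written keys can map to a new prefix.
print(remapped(keys, 100, 101))

# Prefix tied to a fixed parallel factor of 10: cluster size never enters
# the calculation, so nothing moves.
print(remapped(keys, 10, 10))  # -> 0
```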
