[jira] [Commented] (HBASE-3607) Cursor functionality for results generated by Coprocessors
[ https://issues.apache.org/jira/browse/HBASE-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040013#comment-13040013 ]

Himanshu Vashishtha commented on HBASE-3607:

So, I did some experiments with coprocessors and wrote a blog post describing them:
http://hbase-coprocessor-experiments.blogspot.com/2011/05/extending.html

Thanks,
Himanshu

Cursor functionality for results generated by Coprocessors
--

Key: HBASE-3607
URL: https://issues.apache.org/jira/browse/HBASE-3607
Project: HBase
Issue Type: New Feature
Components: coprocessors
Reporter: Himanshu Vashishtha
Attachments: patch-2.txt, patch-3607-3.txt

I tried to come up with scanner-like functionality for results generated by coprocessors at the region level. This is just a proof of concept, and it would be good to have your comments on it. It supports both Incremental and In-memory result sets.

Attached is a patch that has a test case for an incremental result: the client receives a cursorId from the CP core method, instantiates a cursor object, and iterates over the result set. The client can set a cache limit on the CursorCallable object to reduce the number of RPCs, just like scanners.

In its current state it has some limitations too :) For example, it is region specific: one can instantiate and use a cursor at one region only (and that region is determined by the input row given when instantiating the cursor). I will try to expand it so that it has at least sequential access to other regions, but as I said, I want the opinion of experts on whether this approach really makes sense.

I have tested it with the built-in testing framework, on my laptop only.

It will be good if I copy the use case here in the description too. The scenario is that I have these row keys in the test table:

'aaa-123'
'aaa-456'
'abc-111'
'abd-111'
'abd-222'

I want to return:

('aaa', 2)
('abc', 1)
('abd', 2)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
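The use case in the description amounts to a group-by-prefix count over sorted row keys. The following is a minimal sketch of that computation only, assuming the prefix is everything before the first '-'; the class and method names are hypothetical and are not taken from the attached patch or from the HBase API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of HBase or the attached patch.
final class PrefixCountSketch {

    // Counts row keys per prefix, where the prefix is the text before the first '-'.
    // A LinkedHashMap keeps prefixes in first-seen order, which matches key order
    // when the keys arrive sorted, as they would from a region scan.
    static Map<String, Integer> countByPrefix(List<String> rowKeys) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String key : rowKeys) {
            int dash = key.indexOf('-');
            String prefix = dash < 0 ? key : key.substring(0, dash);
            counts.merge(prefix, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("aaa-123", "aaa-456", "abc-111", "abd-111", "abd-222");
        System.out.println(countByPrefix(keys)); // {aaa=2, abc=1, abd=2}
    }
}
```

In the proposal, each (prefix, count) pair would be one row of the computed result set that the cursor hands to the client in batches.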
[jira] [Commented] (HBASE-3607) Cursor functionality for results generated by Coprocessors
[ https://issues.apache.org/jira/browse/HBASE-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029480#comment-13029480 ]

Himanshu Vashishtha commented on HBASE-3607:

I think I need to work more on boundary cases. Will do after 16th May.
[jira] [Commented] (HBASE-3607) Cursor functionality for results generated by Coprocessors
[ https://issues.apache.org/jira/browse/HBASE-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020055#comment-13020055 ]

Himanshu Vashishtha commented on HBASE-3607:

Thanks for the review, Gary; I really appreciate your time and effort. The key points I took away are:

a) Flesh out a neater client-side API, where the client gets a handle on which to invoke iteration methods. It should be as simple as that. Giving the client a long integer id does not help.
b) In executing the above calls, use the existing CP RPC mechanism (so get rid of CursorCallable).
c) Use existing code for scanners and other stateful objects at the RS. Can RegionObserver be used for maintaining these objects?

I am in the process of coming up with a better approach and am facing one design question on the server side. It would be great to have comments on it:

a) When it comes to maintaining stateful scanners on the RS side, we are dealing with InternalScanner instances that are created to scan a single region. As such, they can't be registered at the RegionServer, because a region has only limited access to its HRS (via RegionServerServices). Storing these objects at the RS level has at least two benefits:
i) Current scanners are registered this way (so existing code can be used).
ii) These internal scanners are instantiated per region, so if we try to register (housekeep) them in a CP, we will have that many lease objects (each a daemon thread), which is not justifiable, or else a timer object or similar to keep the resources in check.

So these stateful scan objects should be registered at the RS level. To do so, a region (or the CP) needs access to the RS API that does this job, such as addScanner(InternalScanner). Currently it has RegionServerServices, but that can't be used to register these scan objects. One approach is to add such a method in HRS and then either add a method in RS (or refactor the existing addScanner method appropriately).

Is this the right way, or is there a better approach?
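To make the design question concrete, here is a rough sketch of what "register at the RS level" could look like: one registry shared by all regions, with a single expiry sweep instead of one lease thread per region. Every name here is hypothetical; this is not the HRegionServer lease code and not the patch, and time is passed in explicitly only to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical server-wide registry; none of these names exist in HBase.
final class CursorRegistrySketch<S> {

    private static final class Entry<V> {
        final V scanner;
        volatile long lastAccessMillis;

        Entry(V scanner, long nowMillis) {
            this.scanner = scanner;
            this.lastAccessMillis = nowMillis;
        }
    }

    private final Map<Long, Entry<S>> entries = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();
    private final long leaseMillis;

    CursorRegistrySketch(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    // Registers a stateful scan object from any region and returns its id.
    long register(S scanner, long nowMillis) {
        long id = nextId.incrementAndGet();
        entries.put(id, new Entry<>(scanner, nowMillis));
        return id;
    }

    // Returns the scanner and renews its lease, or null if the id is unknown
    // (for example, because a sweep already expired it).
    S touch(long id, long nowMillis) {
        Entry<S> e = entries.get(id);
        if (e == null) {
            return null;
        }
        e.lastAccessMillis = nowMillis;
        return e.scanner;
    }

    // A single sweep covers the scanners of every region. Expired scanners are
    // removed and returned so the caller can close them.
    List<S> expire(long nowMillis) {
        List<S> expired = new ArrayList<>();
        Iterator<Entry<S>> it = entries.values().iterator();
        while (it.hasNext()) {
            Entry<S> e = it.next();
            if (nowMillis - e.lastAccessMillis > leaseMillis) {
                it.remove();
                expired.add(e.scanner);
            }
        }
        return expired;
    }
}
```

The point of the sketch is the cost argument in ii) above: the housekeeping cost is one periodic call to expire() per server, however many regions hold open cursors.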
[jira] [Commented] (HBASE-3607) Cursor functionality for results generated by Coprocessors
[ https://issues.apache.org/jira/browse/HBASE-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011346#comment-13011346 ]

Gary Helmling commented on HBASE-3607:

I posted a review of this on review board: https://review.cloudera.org/r/1624/

(We really need to get the damn email spam scoring fixed).
[jira] Commented: (HBASE-3607) Cursor functionality for results generated by Coprocessors
[ https://issues.apache.org/jira/browse/HBASE-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006804#comment-13006804 ]

Himanshu Vashishtha commented on HBASE-3607:

First, thanks for reviewing it, Stack. Sorry for not making the requirements clear in the description.

You asked:
What is CursorCallable adding over and above Scanner? It's not clear to me (pardon me).

A scanner reads the raw (virgin) rows of the table, and one can add filters etc. to do the sieving. A cursor traverses a computed result set, one that is the result of some CP computation. This is useful when, instead of getting a single value as the post-computation result at the region level (like the aggregate functions), the result set is a bunch of rows. The cursor provides a mechanism to consume this computed result set (by sending it to the client piecewise) and, if necessary, to ask the CP to produce more of the result. Therefore, it supports two types of result sets: Incremental and InMemory.

Incremental: Results can be generated on a per-row (or per-group-of-rows) basis, as in the test case used in the patch. If a client asks for 100 rows in one RPC, the corresponding cursor object will return exactly that many rows in the next call.

InMemory: This is like computing the top K rows in one region. Here, the result set _has_ to be precomputed before the cursor object is instantiated and the handle is given to the client. Once the result set is created, a cursor object is created. Invoking next()-like methods will only consume the result set (as it is already computed over the entire region).

I hope this clarification is useful.

Yes, in the current patch it fails fast in case of a region split (it just abandons the process and leaves it to the client to resubmit the request).
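The difference between the two result-set types can be sketched as two implementations of one small interface. This is an illustration under assumptions, not the classes in the patch: the interface, both class names, and the choice of "top K integers" for the in-memory case are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical types for illustration; they are not the classes in the attached patch.
interface ResultSetSketch<T> {
    // Returns up to 'limit' results; an empty list means the result set is exhausted.
    List<T> next(int limit);
}

// Incremental: each result is computed only when a batch asks for it.
final class IncrementalResultSetSketch<R, T> implements ResultSetSketch<T> {
    private final Iterator<R> rows;
    private final Function<R, T> compute;

    IncrementalResultSetSketch(Iterator<R> rows, Function<R, T> compute) {
        this.rows = rows;
        this.compute = compute;
    }

    @Override
    public List<T> next(int limit) {
        List<T> batch = new ArrayList<>();
        while (batch.size() < limit && rows.hasNext()) {
            batch.add(compute.apply(rows.next()));
        }
        return batch;
    }
}

// In-memory: the whole result (here, the top K values of a region) is computed
// up front, and next() only hands out slices of it.
final class InMemoryTopKSketch implements ResultSetSketch<Integer> {
    private final List<Integer> topK;
    private int position = 0;

    InMemoryTopKSketch(List<Integer> regionValues, int k) {
        List<Integer> sorted = new ArrayList<>(regionValues);
        sorted.sort(Collections.reverseOrder());
        this.topK = new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    @Override
    public List<Integer> next(int limit) {
        int end = Math.min(position + limit, topK.size());
        List<Integer> batch = new ArrayList<>(topK.subList(position, end));
        position = end;
        return batch;
    }
}
```

The incremental variant does work proportional to the batch size on each call, while the in-memory variant pays the full cost in its constructor, which is why the handle can only be returned to the client after the region has been processed.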
[jira] Commented: (HBASE-3607) Cursor functionality for results generated by Coprocessors
[ https://issues.apache.org/jira/browse/HBASE-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006796#comment-13006796 ]

stack commented on HBASE-3607:

So, what happens if the region moves mid-cursor-scan?

What is CursorCallable adding over and above Scanner? It's not clear to me (pardon me).

You are inconsistent in your formatting:

{code}
+if(cache.size() ==0 && this.closed)
+ return null;
+if(cache.size() ==0){//do a rpc and fetch results
+ Result[] res = this.htable.getConnection().getRegi
{code}

These are pretty radical additions:

{code}
+ /*
+ * get result from cp cursor
+ */
+ public Result[] nextCp(long cursorId, int cache) throws IOException;
+ /**
+ * closing the associated cursor object and release its region level resources
+ * @param cursorId
+ * @throws IOException
+ */
+ public void closeCp(long cursorId) throws IOException;
{code}

Are they necessary? Why do we have to mod the HRegion when we have CPs now? Yeah, same for these additions to HRegionServer.

I do not see the direct benefit to all these big changes, Himanshu. Help me understand.
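For readers following the first quoted snippet, the client-side logic it belongs to can be sketched roughly as below: serve results from a local cache, and make one round trip only when the cache runs dry. This is a sketch under assumptions, not the patch's code. The class name is hypothetical, and fetchBatch is a stand-in for the remote call (nextCp in the quoted interface); an empty batch is assumed to mean the server has nothing more to send.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical client-side cursor; 'fetchBatch' stands in for the RPC and is
// not a real HBase call.
final class ClientCursorSketch<T> {
    private final Deque<T> cache = new ArrayDeque<>();
    private final IntFunction<List<T>> fetchBatch;
    private final int cacheLimit;
    private boolean closed = false;
    private int rpcCount = 0;

    ClientCursorSketch(IntFunction<List<T>> fetchBatch, int cacheLimit) {
        this.fetchBatch = fetchBatch;
        this.cacheLimit = cacheLimit;
    }

    // Returns the next result, or null once the server has nothing more to send.
    T next() {
        if (cache.isEmpty() && !closed) {
            List<T> batch = fetchBatch.apply(cacheLimit); // one round trip per batch
            rpcCount++;
            if (batch.isEmpty()) {
                closed = true; // assumed end-of-results signal
            } else {
                cache.addAll(batch);
            }
        }
        return cache.poll(); // null when the cache is empty and the cursor is closed
    }

    // Number of simulated round trips made so far.
    int rpcCount() {
        return rpcCount;
    }
}
```

A larger cacheLimit means fewer round trips for the same number of results, which is the same trade-off scanner caching makes.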