Recently we upgraded to CDH5.3.8, and needed to use CDAP Readless Increments (https://github.com/caskdata/cdap) to overcome the recently-fixed performance regression around Increment.
We are now looking to upgrade to CDH5.5.x, and have attempted to upgrade 1 slave in our 5.3.8 cluster to 5.5.0. Unfortunately, on this slave we are seeing errors like the one below, for use cases that work perfectly fine on the rest of the cluster:

https://gist.github.com/bbeaudreault/5214b28319981c18379f

We typically see this within minutes of the server starting. The exception originally comes from CDAP's IncrementSummingHandler invocation, but eventually falls through to the same underlying HBase libraries as the stack trace above (that trace is from my test code, see below, written in an attempt to exonerate CDAP).

In debugging, I've done the following to rule out CDAP's wrappers as a possible problem:

    try {
      // cdap code
    } catch (AssertionError e) {
      // Do a normal HRegion.get(get, false) and HRegion.getScanner()
    }

Both of the follow-up raw get and scan calls also throw the same AssertionError. HOWEVER, if I first call HRegion.flushcache(true), those same raw get and scan calls then succeed (though they return 0 results).

I understand that at this point we are using a modified version of an unofficial coprocessor, and that it will be hard for you guys to provide a clear solution without knowing the full environment. I'm hoping to just get some guidance, as I'm in the midst of debugging this and have a few questions I was hoping you guys might have knowledge of:

- Where does OLDEST_TIMESTAMP/Minimum come from? I see it in the code, but don't really see any KeyValues instantiated with it in any code paths that I am familiar with.
- Any ideas on how to inspect the full result that is causing this issue? I'm in the unfortunate position of not knowing how to reproduce the issue (the table in question gets a wide range of usage patterns from multiple clients). Since a flushcache() call seems to fix the issue (though it oddly returns 0 results afterward), it seems like I can't just read a raw HFile. Could the problem be in the memstore?
- Any thoughts/hints on how this AssertionError could possibly be triggered, which I could use to inform my investigation?

Any thoughts would be greatly appreciated. Thanks!