As an update: I raised the tablet server memory and have not seen this error since. I'd like to say that raising the memory alone was the solution, but it appears I may also have performance issues with the switches connecting the racks. I'll post more as I dig in further.
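For anyone who hits the same thing, the workaround Josh describes in the quoted thread (throw away the failed writer, make a new one, and resend whatever was added since the last successful flush) could look roughly like the sketch below. It is plain Java with no Accumulo dependency: `Sink`, `ReplayingWriter`, and `FlakySink` are hypothetical names standing in for the real BatchWriter, and plain strings stand in for mutations. It assumes that replaying a mutation the server may already have applied is harmless for your data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class ReplayingWriterSketch {

    /** Hypothetical stand-in for a batch writer; this is NOT the Accumulo API. */
    interface Sink {
        void add(String mutation) throws Exception;
        void flush() throws Exception;
        void close();
    }

    /** Keeps a copy of everything added since the last successful flush. */
    static final class ReplayingWriter {
        private final Supplier<Sink> factory;
        private final int maxAttempts;
        private final List<String> unflushed = new ArrayList<>();
        private Sink sink;

        ReplayingWriter(Supplier<Sink> factory, int maxAttempts) {
            this.factory = factory;
            this.maxAttempts = maxAttempts;
            this.sink = factory.get();
        }

        synchronized void add(String mutation) {
            unflushed.add(mutation); // saved first, so a failed add is replayed too
            try {
                sink.add(mutation);
            } catch (Exception e) {
                recover(e);
            }
        }

        synchronized void flush() {
            try {
                sink.flush();
                unflushed.clear(); // everything up to here is now acknowledged
            } catch (Exception e) {
                recover(e);
            }
        }

        /** Discards the failed sink, builds a fresh one, and replays the saved copy. */
        private void recover(Exception cause) {
            Exception last = cause;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                sink.close();
                sink = factory.get();
                try {
                    for (String m : unflushed) {
                        sink.add(m);
                    }
                    sink.flush();
                    unflushed.clear();
                    return;
                } catch (Exception e) {
                    last = e;
                }
            }
            throw new IllegalStateException("giving up after " + maxAttempts + " attempts", last);
        }
    }

    /** Demo sink: the first instance created fails on flush, later ones succeed. */
    static final class FlakySink implements Sink {
        static int created = 0;
        static final List<String> durable = new ArrayList<>();
        private final int id = ++created;
        private final List<String> pending = new ArrayList<>();

        public void add(String mutation) { pending.add(mutation); }

        public void flush() throws Exception {
            if (id == 1) {
                throw new Exception("simulated server error");
            }
            durable.addAll(pending);
            pending.clear();
        }

        public void close() { }
    }

    public static void main(String[] args) {
        ReplayingWriter writer = new ReplayingWriter(FlakySink::new, 3);
        writer.add("row1");
        writer.add("row2");
        writer.flush(); // first sink fails here; a second is created and both rows are replayed
        System.out.println(FlakySink.durable + " via " + FlakySink.created + " sinks");
    }
}
```

This does mean holding the mutations twice (once in the writer, once in the wrapper), which is the cost I asked about in the thread. Calling flush() regularly keeps the saved copy small.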
On Fri, Aug 22, 2014 at 11:41 PM, Corey Nolet <cjno...@gmail.com> wrote:

> Josh,
>
> Your advice is definitely useful- I also thought about catching the
> exception and retrying with a fresh batch writer but the fact that the
> batch writer failure doesn't go away without being re-instantiated is
> really only a nuisance. The TabletServerBatchWriter could be designed much
> better, I agree, but that is not the root of the problem.
>
> The Thrift exception that is causing the issue is what I'd like to get to
> the bottom of. It's throwing the following:
>
> *TApplicationException: applyUpdates failed: out of sequence response *
>
> I've never seen this exception before in regular use of the client API-
> but I also just updated to 1.6.0. Google isn't showing anything useful for
> how exactly this exception could come about other than using a bad
> threading model- and I don't see any drastic changes or other user
> complaints on the mailing list that would validate that line of thought.
> Quite frankly, I'm stumped. This could be a Thrift exception related to a
> Thrift bug or something bad on my system and have nothing to do with
> Accumulo.
>
> Chris Tubbs mentioned to me earlier that he recalled Keith and Eric had
> seen the exception before and may remember what it was/how they fixed it.
>
>
> On Fri, Aug 22, 2014 at 10:58 PM, Josh Elser <josh.el...@gmail.com> wrote:
>
>> Don't mean to tell you that I don't think there might be a bug/otherwise,
>> that's pretty much just the limit of what I know about the server-side
>> sessions :)
>>
>> If you have concrete "this worked in 1.4.4" and "this happens instead
>> with 1.6.0", that'd make a great ticket :D
>>
>> The BatchWriter failure case is pretty rough, actually. Eric has made
>> some changes to help already (in 1.6.1, I think), but it needs an overhaul
>> that I haven't been able to make time to fix properly, either. IIRC, the
>> only guarantee you have is that all mutations added before the last flush()
>> happened are durable on the server. Anything else is a guess. I don't know
>> the specifics, but that should be enough to work with (and saving off
>> mutations shouldn't be too costly since they're stored serialized).
>>
>>
>> On 8/22/14, 5:44 PM, Corey Nolet wrote:
>>
>>> Thanks Josh,
>>>
>>> I understand about the session ID completely but the problem I have is
>>> that the exact same client code worked, line for line, just fine in
>>> 1.4.4 and it's acting up in 1.6.0. I also seem to remember the
>>> BatchWriter automatically creating a new session when one expired
>>> without an exception causing it to fail on the client.
>>>
>>> I know we've made changes since 1.4.4 but I'd like to troubleshoot the
>>> actual issue of the BatchWriter failing due to the thrift exception
>>> rather than just catching the exception and trying mutations again. The
>>> other issue is that I've already submitted a bunch of mutations to the
>>> batch writer from different threads. Does that mean I need to be storing
>>> them off twice? (once in the BatchWriter's cache and once in my own)
>>>
>>> The BatchWriter in my ingester is constantly sending data and the tablet
>>> servers have been given more than enough memory to be able to keep up.
>>> There's no swap being used and the network isn't experiencing any errors.
>>>
>>>
>>> On Fri, Aug 22, 2014 at 4:54 PM, Josh Elser <josh.el...@gmail.com> wrote:
>>>
>>>> If you get an error from a BatchWriter, you pretty much have to throw
>>>> away that instance of the BatchWriter and make a new one. See
>>>> ACCUMULO-2990. If you want, you should be able to catch/recover from
>>>> this without having to restart the ingester.
>>>>
>>>> If the session ID is invalid, my guess is that it hasn't been used
>>>> recently and the tserver cleaned it up. The exception logic isn't the
>>>> greatest (as it just is presented to you as a RTE).
>>>>
>>>> https://issues.apache.org/jira/browse/ACCUMULO-2990
>>>>
>>>>
>>>> On 8/22/14, 4:35 PM, Corey Nolet wrote:
>>>>
>>>>> Eric & Keith, Chris mentioned to me that you guys have seen this issue
>>>>> before. Any ideas from anyone else are much appreciated as well.
>>>>>
>>>>> I recently updated a project's dependencies to Accumulo 1.6.0 built
>>>>> with Hadoop 2.3.0. I've got CDH 5.0.2 deployed. The project has an
>>>>> ingest component which is running all the time with a batch writer
>>>>> using many threads to push mutations into Accumulo.
>>>>>
>>>>> The issue I'm having is a show stopper. At different intervals of time,
>>>>> sometimes an hour, sometimes 30 minutes, I'm getting
>>>>> MutationsRejectedExceptions (server errors) from the
>>>>> TabletServerBatchWriter. Once they start, I need to restart the
>>>>> ingester to get them to stop. They always come back within 30 minutes
>>>>> to an hour... rinse, repeat.
>>>>>
>>>>> The exception always happens on different tablet servers. It's a thrift
>>>>> error saying a message was received out of sequence. In the
>>>>> TabletServer logs, I see an "Invalid session id" exception which
>>>>> happens only once before the client-side batch writer starts spitting
>>>>> out the MREs.
>>>>>
>>>>> I'm running some heavyweight processing in Storm along side the tablet
>>>>> servers. I shut that processing off in hopes that maybe it was the
>>>>> culprit but that hasn't fixed the issue.
>>>>>
>>>>> I'm surprised I haven't seen any other posts on the topic.
>>>>>
>>>>> Thanks!