Re: Tablet server thrift issue

2014-09-01 Thread Corey Nolet
As an update,

I raised the tablet server memory and I have not seen this error thrown
since. I'd like to say that raising the memory alone was the solution, but it
appears that I may also be having some performance issues with the switches
connecting the racks together. I'll post more as I dig in further.



Re: Tablet server thrift issue

2014-08-22 Thread Josh Elser
If you get an error from a BatchWriter, you pretty much have to throw 
away that instance of the BatchWriter and make a new one. See 
ACCUMULO-2990. If you want, you should be able to catch/recover from 
this without having to restart the ingester.


If the session ID is invalid, my guess is that it hasn't been used
recently and the tserver cleaned it up. The exception logic isn't the
greatest (it is just presented to you as an RTE, i.e. a RuntimeException).


https://issues.apache.org/jira/browse/ACCUMULO-2990
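
A minimal sketch of the catch-and-recreate approach described above, in plain Java. It deliberately avoids the Accumulo classes so that it runs on its own: `Writer`, `WriteRejectedException`, `RecoveringWriter` and `FlakyWriter` are hypothetical stand-ins, not Accumulo API. In a real ingester the factory would presumably call `Connector.createBatchWriter(...)` and the catch would handle `MutationsRejectedException`. It only shows the re-instantiation step, not the recovery of mutations that were already buffered in the failed writer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class RecreateWriterSketch {

    /** Hypothetical stand-in for Accumulo's MutationsRejectedException. */
    static class WriteRejectedException extends Exception {
        WriteRejectedException(String message) { super(message); }
    }

    /** Hypothetical stand-in for the part of BatchWriter used here. */
    interface Writer {
        void add(String mutation) throws WriteRejectedException;
        void close();
    }

    /** Replaces the underlying writer with a fresh instance after any failure. */
    static class RecoveringWriter {
        private final Supplier<Writer> factory;
        private Writer current;
        private int recreations = 0;

        RecoveringWriter(Supplier<Writer> factory) {
            this.factory = factory;
            this.current = factory.get();
        }

        /** Adds one mutation, retrying on a new writer up to maxAttempts times. */
        synchronized void add(String mutation, int maxAttempts) throws WriteRejectedException {
            if (maxAttempts < 1) {
                throw new IllegalArgumentException("maxAttempts must be >= 1");
            }
            WriteRejectedException last = null;
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                try {
                    current.add(mutation);
                    return;
                } catch (WriteRejectedException e) {
                    last = e;
                    // A failed writer stays failed, so discard it and build a new one.
                    current.close();
                    current = factory.get();
                    recreations++;
                }
            }
            throw last;
        }

        synchronized int recreations() { return recreations; }
    }

    /** Test double: when flaky, every add from the third call onward fails. */
    static class FlakyWriter implements Writer {
        private final boolean flaky;
        private final List<String> sink;
        private int calls = 0;

        FlakyWriter(boolean flaky, List<String> sink) {
            this.flaky = flaky;
            this.sink = sink;
        }

        @Override
        public void add(String mutation) throws WriteRejectedException {
            calls++;
            if (flaky && calls >= 3) {
                throw new WriteRejectedException("simulated server error");
            }
            sink.add(mutation);
        }

        @Override
        public void close() { }
    }

    /** Writes five mutations; only the first writer built is flaky. */
    static List<String> demo() {
        List<String> sink = new ArrayList<>();
        int[] built = {0};
        RecoveringWriter writer =
            new RecoveringWriter(() -> new FlakyWriter(built[0]++ == 0, sink));
        try {
            for (int i = 0; i < 5; i++) {
                writer.add("m" + i, 3);
            }
        } catch (WriteRejectedException e) {
            throw new IllegalStateException(e);
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The wrapper serializes access with `synchronized`, which is a simplification for an ingester that writes from many threads; the point is only that the failed instance is never reused.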

On 8/22/14, 4:35 PM, Corey Nolet wrote:

Eric & Keith, Chris mentioned to me that you guys have seen this issue
before. Any ideas from anyone else are much appreciated as well.

I recently updated a project's dependencies to Accumulo 1.6.0 built with
Hadoop 2.3.0. I've got CDH 5.0.2 deployed. The project has an ingest
component which is running all the time with a batch writer using many
threads to push mutations into Accumulo.

The issue I'm having is a show stopper. At different intervals of time,
sometimes an hour, sometimes 30 minutes, I'm getting
MutationsRejectedExceptions (server errors) from the
TabletServerBatchWriter. Once they start, I need to restart the ingester to
get them to stop. They always come back within 30 minutes to an hour...
rinse, repeat.

The exception always happens on different tablet servers. It's a Thrift
error saying a message was received out of sequence. In the TabletServer
logs, I see an "Invalid session id" exception which happens only once
before the client-side batch writer starts spitting out the MREs.

I'm running some heavyweight processing in Storm alongside the tablet
servers. I shut that processing off in hopes that maybe it was the culprit
but that hasn't fixed the issue.

I'm surprised I haven't seen any other posts on the topic.

Thanks!



Re: Tablet server thrift issue

2014-08-22 Thread Corey Nolet
Thanks Josh,

I understand about the session ID completely but the problem I have is that
the exact same client code worked, line for line, just fine in 1.4.4 and
it's acting up in 1.6.0. I also seem to remember the BatchWriter
automatically creating a new session when one expired without an exception
causing it to fail on the client.

I know we've made changes since 1.4.4 but I'd like to troubleshoot the
actual issue of the BatchWriter failing due to the thrift exception rather
than just catching the exception and trying mutations again. The other
issue is that I've already submitted a bunch of mutations to the batch
writer from different threads. Does that mean I need to be storing them off
twice? (once in the BatchWriter's cache and once in my own)

The BatchWriter in my ingester is constantly sending data and the tablet
servers have been given more than enough memory to be able to keep up.
There's no swap being used and the network isn't experiencing any errors.






Re: Tablet server thrift issue

2014-08-22 Thread Josh Elser
I don't mean to say there isn't a bug or some other problem; that's
pretty much just the limit of what I know about the server-side
sessions :)


If you have a concrete case of "this worked in 1.4.4" and "this happens
instead with 1.6.0", that'd make a great ticket :D


The BatchWriter failure case is pretty rough, actually. Eric has made
some changes to help already (in 1.6.1, I think), but it needs an
overhaul that I haven't been able to make time for, either.
IIRC, the only guarantee you have is that all mutations added before the 
last flush() happened are durable on the server. Anything else is a 
guess. I don't know the specifics, but that should be enough to work 
with (and saving off mutations shouldn't be too costly since they're 
stored serialized).
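
One way to act on the flush() guarantee described above is to keep a client-side copy of every mutation added since the last successful flush and replay that copy on a new writer after a failure. The sketch below is plain Java with hypothetical stand-ins (`Writer`, `WriteRejectedException`, `ReplayingWriter`, `FakeWriter`), not Accumulo API, and strings stand in for mutations. It assumes that replaying a mutation which did reach the server is acceptable for the data in question, which may not hold for every table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class ReplayBufferSketch {

    /** Hypothetical stand-in for Accumulo's MutationsRejectedException. */
    static class WriteRejectedException extends Exception {
        WriteRejectedException(String message) { super(message); }
    }

    /** Hypothetical stand-in for the part of BatchWriter used here. */
    interface Writer {
        void add(String mutation) throws WriteRejectedException;
        void flush() throws WriteRejectedException;
    }

    /** Remembers everything added since the last successful flush. */
    static class ReplayingWriter {
        private final Supplier<Writer> factory;
        private final List<String> pending = new ArrayList<>();
        private Writer current;

        ReplayingWriter(Supplier<Writer> factory) {
            this.factory = factory;
            this.current = factory.get();
        }

        synchronized void add(String mutation) throws WriteRejectedException {
            pending.add(mutation); // record first, so a failed add is replayed too
            try {
                current.add(mutation);
            } catch (WriteRejectedException e) {
                replayOnFreshWriter();
            }
        }

        synchronized void flush() throws WriteRejectedException {
            try {
                current.flush();
            } catch (WriteRejectedException e) {
                replayOnFreshWriter();
                current.flush(); // a second failure propagates to the caller
            }
            pending.clear(); // everything up to this flush is treated as durable
        }

        private void replayOnFreshWriter() throws WriteRejectedException {
            current = factory.get();
            for (String mutation : pending) {
                current.add(mutation);
            }
        }
    }

    /** Test double: buffers adds and loses the buffer on one chosen flush. */
    static class FakeWriter implements Writer {
        private final int failOnFlushNumber; // 0 means never fail
        private final List<String> server;
        private final List<String> buffer = new ArrayList<>();
        private int flushes = 0;

        FakeWriter(int failOnFlushNumber, List<String> server) {
            this.failOnFlushNumber = failOnFlushNumber;
            this.server = server;
        }

        @Override
        public void add(String mutation) { buffer.add(mutation); }

        @Override
        public void flush() throws WriteRejectedException {
            flushes++;
            if (flushes == failOnFlushNumber) {
                buffer.clear();
                throw new WriteRejectedException("simulated server error");
            }
            server.addAll(buffer);
            buffer.clear();
        }
    }

    /** The first writer fails on its second flush; later writers never fail. */
    static List<String> demo() {
        List<String> server = new ArrayList<>();
        int[] built = {0};
        ReplayingWriter writer = new ReplayingWriter(
            () -> new FakeWriter(built[0]++ == 0 ? 2 : 0, server));
        try {
            writer.add("a");
            writer.add("b");
            writer.flush();
            writer.add("c");
            writer.add("d");
            writer.flush();
        } catch (WriteRejectedException e) {
            throw new IllegalStateException(e);
        }
        return server;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the demo, "a" and "b" are not replayed because they were covered by a successful flush; only "c" and "d" go to the second writer. Flushing more often shrinks the pending copy at the cost of more round trips.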








Re: Tablet server thrift issue

2014-08-22 Thread Corey Nolet
Josh,

Your advice is definitely useful. I also thought about catching the
exception and retrying with a fresh batch writer, but the fact that the
batch writer failure doesn't go away without being re-instantiated is
really only a nuisance. The TabletServerBatchWriter could be designed much
better, I agree, but that is not the root of the problem.

The Thrift exception that is causing the issue is what I'd like to get to
the bottom of. It's throwing the following:

*TApplicationException: applyUpdates failed: out of sequence response*

I've never seen this exception before in regular use of the client API, but
I also just updated to 1.6.0. Google isn't showing anything useful for how
exactly this exception could come about other than using a bad threading
model, and I don't see any drastic changes or other user complaints on the
mailing list that would validate that line of thought. Quite frankly, I'm
stumped. This could be a Thrift exception related to a Thrift bug or
something bad on my system and have nothing to do with Accumulo.
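
For reference, the "bad threading model" explanation can be illustrated with a toy model. The sketch below is not Thrift code; `Client` is a hypothetical class written under the assumption that a client stamps each call with an incrementing sequence id and rejects a reply whose id differs from the most recent call it sent. Under that assumption, two callers sharing one client without coordination can produce this kind of error. It says nothing about whether that is what happened here.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class SeqIdSketch {

    /** Simplified, hypothetical model of one client bound to one connection. */
    static class Client {
        private int seqId = 0; // id of the most recent call sent
        private final Queue<Integer> replies = new ArrayDeque<>(); // reply ids in arrival order

        /** Sends a call; the modeled server echoes the id back, in order. */
        void send() {
            seqId++;
            replies.add(seqId);
        }

        /** Reads the next reply and checks it against the most recent call id. */
        String receive(String method) {
            int replyId = replies.remove();
            if (replyId != seqId) {
                throw new IllegalStateException(method + " failed: out of sequence response");
            }
            return "ok";
        }
    }

    /** One caller at a time: each send is followed by its own receive. */
    static String sequential() {
        Client client = new Client();
        client.send();
        client.receive("applyUpdates");
        client.send();
        return client.receive("applyUpdates");
    }

    /** Two callers share the client: both send before either receives. */
    static String interleaved() {
        Client client = new Client();
        client.send(); // caller A
        client.send(); // caller B
        try {
            return client.receive("applyUpdates"); // caller A reads reply 1, but seqId is 2
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(sequential());
        System.out.println(interleaved());
    }
}
```

The interleaving is written out by hand rather than with real threads so that the outcome is deterministic.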

Chris Tubbs mentioned to me earlier that he recalled Keith and Eric had
seen the exception before and may remember what it was/how they fixed it.

