Re: "sql*net message from client" hang

Dan Fri, 09 Jul 2010 14:47:42 -0700

Some auxiliary info that I just tried, right now the query is of the form
"select id from table where column = x order by foo".  If I change this to
"select id, '123' from table where column = x order by foo", then it
completes every time...  Does this provide any clue into what is going on?
 Just adding a numeric constant into the select fixes it?!  Does not seem to
hang up at all on any blocks when we do this.


Thanks,
Dan

On Fri, Jul 9, 2010 at 5:29 PM, Dan <dkele...@gmail.com> wrote:

> Hello,
>   Thank you very much for your observations, this definitely seems to be
> narrowing in on the problem.  I have tried increasing (decreasing actually)
> the RowCacheSize and it seems to start to flake out around -4425.  At -4424
> it hangs up for 10 seconds or so on that block, but seems to complete every
> time.  At -4425  it hangs up on that block the first two times I ran it, but
> then the next 5 times it only hung up there for ~10 seconds then finishes
> up.  Then I increased to -4430 and also had intermittent results.  -4500
> fails every time on that same block (for 5 runs so far).
>
>   Can you suggest anything I can look into on the network side or the OS
> side as a possible workaround for this without modifying that parameter in
> the code?  The MTU on the boxes is currently 1500, not sure if that is
> related in any way.
>
>
> Thanks,
> Dan
>
>
> On Fri, Jul 9, 2010 at 3:30 PM, John Scoles <byter...@hotmail.com> wrote:
>
>>
>> ok we are getting some place now
>>
>>
>>
>> in DBD::Oracle 1.19 the way it set the rowcache is much different that
>> later versions
>>
>>
>>
>> in this case -1 would mean that the following is applied
>>
>>
>>
>> cache_mem = (cache_mem_iv <= 0) ? 10 * 1460 : cache_mem_iv;
>>
>>
>>
>>
>>
>> so you cache_mem is now set to 10*1460 bytes  for any describes that you
>> do (selects)
>>
>>
>>
>> later on this will be changed when you do a fetch 100,000 rows as your
>> size is -1.
>>
>>
>>
>> So you might want to try a larger - number for the value of RowCacheSize
>>  to see at what point the querry dies again
>>
>>
>>
>> so far as being a bug I am on the fence right now.
>>
>>
>>
>> The ban-width of you network may have changed or the size of the bit
>> stream (16 vs 32 bit bus) so the default rowcachesize may no longer be large
>> engough
>>
>>
>>
>> The value of rowcachesize when there is no value given for rowcache size
>> is calulated like this
>>
>>
>>
>> (from the code)
>>
>>
>>
>> /* Oracle packets on ethernet have max size of around 1460. */
>>  /* We'll aim to fill our row cache with around 10 per go. */
>>  /* Using 10 means any 'runt' packets will have less impact. */
>>
>>
>>  int txfr_size  = 10 * 1460; /* desired transfer/cache size */
>>
>>
>>
>>  /* Use guessed average on-the-wire row width calculated above & */
>>  /* add in overhead of 5 bytes per field plus 8 bytes per row. */
>>  /* The n*5+8 was determined by studying SQL*Net v2 packets. */
>>  /* It could probably benefit from a more detailed analysis. */
>>
>>
>>  est_width += num_fields*5 + 8;
>>
>>
>>
>>  cache_rows = txfr_size / est_width;       /* (maybe 1 or 0) */
>>
>>
>>
>>  /* To ensure good performance with large rows (near or larger */
>>  /* than our target transfer size) we set a minimum cache size. */
>>
>>
>>  if (cache_rows < 6) /* is cache a 'useful' size? */
>>     cache_rows = (cache_rows > 0) ? 6 : 4;
>>    }
>>    if (cache_rows > 32767) /* keep within Oracle's limits  */
>>  cache_rows = 32767;
>>
>>
>>
>>
>>
>> so the above might be buggering you up if your packet size is no longer
>> 1460?
>>
>>
>>
>> This might be it.
>>
>>
>>
>> Cheers
>>
>> John Scoles
>> > Date: Fri, 9 Jul 2010 10:15:03 -0400
>> > Subject: Re: "sql*net message from client" hang
>> > From: dkele...@gmail.com
>> > To: sco...@pythian.com
>> > CC: martin.h...@oracle.com; dbi-users@perl.org
>> >
>> > Hello,
>> >
>> > Apologies, that was DBI 1.607, not .67!
>> >
>> > with RowCacheSize '0', it hangs on the same block, with '1', it seems to
>> > go through ok as with '-1'.
>> >
>> > As far as changes go, that is part of the mystery, haven't been able to
>> > find any changes around the time when this started to happen. Yes, we
>> are
>> > running DBI 1.607, DBD::Oracle 1.19, Oracle Client 10.2.0.1.0, all
>> pretty
>> > old, but I have confirmed that nothing changed in them anywhere, all the
>> > timestamps are years old, and the issue happens across all of our
>> > application servers. General hunch is that it is something with the
>> > network, but I have tried it on boxes with no firewall, where a
>> traceroute
>> > to the DB server has no hops at all, and the issue still occurs.
>> >
>> > Thanks,
>> > Dan
>> >
>> > On Thu, Jul 8, 2010 at 7:24 PM, John Scoles <sco...@pythian.com> wrote:
>> >
>> > > The $dbh->{RowCacheSize} = -1; is funny as you are telling the box to
>> only
>> > > use 1 byte for RowCach
>> > > Try it with '0' and see what happens?
>> > >
>> > >
>> > >
>> > > On Thu, Jul 8, 2010 at 7:20 PM, John Scoles <sco...@pythian.com>
>> wrote:
>> > >
>> > >> Hmm DBI 0.67 that is old and the fact you get the same error with the
>> > >> latest DBI and DBD::Oracle would lead me to think there may be
>> something
>> > >> wrong internally in your DB or its storage medium, corrupt block
>> perhaps or
>> > >> bad sector??
>> > >>
>> > >> The code differences between 1.61 and 0.67 are vast as would be the
>> > >> changes in DBD::Oracle.
>> > >>
>> > >> Did you change your Oracle client lately or anything else with the DB
>> or
>> > >> network?
>> > >>
>> > >> Would lke to know the version of DBD:Oracle you are using with DBI
>> 0.67
>> > >> and what version of the ORACLE client you where using and what
>> version of
>> > >> the DB you are trying to select from
>> > >>
>> > >> that will help narrow things down a bit
>> > >>
>> > >> cheers
>> > >> John Scoles
>> > >>
>> > >>
>> > >> On Thu, Jul 8, 2010 at 5:28 PM, Dan <dkele...@gmail.com> wrote:
>> > >>
>> > >>> Hello,
>> > >>> We do not have any quotas implemented on our Oracle instance. The
>> > >>> simple
>> > >>> select works fine for that hanging block on sqlplus, or if i only
>> execute
>> > >>> it
>> > >>> for that block in the perl script instead of iterating through them
>> all.
>> > >>> I
>> > >>> just tested running for that same block 185 times instead of all 185
>> > >>> blocks,
>> > >>> and it seems to hang on the second iteration (three times in a row
>> so
>> > >>> far)
>> > >>>
>> > >>> Thanks,
>> > >>> Dan
>> > >>>
>> > >>> On Thu, Jul 8, 2010 at 12:29 PM, Martin Hall <
>> martin.h...@oracle.com>
>> > >>> wrote:
>> > >>>
>> > >>> > there are various quotas in Oracle that could stop a query.
>> However,
>> > >>> they
>> > >>> > do tend to kill the guilty session rather than just hang like
>> that. If
>> > >>> your
>> > >>> > simple select ran multiple times for the same 10k block, would it
>> still
>> > >>> > hang?
>> > >>> >
>> > >>> > Cheers
>> > >>> >
>> > >>> > Martin
>> > >>> >
>> > >>> >
>> > >>> > On 08/07/2010 17:07, Howard, Chris wrote:
>> > >>> >
>> > >>> > If it always stops at the same place, it makes me think
>> > >>> > of a resource problem, something like a quota?
>> > >>> >
>> > >>> > I don't remember of there are select quotas.
>> > >>> >
>> > >>> > Does it do the same if run as sysdba or some other
>> > >>> > well-endowed database user?
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>> > -----Original Message-----
>> > >>> > From: Dan [mailto:dkele...@gmail.com <dkele...@gmail.com>]
>> > >>> > Sent: Thursday, July 08, 2010 10:02 AM
>> > >>> > To: dbi-users@perl.org
>> > >>> > Subject: "sql*net message from client" hang
>> > >>> >
>> > >>> > Hello,
>> > >>> > I have some code that has been running happily for many years, but
>> > >>> > suddenly started hanging on a simple select query to our Oracle
>> DB.
>> > >>> > The DB
>> > >>> > shows it as an inactive session with "SQL*NET message from
>> client", so
>> > >>> > the
>> > >>> > DB thinks it is waiting for the perl script, but the perl script
>> is
>> > >>> > hung
>> > >>> > reading from the DB according to truss. I boiled it down to a 20
>> line
>> > >>> > test
>> > >>> > script that prepares a query using DBI and then executes it ~190
>> times
>> > >>> > for
>> > >>> > different 10k blocks of the table. It seems to consistently hang
>> on
>> > >>> > the
>> > >>> > 97th iteration, and times out after a few hours hung in "SQL*NET
>> > >>> > message
>> > >>> > from client". Has anyone ever come across an issue like this? Even
>> > >>> > stranger,
>> > >>> > when I inserted "$dbh->{RowCacheSize} = -1;" , it seems to
>> complete
>> > >>> >
>> > >>> >
>> > >>> > and
>> > >>> >
>> > >>> >
>> > >>> > never hang... Our hunch is that this is some issue with the
>> network,
>> > >>> > but
>> > >>> > don't have the expertise to pinpoint what would cause this strange
>> > >>> > behaviour
>> > >>> >
>> > >>> > We are running old DBI .67, but i tried the same script with a
>> local
>> > >>> > version
>> > >>> > of the latest and greatest DBI and DBD::Oracle with the same
>> results.
>> > >>> > Here
>> > >>> > is a sample output from setting $dbh->trace(15); :
>> > >>> >
>> > >>> > 96 : 1793727
>> > >>> > -> execute for DBD::Oracle::st (DBI::st=HASH(0x40ad7c)~0x40ad4c
>> > >>> > '1793727') thr#12d600
>> > >>> > bind :p1 <== '1793727' (type 0)
>> > >>> > rebinding :p1 (not-utf8, ftype 1, csid 0, csform 0, inout 0)
>> > >>> > bind :p1 <== '1793727' (size 7/8/0, ptype 4, otype 1)
>> > >>> > bind :p1 <== '1793727' (size 7/7, otype 1, indp 0, at_exec 1)
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>> >
>> OCIBindByName(3960a0,404734,38f52c,":p1",3,3a8438,7,1,40474c,0,404744,0
>> > >>> >
>> > >>> >
>> > >>> > ,0,2)=SUCCESS
>> > >>> >
>> > >>> >
>> OCIBindDynamic(395e34,38f52c,404710,fefec350,404710,fefec644)=SUCCESS
>> > >>> > bind :p1 <== '1793727' (in, not-utf8, csid 31->0->31, ftype 1,
>> > >>> > csform
>> > >>> > 0->0, maxlen 7, maxdata_size 0)
>> > >>> > OCIAttrSet(395e34,OCI_HTYPE_BIND,ffbff290,0,31,38f52c)=SUCCESS
>> > >>> > dbd_st_execute SELECT (out0, lob0)...
>> > >>> > in ':p1' [0,0]: len 7, ind 0
>> > >>> > OCIStmtExecute(38f4b8,3960a0,38f52c,0,0,0,0,0)=SUCCESS
>> > >>> > OCIAttrGet(3960a0,OCI_HTYPE_STMT,ffbff40a,0,10,38f52c)=SUCCESS
>> > >>> > dbd_st_execute SELECT returned (SUCCESS, rpc0, fn4, out0)
>> > >>> > <- execute= '0E0' at simple.pl line 18
>> > >>> > 97 : 1793738
>> > >>> > -> execute for DBD::Oracle::st (DBI::st=HASH(0x40ad7c)~0x40ad4c
>> > >>> > '1793738') thr#12d600
>> > >>> > bind :p1 <== '1793738' (type 0)
>> > >>> > rebinding :p1 (not-utf8, ftype 1, csid 0, csform 0, inout 0)
>> > >>> > bind :p1 <== '1793738' (size 7/8/0, ptype 4, otype 1)
>> > >>> > bind :p1 <== '1793738' (size 7/7, otype 1, indp 0, at_exec 1)
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>> >
>> OCIBindByName(3960a0,404734,38f52c,":p1",3,3a8438,7,1,40474c,0,404744,0
>> > >>> >
>> > >>> >
>> > >>> > ,0,2)=SUCCESS
>> > >>> >
>> > >>> >
>> OCIBindDynamic(395e34,38f52c,404710,fefec350,404710,fefec644)=SUCCESS
>> > >>> > bind :p1 <== '1793738' (in, not-utf8, csid 31->0->31, ftype 1,
>> > >>> > csform
>> > >>> > 0->0, maxlen 7, maxdata_size 0)
>> > >>> > OCIAttrSet(395e34,OCI_HTYPE_BIND,ffbff290,0,31,38f52c)=SUCCESS
>> > >>> > dbd_st_execute SELECT (out0, lob0)...
>> > >>> > in ':p1' [0,0]: len 7, ind 0
>> > >>> > ...
>> > >>> > ### HANGS forever here :(
>> > >>> > ...
>> > >>> >
>> > >>> > Any help is much appreciated!
>> > >>> >
>> > >>> > Thanks,
>> > >>> > Dan
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>>
>> > >>
>> > >>
>> > > --
>> > > New! Learn why & how to love your data with Pythian's new webinar
>> series.
>> > > Topics, details & register: http://www.pythian.com/webinars
>> > >
>> > >
>>
>> _________________________________________________________________
>> Turn down-time into play-time with Messenger games
>> http://go.microsoft.com/?linkid=9734385
>
>
>

Re: "sql*net message from client" hang

Reply via email to