Some auxiliary info that I just tried, right now the query is of the form "select id from table where column = x order by foo". If I change this to "select id, '123' from table where column = x order by foo", then it completes every time... Does this provide any clue into what is going on? Just adding a numeric constant into the select fixes it?! Does not seem to hang up at all on any blocks when we do this.
Thanks, Dan On Fri, Jul 9, 2010 at 5:29 PM, Dan <dkele...@gmail.com> wrote: > Hello, > Thank you very much for your observations, this definitely seems to be > narrowing in on the problem. I have tried increasing (decreasing actually) > the RowCacheSize and it seems to start to flake out around -4425. At -4424 > it hangs up for 10 seconds or so on that block, but seems to complete every > time. At -4425 it hangs up on that block the first two times I ran it, but > then the next 5 times it only hung up there for ~10 seconds then finishes > up. Then I increased to -4430 and also had intermittent results. -4500 > fails every time on that same block (for 5 runs so far). > > Can you suggest anything I can look into on the network side or the OS > side as a possible workaround for this without modifying that parameter in > the code? The MTU on the boxes is currently 1500, not sure if that is > related in any way. > > > Thanks, > Dan > > > On Fri, Jul 9, 2010 at 3:30 PM, John Scoles <byter...@hotmail.com> wrote: > >> >> ok we are getting some place now >> >> >> >> in DBD::Oracle 1.19 the way it set the rowcache is much different that >> later versions >> >> >> >> in this case -1 would mean that the following is applied >> >> >> >> cache_mem = (cache_mem_iv <= 0) ? 10 * 1460 : cache_mem_iv; >> >> >> >> >> >> so you cache_mem is now set to 10*1460 bytes for any describes that you >> do (selects) >> >> >> >> later on this will be changed when you do a fetch 100,000 rows as your >> size is -1. >> >> >> >> So you might want to try a larger - number for the value of RowCacheSize >> to see at what point the querry dies again >> >> >> >> so far as being a bug I am on the fence right now. >> >> >> >> The ban-width of you network may have changed or the size of the bit >> stream (16 vs 32 bit bus) so the default rowcachesize may no longer be large >> engough >> >> >> >> The value of rowcachesize when there is no value given for rowcache size >> is calulated like this >> >> >> >> (from the code) >> >> >> >> /* Oracle packets on ethernet have max size of around 1460. */ >> /* We'll aim to fill our row cache with around 10 per go. */ >> /* Using 10 means any 'runt' packets will have less impact. */ >> >> >> int txfr_size = 10 * 1460; /* desired transfer/cache size */ >> >> >> >> /* Use guessed average on-the-wire row width calculated above & */ >> /* add in overhead of 5 bytes per field plus 8 bytes per row. */ >> /* The n*5+8 was determined by studying SQL*Net v2 packets. */ >> /* It could probably benefit from a more detailed analysis. */ >> >> >> est_width += num_fields*5 + 8; >> >> >> >> cache_rows = txfr_size / est_width; /* (maybe 1 or 0) */ >> >> >> >> /* To ensure good performance with large rows (near or larger */ >> /* than our target transfer size) we set a minimum cache size. */ >> >> >> if (cache_rows < 6) /* is cache a 'useful' size? */ >> cache_rows = (cache_rows > 0) ? 6 : 4; >> } >> if (cache_rows > 32767) /* keep within Oracle's limits */ >> cache_rows = 32767; >> >> >> >> >> >> so the above might be buggering you up if your packet size is no longer >> 1460? >> >> >> >> This might be it. >> >> >> >> Cheers >> >> John Scoles >> > Date: Fri, 9 Jul 2010 10:15:03 -0400 >> > Subject: Re: "sql*net message from client" hang >> > From: dkele...@gmail.com >> > To: sco...@pythian.com >> > CC: martin.h...@oracle.com; dbi-users@perl.org >> > >> > Hello, >> > >> > Apologies, that was DBI 1.607, not .67! >> > >> > with RowCacheSize '0', it hangs on the same block, with '1', it seems to >> > go through ok as with '-1'. >> > >> > As far as changes go, that is part of the mystery, haven't been able to >> > find any changes around the time when this started to happen. Yes, we >> are >> > running DBI 1.607, DBD::Oracle 1.19, Oracle Client 10.2.0.1.0, all >> pretty >> > old, but I have confirmed that nothing changed in them anywhere, all the >> > timestamps are years old, and the issue happens across all of our >> > application servers. General hunch is that it is something with the >> > network, but I have tried it on boxes with no firewall, where a >> traceroute >> > to the DB server has no hops at all, and the issue still occurs. >> > >> > Thanks, >> > Dan >> > >> > On Thu, Jul 8, 2010 at 7:24 PM, John Scoles <sco...@pythian.com> wrote: >> > >> > > The $dbh->{RowCacheSize} = -1; is funny as you are telling the box to >> only >> > > use 1 byte for RowCach >> > > Try it with '0' and see what happens? >> > > >> > > >> > > >> > > On Thu, Jul 8, 2010 at 7:20 PM, John Scoles <sco...@pythian.com> >> wrote: >> > > >> > >> Hmm DBI 0.67 that is old and the fact you get the same error with the >> > >> latest DBI and DBD::Oracle would lead me to think there may be >> something >> > >> wrong internally in your DB or its storage medium, corrupt block >> perhaps or >> > >> bad sector?? >> > >> >> > >> The code differences between 1.61 and 0.67 are vast as would be the >> > >> changes in DBD::Oracle. >> > >> >> > >> Did you change your Oracle client lately or anything else with the DB >> or >> > >> network? >> > >> >> > >> Would lke to know the version of DBD:Oracle you are using with DBI >> 0.67 >> > >> and what version of the ORACLE client you where using and what >> version of >> > >> the DB you are trying to select from >> > >> >> > >> that will help narrow things down a bit >> > >> >> > >> cheers >> > >> John Scoles >> > >> >> > >> >> > >> On Thu, Jul 8, 2010 at 5:28 PM, Dan <dkele...@gmail.com> wrote: >> > >> >> > >>> Hello, >> > >>> We do not have any quotas implemented on our Oracle instance. The >> > >>> simple >> > >>> select works fine for that hanging block on sqlplus, or if i only >> execute >> > >>> it >> > >>> for that block in the perl script instead of iterating through them >> all. >> > >>> I >> > >>> just tested running for that same block 185 times instead of all 185 >> > >>> blocks, >> > >>> and it seems to hang on the second iteration (three times in a row >> so >> > >>> far) >> > >>> >> > >>> Thanks, >> > >>> Dan >> > >>> >> > >>> On Thu, Jul 8, 2010 at 12:29 PM, Martin Hall < >> martin.h...@oracle.com> >> > >>> wrote: >> > >>> >> > >>> > there are various quotas in Oracle that could stop a query. >> However, >> > >>> they >> > >>> > do tend to kill the guilty session rather than just hang like >> that. If >> > >>> your >> > >>> > simple select ran multiple times for the same 10k block, would it >> still >> > >>> > hang? >> > >>> > >> > >>> > Cheers >> > >>> > >> > >>> > Martin >> > >>> > >> > >>> > >> > >>> > On 08/07/2010 17:07, Howard, Chris wrote: >> > >>> > >> > >>> > If it always stops at the same place, it makes me think >> > >>> > of a resource problem, something like a quota? >> > >>> > >> > >>> > I don't remember of there are select quotas. >> > >>> > >> > >>> > Does it do the same if run as sysdba or some other >> > >>> > well-endowed database user? >> > >>> > >> > >>> > >> > >>> > >> > >>> > >> > >>> > >> > >>> > -----Original Message----- >> > >>> > From: Dan [mailto:dkele...@gmail.com <dkele...@gmail.com>] >> > >>> > Sent: Thursday, July 08, 2010 10:02 AM >> > >>> > To: dbi-users@perl.org >> > >>> > Subject: "sql*net message from client" hang >> > >>> > >> > >>> > Hello, >> > >>> > I have some code that has been running happily for many years, but >> > >>> > suddenly started hanging on a simple select query to our Oracle >> DB. >> > >>> > The DB >> > >>> > shows it as an inactive session with "SQL*NET message from >> client", so >> > >>> > the >> > >>> > DB thinks it is waiting for the perl script, but the perl script >> is >> > >>> > hung >> > >>> > reading from the DB according to truss. I boiled it down to a 20 >> line >> > >>> > test >> > >>> > script that prepares a query using DBI and then executes it ~190 >> times >> > >>> > for >> > >>> > different 10k blocks of the table. It seems to consistently hang >> on >> > >>> > the >> > >>> > 97th iteration, and times out after a few hours hung in "SQL*NET >> > >>> > message >> > >>> > from client". Has anyone ever come across an issue like this? Even >> > >>> > stranger, >> > >>> > when I inserted "$dbh->{RowCacheSize} = -1;" , it seems to >> complete >> > >>> > >> > >>> > >> > >>> > and >> > >>> > >> > >>> > >> > >>> > never hang... Our hunch is that this is some issue with the >> network, >> > >>> > but >> > >>> > don't have the expertise to pinpoint what would cause this strange >> > >>> > behaviour >> > >>> > >> > >>> > We are running old DBI .67, but i tried the same script with a >> local >> > >>> > version >> > >>> > of the latest and greatest DBI and DBD::Oracle with the same >> results. >> > >>> > Here >> > >>> > is a sample output from setting $dbh->trace(15); : >> > >>> > >> > >>> > 96 : 1793727 >> > >>> > -> execute for DBD::Oracle::st (DBI::st=HASH(0x40ad7c)~0x40ad4c >> > >>> > '1793727') thr#12d600 >> > >>> > bind :p1 <== '1793727' (type 0) >> > >>> > rebinding :p1 (not-utf8, ftype 1, csid 0, csform 0, inout 0) >> > >>> > bind :p1 <== '1793727' (size 7/8/0, ptype 4, otype 1) >> > >>> > bind :p1 <== '1793727' (size 7/7, otype 1, indp 0, at_exec 1) >> > >>> > >> > >>> > >> > >>> > >> > >>> > >> > >>> > >> OCIBindByName(3960a0,404734,38f52c,":p1",3,3a8438,7,1,40474c,0,404744,0 >> > >>> > >> > >>> > >> > >>> > ,0,2)=SUCCESS >> > >>> > >> > >>> > >> OCIBindDynamic(395e34,38f52c,404710,fefec350,404710,fefec644)=SUCCESS >> > >>> > bind :p1 <== '1793727' (in, not-utf8, csid 31->0->31, ftype 1, >> > >>> > csform >> > >>> > 0->0, maxlen 7, maxdata_size 0) >> > >>> > OCIAttrSet(395e34,OCI_HTYPE_BIND,ffbff290,0,31,38f52c)=SUCCESS >> > >>> > dbd_st_execute SELECT (out0, lob0)... >> > >>> > in ':p1' [0,0]: len 7, ind 0 >> > >>> > OCIStmtExecute(38f4b8,3960a0,38f52c,0,0,0,0,0)=SUCCESS >> > >>> > OCIAttrGet(3960a0,OCI_HTYPE_STMT,ffbff40a,0,10,38f52c)=SUCCESS >> > >>> > dbd_st_execute SELECT returned (SUCCESS, rpc0, fn4, out0) >> > >>> > <- execute= '0E0' at simple.pl line 18 >> > >>> > 97 : 1793738 >> > >>> > -> execute for DBD::Oracle::st (DBI::st=HASH(0x40ad7c)~0x40ad4c >> > >>> > '1793738') thr#12d600 >> > >>> > bind :p1 <== '1793738' (type 0) >> > >>> > rebinding :p1 (not-utf8, ftype 1, csid 0, csform 0, inout 0) >> > >>> > bind :p1 <== '1793738' (size 7/8/0, ptype 4, otype 1) >> > >>> > bind :p1 <== '1793738' (size 7/7, otype 1, indp 0, at_exec 1) >> > >>> > >> > >>> > >> > >>> > >> > >>> > >> > >>> > >> OCIBindByName(3960a0,404734,38f52c,":p1",3,3a8438,7,1,40474c,0,404744,0 >> > >>> > >> > >>> > >> > >>> > ,0,2)=SUCCESS >> > >>> > >> > >>> > >> OCIBindDynamic(395e34,38f52c,404710,fefec350,404710,fefec644)=SUCCESS >> > >>> > bind :p1 <== '1793738' (in, not-utf8, csid 31->0->31, ftype 1, >> > >>> > csform >> > >>> > 0->0, maxlen 7, maxdata_size 0) >> > >>> > OCIAttrSet(395e34,OCI_HTYPE_BIND,ffbff290,0,31,38f52c)=SUCCESS >> > >>> > dbd_st_execute SELECT (out0, lob0)... >> > >>> > in ':p1' [0,0]: len 7, ind 0 >> > >>> > ... >> > >>> > ### HANGS forever here :( >> > >>> > ... >> > >>> > >> > >>> > Any help is much appreciated! >> > >>> > >> > >>> > Thanks, >> > >>> > Dan >> > >>> > >> > >>> > >> > >>> > >> > >>> > >> > >>> >> > >> >> > >> >> > > -- >> > > New! Learn why & how to love your data with Pythian's new webinar >> series. >> > > Topics, details & register: http://www.pythian.com/webinars >> > > >> > > >> >> _________________________________________________________________ >> Turn down-time into play-time with Messenger games >> http://go.microsoft.com/?linkid=9734385 > > >