Thanks for the response!
Yes, in fact, we are almost always submitting multiple jobs one right
after the other, so it's likely that the jobs submitted first are
locking a table for writes and that's causing the problem. Though this
problem I am seeing did not happen with 3.0.3.
And I've been putting off having to migrate, but you'd advice about
postgres or innodb is well taken. We'll probably go that route after
our current month's full run is complete.
If we wind up having to stick with this for longer, I'll try to gather
the mysql processlist info.
thanks again,
Stephen
On 04/23/2010 12:23 PM, Eric Bollengier wrote:
> Hello,
>
> At this point, a mysqladmin processlist is much more useful than
> a backtrace. Anyway, it doesn't look like to be a Bacula problem, but a MySQL
> MyISAM limitation.
>
> It's a bit hard to read the backtrace to analyse database contention, but it
> looks like the Accurate query uses the Job table (for reading) and an other
> Job is trying to update his job record at the same time. For that, this sql
> session needs an exclusive lock, and should wait that the Accurate query ends.
> The console status thread is waiting for the job who wants to update the job
> table finish. The console command should not block if no other job are trying
> to update their status.
>
> Depending on how MySQL manages the lock queue, I'm not an expert, it should
> permits to have two read session at the same time, but I'm not sure it will
> allow the second one if a writer wants to lock the table. (It depends if you
> want to speed up reads or writes). So a dedicated connection won't help in
> this configuration.
>
> Switching to something else than MyISAM should resolve the problem. If you
> choose PostgreSQL, you will be able to be 10 time faster in the Accurate
> query, and Jobs won't be blocked to update their record.
>
> If you can confirm this with a mysqladmin processlist, I will update the
> manual section on how to choose the SQL backend.
>
> The next version of MySQL uses Innodb as default, IMHO, I think it's time to
> drop MyISAM, specially if you have 1.4 billion rows in the File table, the
> recovery time after a problem will block your production during hours, and you
> have absolutely no guarantee to get your data back in a consistent state...
>
> Bye
>
> Le vendredi 23 avril 2010 20:16:31, Stephen Thompson a écrit :
>> Hello,
>>
>> As I mentioned, I am having the same problem the original poster
>> mentioned even though he later recanted.
>>
>> I run all my jobs using the Accurate mode and on particularly large
>> (possibly number of files rather than size) jobs, the Status of the
>> Director will block bconsole until the database activity to start the
>> Accurate job completes.
>>
>> Output of bconsole:
>> ---------------------------------------------------------------------------
>> ------------- *s
>> Status available for:
>> 1: Director
>> 2: Storage
>> 3: Client
>> 4: All
>> Select daemon type for status (1-4): 1
>> lawson-dir Version: 5.0.1 (24 February 2010) x86_64-unknown-linux-gnu
>> redhat
>> Daemon started 23-Apr-10 10:54, 0 Jobs run since started.
>> Heap: heap=1,200,128 smbytes=954,678 max_bytes=954,934 bufs=6,590
>> max_bufs=6,611
>>
>> Scheduled Jobs:
>> Level Type Pri Scheduled Name Volume
>> ===========================================================================
>> ========
>>
>> ---------------------------------------------------------------------------
>> --------------- (i.e. nothing more, no prompt, until database queries
>> finish for
>> Accurate job)
>>
>>
>> Here is a traceback run seconds after the above bconsole state:
>>
>> [Thread debugging using libthread_db enabled]
>> [New Thread 0x2ad78108c280 (LWP 18020)]
>> [New Thread 0x4b1ec940 (LWP 20980)]
>> [New Thread 0x4a7eb940 (LWP 20977)]
>> [New Thread 0x493e9940 (LWP 20971)]
>> [New Thread 0x47fe7940 (LWP 20967)]
>> [New Thread 0x46be5940 (LWP 20963)]
>> [New Thread 0x457e3940 (LWP 20961)]
>> [New Thread 0x44de2940 (LWP 20957)]
>> [New Thread 0x439e0940 (LWP 20953)]
>> [New Thread 0x4186f940 (LWP 18137)]
>> [New Thread 0x42fdf940 (LWP 18026)]
>> [New Thread 0x425de940 (LWP 18025)]
>> 0x000000374b80dfe1 in nanosleep () from /lib64/libpthread.so.0
>> $1 = "lawson-dir", '\0'<repeats 19 times>
>> $2 = 0x1f5e9528 "bacula-dir"
>> $3 = 0x1f5e9568 "/opt/bacula/bin/"
>> $4 = 0x1f6e6148 "MySQL"
>> $5 = 0x2ad780c5f76e "5.0.1 (24 February 2010)"
>> $6 = 0x2ad780c5f787 "x86_64-unknown-linux-gnu"
>> $7 = 0x2ad780c5f7a0 "redhat"
>> $8 = 0x2ad780c5f465 ""
>> $9 = "lawson.geo.berkeley.edu", '\0'<repeats 26 times>
>> #0 0x000000374b80dfe1 in nanosleep () from /lib64/libpthread.so.0
>> #1 0x00002ad780c30b8b in bmicrosleep (sec=60, usec=0) at bsys.c:61
>> #2 0x000000000042cd7a in wait_for_next_job (
>> one_shot_job_to_run=<value optimized out>) at scheduler.c:131
>> #3 0x000000000040d70c in main (argc=0, argv=0x7fff39454778) at dird.c:338
>>
>> Thread 12 (Thread 0x425de940 (LWP 18025)):
>> #0 0x000000374b0cced2 in select () from /lib64/libc.so.6
>> #1 0x00002ad780c32fd7 in bnet_thread_server (addrs=0x1f5eb3a8,
>> max_clients=20, client_wq=0x66f8e0,
>> handle_client_request=0x444a90<handle_UA_client_request>)
>> at bnet_server.c:161
>> #2 0x0000000000444a81 in connect_thread (arg=0x1f5eb3a8) at ua_server.c:82
>> #3 0x000000374b806617 in start_thread () from /lib64/libpthread.so.0
>> #4 0x000000374b0d3c2d in clone () from /lib64/libc.so.6
>>
>> Thread 11 (Thread 0x42fdf940 (LWP 18026)):
>> #0 0x000000374b80af70 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>> from /lib64/libpthread.so.0
>> #1 0x00002ad780c57282 in watchdog_thread (arg=<value optimized out>)
>> at watchdog.c:308
>> #2 0x000000374b806617 in start_thread () from /lib64/libpthread.so.0
>> #3 0x000000374b0d3c2d in clone () from /lib64/libc.so.6
>>
>> Thread 10 (Thread 0x4186f940 (LWP 18137)):
>> #0 0x000000374b80ad09 in pthread_cond_wait@@GLIBC_2.3.2 ()
>> from /lib64/libpthread.so.0
>> #1 0x00002ad780c4c956 in rwl_writelock (rwl=0x1f6eb3d8) at rwlock.c:242
>> #2 0x00002ad7805fc837 in _db_lock (file=0x2ad78060e084 "sql_get.c",
>> line=128,
>> mdb=0x15) at sql.c:385
>> #3 0x00002ad780603192 in db_get_pool_record (jcr=0x1f6ffec8, mdb=0x15,
>> pdbr=0x4186e9c0) at sql_get.c:607
>> #4 0x0000000000437293 in complete_jcr_for_job (jcr=0x1f6ffec8,
>> job=0x1f5ff728, pool=0x1f68ad58) at ua_output.c:725
>> #5 0x0000000000446277 in list_scheduled_jobs (ua=0x1f701a18)
>> at ua_status.c:464
>> #6 0x0000000000446451 in do_director_status (ua=0x1f701a18) at
>> ua_status.c:337
>> #7 0x00000000004467fd in status_cmd (ua=0x1f701a18, cmd=<value
>> optimized out>)
>> at ua_status.c:187
>> #8 0x000000000042f86d in do_a_command (ua=0x1f701a18) at ua_cmds.c:240
>> #9 0x0000000000444b5a in handle_UA_client_request (arg=<value optimized
>> out>)
>> at ua_server.c:146
>> #10 0x00002ad780c57ace in workq_server (arg=<value optimized out>)
>> at workq.c:346
>> #11 0x000000374b806617 in start_thread () from /lib64/libpthread.so.0
>> #12 0x000000374b0d3c2d in clone () from /lib64/libc.so.6
>>
>> Thread 9 (Thread 0x439e0940 (LWP 20953)):
>> #0 0x000000374b80dfe1 in nanosleep () from /lib64/libpthread.so.0
>> #1 0x00002ad780c30b8b in bmicrosleep (sec=2, usec=0) at bsys.c:61
>> #2 0x0000000000422fc9 in jobq_server (arg=<value optimized out>) at
>> jobq.c:595
>> #3 0x000000374b806617 in start_thread () from /lib64/libpthread.so.0
>> #4 0x000000374b0d3c2d in clone () from /lib64/libc.so.6
>>
>> Thread 8 (Thread 0x44de2940 (LWP 20957)):
>> #0 0x000000374b80d73b in read () from /lib64/libpthread.so.0
>> #1 0x0000003c54651bb8 in vio_read ()
>> from /usr/lib64/mysql/libmysqlclient_r.so.15
>> #2 0x0000003c54651c13 in vio_read_buff ()
>> from /usr/lib64/mysql/libmysqlclient_r.so.15
>> #3 0x0000003c54652d98 in ?? () from
>> /usr/lib64/mysql/libmysqlclient_r.so.15 #4 0x0000003c54653199 in
>> my_net_read ()
>> from /usr/lib64/mysql/libmysqlclient_r.so.15
>> #5 0x0000003c5464cbaf in cli_safe_read ()
>> from /usr/lib64/mysql/libmysqlclient_r.so.15
>> #6 0x0000003c5464dc19 in ?? () from
>> /usr/lib64/mysql/libmysqlclient_r.so.15 #7 0x0000003c5464c20e in
>> mysql_real_query ()
>> from /usr/lib64/mysql/libmysqlclient_r.so.15
>> #8 0x00002ad7805fa33c in db_sql_query (mdb=0x1f71b958,
>> query=0x1f7232f0 "SELECT Path.Path, Filename.Name, Temp.FileIndex,
>> Temp.JobI
>> d, LStat, MD5 FROM ( SELECT FileId, Job.JobId AS JobId, FileIndex,
>> File.PathId A
>> S PathId, File.FilenameId AS FilenameId, LStat, MD5 FROM Job, "...,
>> result_handler=0x40fe10<accurate_list_handler>, ctx=0x1f7061b8)
>> at mysql.c:337
>> #9 0x00002ad780604440 in db_get_file_list (jcr=<value optimized out>,
>> mdb=0x1f71b958, jobids=0x1f71b1a0 "101430,101929,102786",
>> result_handler=0x40fe10<accurate_list_handler>, ctx=0x1f7061b8)
>> at sql_get.c:1132
>> #10 0x00000000004104d5 in send_accurate_current_files (jcr=0x1f7061b8)
>> at backup.c:298
>> #11 0x00000000004114ed in do_backup (jcr=0x1f7061b8) at backup.c:445
>> #12 0x000000000042162e in job_thread (arg=<value optimized out>) at
>> job.c:314
>> #13 0x00000000004227d4 in jobq_server (arg=<value optimized out>) at
>> jobq.c:450
>> #14 0x000000374b806617 in start_thread () from /lib64/libpthread.so.0
>> #15 0x000000374b0d3c2d in clone () from /lib64/libc.so.6
>>
>> Thread 7 (Thread 0x457e3940 (LWP 20961)):
>> #0 0x000000374b80d73b in read () from /lib64/libpthread.so.0
>> #1 0x00002ad780c321a6 in read_nbytes (bsock=0x1f70ec28, ptr=0x457e2d84
>> "8",
>> nbytes=4) at bnet.c:80
>> #2 0x00002ad780c35c1f in BSOCK::recv (this=0x1f70ec28) at bsock.c:451
>> #3 0x000000000041c321 in bget_dirmsg (bs=0x1f70ec28) at getmsg.c:137
>> #4 0x000000000042782e in msg_thread (arg=0x1f7061b8) at msgchan.c:390
>> #5 0x000000374b806617 in start_thread () from /lib64/libpthread.so.0
>> #6 0x000000374b0d3c2d in clone () from /lib64/libc.so.6
>>
>> Thread 6 (Thread 0x46be5940 (LWP 20963)):
>> #0 0x000000374b80ad09 in pthread_cond_wait@@GLIBC_2.3.2 ()
>> from /lib64/libpthread.so.0
>> #1 0x00002ad780c4c956 in rwl_writelock (rwl=0x1f6eb3d8) at rwlock.c:242
>> #2 0x00002ad7805fc837 in _db_lock (file=0x2ad780611c1f "sql_update.c",
>> line=128, mdb=0x14) at sql.c:385
>> #3 0x00002ad780606094 in db_update_job_start_record (jcr=0x1f719328,
>> mdb=0x1f6eb3c8, jr=0x1f719768) at sql_update.c:112
>> #4 0x000000000042130c in job_thread (arg=<value optimized out>) at
>> job.c:287
>> #5 0x00000000004227d4 in jobq_server (arg=<value optimized out>) at
>> jobq.c:450
>> #6 0x000000374b806617 in start_thread () from /lib64/libpthread.so.0
>> #7 0x000000374b0d3c2d in clone () from /lib64/libc.so.6
>>
>> Thread 5 (Thread 0x47fe7940 (LWP 20967)):
>> #0 0x000000374b80dfe1 in nanosleep () from /lib64/libpthread.so.0
>> #1 0x00002ad780c30b8b in bmicrosleep (sec=2, usec=0) at bsys.c:61
>> #2 0x0000000000422fc9 in jobq_server (arg=<value optimized out>) at
>> jobq.c:595
>> #3 0x000000374b806617 in start_thread () from /lib64/libpthread.so.0
>> #4 0x000000374b0d3c2d in clone () from /lib64/libc.so.6
>>
>> Thread 4 (Thread 0x493e9940 (LWP 20971)):
>> #0 0x000000374b80dfe1 in nanosleep () from /lib64/libpthread.so.0
>> #1 0x00002ad780c30b8b in bmicrosleep (sec=2, usec=0) at bsys.c:61
>> #2 0x0000000000422fc9 in jobq_server (arg=<value optimized out>) at
>> jobq.c:595
>> #3 0x000000374b806617 in start_thread () from /lib64/libpthread.so.0
>> #4 0x000000374b0d3c2d in clone () from /lib64/libc.so.6
>>
>> Thread 3 (Thread 0x4a7eb940 (LWP 20977)):
>> #0 0x000000374b80d73b in read () from /lib64/libpthread.so.0
>> #1 0x0000003c54651bb8 in vio_read ()
>> from /usr/lib64/mysql/libmysqlclient_r.so.15
>> #2 0x0000003c54651c13 in vio_read_buff ()
>> from /usr/lib64/mysql/libmysqlclient_r.so.15
>> #3 0x0000003c54652d98 in ?? () from
>> /usr/lib64/mysql/libmysqlclient_r.so.15 #4 0x0000003c54653199 in
>> my_net_read ()
>> from /usr/lib64/mysql/libmysqlclient_r.so.15
>> #5 0x0000003c5464cbaf in cli_safe_read ()
>> from /usr/lib64/mysql/libmysqlclient_r.so.15
>> #6 0x0000003c5464dc19 in ?? () from
>> /usr/lib64/mysql/libmysqlclient_r.so.15 #7 0x0000003c5464c20e in
>> mysql_real_query ()
>> from /usr/lib64/mysql/libmysqlclient_r.so.15
>> #8 0x00002ad7805fbda0 in UpdateDB (file=0x2ad780611c1f "sql_update.c",
>> line=123, jcr=0x4000, mdb=0xffffffffffffffff,
>> cmd=0x3232323330313d64<Address 0x3232323330313d64 out of bounds>)
>> at sql.c:298
>> #9 0x00002ad780606164 in db_update_job_start_record (jcr=0x1f714728,
>> mdb=0x1f6eb3c8, jr=0x1f714b68) at sql_update.c:123
>> #10 0x0000000000411211 in do_backup (jcr=0x1f714728) at backup.c:333
>> #11 0x000000000042162e in job_thread (arg=<value optimized out>) at
>> job.c:314
>> #12 0x00000000004227d4 in jobq_server (arg=<value optimized out>) at
>> jobq.c:450
>> #13 0x000000374b806617 in start_thread () from /lib64/libpthread.so.0
>> #14 0x000000374b0d3c2d in clone () from /lib64/libc.so.6
>>
>> Thread 2 (Thread 0x4b1ec940 (LWP 20980)):
>> #0 0x000000374b80ad09 in pthread_cond_wait@@GLIBC_2.3.2 ()
>> from /lib64/libpthread.so.0
>> #1 0x00002ad780c4c956 in rwl_writelock (rwl=0x1f6eb3d8) at rwlock.c:242
>> #2 0x00002ad7805fc837 in _db_lock (file=0x2ad78060bfc0 "sql_create.c",
>> line=128, mdb=0x13) at sql.c:385
>> #3 0x00002ad7805fd97d in db_create_client_record (jcr=0x1f725068,
>> mdb=0x13,
>> cr=0x4b1eba20) at sql_create.c:485
>> #4 0x000000000041f19f in get_or_create_client_record (jcr=0x1f725068)
>> at job.c:867
>> #5 0x00000000004218ce in setup_job (jcr=0x1f725068) at job.c:162
>> #6 0x0000000000421d1b in run_job (jcr=0x1f6eb434) at job.c:89
>> #7 0x0000000000441580 in run_cmd (ua=0x1f7237a8, cmd=<value optimized
>> out>) at ua_run.c:180
>> #8 0x000000000042f86d in do_a_command (ua=0x1f7237a8) at ua_cmds.c:240
>> #9 0x0000000000444b5a in handle_UA_client_request (arg=<value optimized
>> out>)
>> at ua_server.c:146
>> #10 0x00002ad780c57ace in workq_server (arg=<value optimized out>)
>> at workq.c:346
>> #11 0x000000374b806617 in start_thread () from /lib64/libpthread.so.0
>> #12 0x000000374b0d3c2d in clone () from /lib64/libc.so.6
>>
>> Thread 1 (Thread 0x2ad78108c280 (LWP 18020)):
>> #0 0x000000374b80dfe1 in nanosleep () from /lib64/libpthread.so.0
>> #1 0x00002ad780c30b8b in bmicrosleep (sec=60, usec=0) at bsys.c:61
>> #2 0x000000000042cd7a in wait_for_next_job (
>> one_shot_job_to_run=<value optimized out>) at scheduler.c:131
>> #3 0x000000000040d70c in main (argc=0, argv=0x7fff39454778) at dird.c:338
>> #0 0x000000374b80dfe1 in nanosleep () from /lib64/libpthread.so.0
>> No symbol table info available.
>> #1 0x00002ad780c30b8b in bmicrosleep (sec=60, usec=0) at bsys.c:61
>> 61 stat = nanosleep(&timeout, NULL);
>> Current language: auto; currently c++
>> timeout = {tv_sec = 60, tv_nsec = 0}
>> tv = {tv_sec = 140734154229024, tv_usec = 1272046151}
>> tz = {tz_minuteswest = 0, tz_dsttime = 110}
>> stat =<value optimized out>
>> #2 0x000000000042cd7a in wait_for_next_job (
>> one_shot_job_to_run=<value optimized out>) at scheduler.c:131
>> 131 bmicrosleep(next_check_secs, 0); /* recheck once per minute
>> */ jcr =<value optimized out>
>> job = (JOB *) 0x0
>> run =<value optimized out>
>> now =<value optimized out>
>> next_job =<value optimized out>
>> first = false
>> #3 0x000000000040d70c in main (argc=0, argv=0x7fff39454778) at dird.c:338
>> 338 while ( (jcr = wait_for_next_job(runjob)) ) {
>> ch =<value optimized out>
>> jcr = (JCR *) 0x4
>> no_signals = false
>> test_config = false
>> uid = 0x0
>> gid = 0x0
>> mode =<value optimized out>
>> #0 0x0000000000000000 in ?? ()
>> No symbol table info available.
>> #0 0x0000000000000000 in ?? ()
>> No symbol table info available.
>> #0 0x0000000000000000 in ?? ()
>> No symbol table info available.
>> #0 0x0000000000000000 in ?? ()
>> No symbol table info available.
>>
>>
>>
>> Can anyone shed some light on why my console is blocking at this point?
>> Maybe I'm misunderstanding the responses to the original post, but it
>> sounds like the Directory shouldn't be blocking.
>>
>> Note that additional consoles spawned can connect to the Director at
>> this point, but they too 'hang' when a Status of Director is requested.
>>
>> thanks!
>> Stephen
>>
>> On 04/23/2010 02:32 AM, Craig Ringer wrote:
>>> On 23/04/10 16:52, Graham Keeling wrote:
>>>> On Fri, Apr 23, 2010 at 10:31:49AM +0200, Eric Bollengier wrote:
>>>>> For normal Accurate backup, it uses dedicated connection since the
>>>>> beginning, so it's shouldn't block commands too.
>>>>
>>>> I do not think this works because for me, the 'status director' command
>>>> blocks until the accurate query is complete.
>>>
>>> It'd be rather handy to see the output of a:
>>> thread apply all bt
>>>
>>> command from gdb attached to the director while it's blocked trying to
>>> provide status information to a console.
>>>
>>> ( If your Bacula was installed from distro packages you may need to
>>> install additional debug info packages to get any useful output, though.
>>> )
>>>
>>> --
>>> Craig Ringer
>>>
>>> -------------------------------------------------------------------------
>>> ----- _______________________________________________
>>> Bacula-devel mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/bacula-devel
--
Stephen Thompson Berkeley Seismological Laboratory
[email protected] 215 McCone Hall # 4760
404.538.7077 (phone) University of California, Berkeley
510.643.5811 (fax) Berkeley, CA 94720-4760
------------------------------------------------------------------------------
_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel