On Nov 27, 2012, at 12:03 PM, Peter Cock wrote:

> On Tue, Nov 27, 2012 at 4:50 PM, Nate Coraor <n...@bx.psu.edu> wrote:
>> On Nov 20, 2012, at 8:15 AM, Peter Cock wrote:
>>> 
>>> Is anyone else seeing this? I am wary of applying the update to our
>>> production Galaxy until I know how to resolve this (other than just
>>> disabling task splitting).
>> 
>> Hi Peter,
>> 
>> These look like two issues - in one, you've got task(s) in the database
>> that do not have an external runner ID set, causing the drmaa runner
>> to attempt to check the status of "None", resulting in the segfault.
> 
> So a little defensive coding could prevent the segfault then (leaving
> the separate issue of why the jobs lack this information)?

Indeed, I pushed a check for this in 4a95ae9a26d9.
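Roughly, the check amounts to catching a missing external runner ID
before anything is handed to the DRMAA library, something like the
following (simplified, using the same names as the diff further down;
whether to fail the task outright or just skip it is a judgement call):

    for drm_job_state in self.watched:
        job_id = drm_job_state.job_id
        galaxy_job_id = drm_job_state.job_wrapper.job_id
        if job_id is None:
            # No external ID was ever recorded for this task, so there
            # is nothing to poll; fail it rather than letting
            # self.ds.jobStatus( None ) crash inside the DRMAA library.
            log.error( "(%s) task has no external ID, marking it failed" % galaxy_job_id )
            self.work_queue.put( ( 'fail', drm_job_state ) )
            continue
        state = self.ds.jobStatus( job_id )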

>> If you update the state of these tasks to something terminal, that
>> should fix the issue with them.
> 
> You mean manually in the database? Restarting Galaxy seemed
> to achieve that in a round-about way.
> 
>> Of course, if the same thing happens with new jobs, then there's
>> another issue.
> 
> This was a week ago, but yes, at the time it was reproducible
> with new jobs.

Is that to say it's still happening and you've simply worked around it (by 
disabling tasks), or that it is no longer happening?
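
As for doing the update by hand, it's just a state change on the task
table for the affected rows. Something along these lines (an untested
sketch - the table, column, and state names are from memory of the
model, so check them against your schema, stop Galaxy, and back up the
database before running anything like it):

    from sqlalchemy import create_engine, text

    # Point this at your database_connection from universe_wsgi.ini
    engine = create_engine( "postgresql://galaxy@localhost/galaxy" )
    conn = engine.connect()
    trans = conn.begin()
    # Mark any non-terminal task that never got an external runner ID
    # as failed; 'ok', 'error' and 'deleted' are the terminal states.
    conn.execute( text(
        "UPDATE task SET state = 'error' "
        "WHERE state NOT IN ( 'ok', 'error', 'deleted' ) "
        "AND task_runner_external_id IS NULL"
    ) )
    trans.commit()
    conn.close()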

>> I'm trying to reproduce the working directory behavior but have
>> been unsuccessful.  Do you have any local modifications to the
>> splitting or jobs code?
> 
> This was running on my tools branch, which shouldn't change
> Galaxy itself in any meaningful way (a few local variables did get
> accidentally checked into my run.sh file, etc., but otherwise I try
> to modify only new files specific to my individual tool wrappers):
> 
> https://bitbucket.org/peterjc/galaxy-central/src/tools
> 
> [galaxy@ppserver galaxy-central]$ hg branch
> tools
> 
> [galaxy@ppserver galaxy-central]$ hg log -b tools | head -n 8
> changeset:   8807:d49200df0707
> branch:      tools
> tag:         tip
> parent:      8712:959ee7c79fd2
> parent:      8806:340438c62171
> user:        peterjc <p.j.a.c...@googlemail.com>
> date:        Thu Nov 15 09:38:57 2012 +0000
> summary:     Merged default into my tools branch
> 
> The only deliberate change was to try to debug this:
> 
> [galaxy@ppserver galaxy-central]$ hg diff
> diff -r d49200df0707 lib/galaxy/jobs/runners/drmaa.py
> --- a/lib/galaxy/jobs/runners/drmaa.py        Thu Nov 15 09:38:57 2012 +0000
> +++ b/lib/galaxy/jobs/runners/drmaa.py        Tue Nov 27 17:00:04 2012 +0000
> @@ -291,8 +291,15 @@
>         for drm_job_state in self.watched:
>             job_id = drm_job_state.job_id
>             galaxy_job_id = drm_job_state.job_wrapper.job_id
> +            if job_id is None or job_id=="None":
> +                log.exception("(%s/%r) Unable to check job status none" % ( galaxy_job_id, job_id ) )
> +                #drm_job_state.fail_message = "Cluster could not complete job (job_id None)"
> +                #Ignore it?
> +                #self.work_queue.put( ( 'fail', drm_job_state ) )
> +                continue
>             old_state = drm_job_state.old_state
>             try:
> +                assert job_id is not None and job_id != "None"
>                 state = self.ds.jobStatus( job_id )
>             # InternalException was reported to be necessary on some DRMs, but
>             # this could cause failures to be detected as completion!  Please
> 
> I'm about to go home for the day but should be able to look
> into this tomorrow, e.g. update to the latest default branch.

Great, thanks.

--nate

> 
> Thanks,
> 
> Peter

