Hi Marlon,

I am trying to recall what I can about the job submission
implementations.
As far as I can remember, we implemented a two-phase commit protocol with
GSISSH. With the two-phase commit protocol, we first get the job ID and then
submit the job in a single atomic step (so losing the job ID is not a problem).
We had some discussions about implementing the same for JSch (at least a
design), but I am not sure whether it was ever integrated into Airavata.
Maybe Lahiru (or whoever is working on GFac) can give more information
about this.

However, the two-phase commit protocol only works if the underlying job
scheduler is able to give a job ID without actually submitting the job. As
far as I can remember, Moab is capable of doing that, but I am not sure
about job schedulers such as Slurm.
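
To make the idea concrete, here is a rough sketch in Python of how I think
of the two-phase flow. It is only an illustration under assumptions: the
scheduler is an in-memory fake, and every name in it (FakeScheduler,
FlakyChannel, LostResponse, submit_two_phase) is hypothetical. It is not
the GSISSH or GFac code, and it does not use any real Moab or Slurm command.

```python
class LostResponse(Exception):
    """Raised when a reply never makes it back (e.g. a connection timeout)."""


class FakeScheduler:
    """Hypothetical in-memory scheduler that can reserve a job ID up front."""

    def __init__(self):
        self.counter = 0
        self.reserved = set()
        self.running = {}

    def reserve_job_id(self):
        # Phase 1: hand out an ID without starting any job.
        self.counter += 1
        job_id = f"job-{self.counter}"
        self.reserved.add(job_id)
        return job_id

    def submit(self, job_id, script):
        # Phase 2: start the job under the reserved ID.
        # setdefault makes a repeated submit of the same ID a no-op.
        if job_id not in self.reserved:
            raise KeyError(f"unknown job id: {job_id}")
        self.running.setdefault(job_id, script)

    def is_submitted(self, job_id):
        return job_id in self.running


class FlakyChannel:
    """Hypothetical remote channel that drops the first submit reply."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.drop_next_submit_reply = True

    def reserve_job_id(self):
        return self.scheduler.reserve_job_id()

    def submit(self, job_id, script):
        self.scheduler.submit(job_id, script)  # the job really starts
        if self.drop_next_submit_reply:
            self.drop_next_submit_reply = False
            raise LostResponse("timed out waiting for the submit reply")

    def is_submitted(self, job_id):
        return self.scheduler.is_submitted(job_id)


def submit_two_phase(channel, script, store):
    # Phase 1: get the job ID and record it before anything is launched.
    job_id = channel.reserve_job_id()
    store["job_id"] = job_id
    # Phase 2: submit under that ID. If the reply is lost, we still hold
    # the ID, so we can ask the scheduler what happened instead of guessing.
    try:
        channel.submit(job_id, script)
    except LostResponse:
        if not channel.is_submitted(job_id):
            channel.submit(job_id, script)
    return job_id


scheduler = FakeScheduler()
store = {}
job_id = submit_two_phase(FlakyChannel(scheduler), "run.sh", store)
```

The point of the sketch is that the ID is stored before the job is
launched, so a timeout during the second step leaves us with something to
query rather than an orphaned job.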

IMO, a "longer connection timeout" is not the perfect solution, but it could
be a good workaround.

Thanks
-Thejaka Amila

On Thu, May 12, 2016 at 2:04 PM, Pierce, Marlon <marpi...@iu.edu> wrote:

> We have an occasional issue of connection timeouts when performing remote
> SSH operations. This has a potentially bad side effect of successfully
> launching a large job but not getting back the Job ID. One straightforward
> fix is to use a longer connection timeout than the default in the JSch
> clients.
>
> Looking through the code, I don’t see that we are doing this. Is this
> correct? Would there be some unintended consequences for using something
> longer, like 60 seconds? The default is 20 seconds.
>
> There is also a longer discussion about the right way to handle these
> events in the first place. We may not want to depend on the standard output
> at all. Increasing the timeouts would at least put a bandaid on the current
> issue.
>
> Marlon
