Hi Marlon,

I am trying to recall what I remember about the job submission implementations. As far as I can remember, we implemented a two-phase commit protocol with GSISSH. With the two-phase commit protocol, we first get the job ID and then submit the job in a single atomic step (so losing the job ID is not a problem). We had some discussions about implementing the same for JSch (at least a design), but I am not sure whether it was really integrated into Airavata. Maybe Lahiru (or whoever is working on GFac) can give more information about this.
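
To make the idea concrete, here is a rough sketch in Java of how I picture the two-phase flow. It is not the GSISSH implementation. The Scheduler interface and its reserveJobId/commit/isCommitted methods are names I made up for illustration (they are not Airavata, JSch, or Moab APIs), and the in-memory scheduler only stands in for a real one. The point is that the client holds the job ID before anything is launched, so a lost reply in the second phase can be resolved by asking about that ID.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical scheduler interface, for illustration only.
interface Scheduler {
    String reserveJobId();                     // phase 1: hand out an ID, start nothing
    void commit(String jobId, String script);  // phase 2: launch the job under that ID
    boolean isCommitted(String jobId);         // lets the client check after a lost reply
}

// In-memory stand-in for a real scheduler (assumption: commit is idempotent).
class InMemoryScheduler implements Scheduler {
    private final Map<String, String> jobs = new HashMap<>();
    private int nextId = 1;

    public String reserveJobId() {
        String id = "job-" + nextId++;
        jobs.put(id, null); // reserved, but not yet committed
        return id;
    }

    public void commit(String jobId, String script) {
        if (!jobs.containsKey(jobId)) {
            throw new IllegalArgumentException("unknown job ID: " + jobId);
        }
        if (jobs.get(jobId) == null) {
            jobs.put(jobId, script); // a repeated commit does not launch a second job
        }
    }

    public boolean isCommitted(String jobId) {
        return jobs.get(jobId) != null;
    }

    // Number of jobs that were actually launched.
    int committedCount() {
        int count = 0;
        for (String script : jobs.values()) {
            if (script != null) count++;
        }
        return count;
    }
}

class TwoPhaseSubmitter {
    // Returns the job ID once the job is known to be committed.
    static String submit(Scheduler scheduler, String script, int maxAttempts) {
        String jobId = scheduler.reserveJobId();
        // A real client would persist jobId here, before phase 2.
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                scheduler.commit(jobId, script);
                return jobId;
            } catch (RuntimeException e) {
                // The reply may have been lost after the job started,
                // so check the ID we already hold before retrying.
                if (scheduler.isCommitted(jobId)) {
                    return jobId;
                }
            }
        }
        throw new IllegalStateException("could not commit " + jobId);
    }
}
```

Calling TwoPhaseSubmitter.submit(new InMemoryScheduler(), script, 3) returns the reserved ID. If commit times out after the job has started, the client still knows which ID to ask about.
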
However, the two-phase commit protocol only works if the underlying job scheduler is able to give out a job ID without actually submitting the job. As far as I can remember, Moab is capable of doing that, but I am not sure about job schedulers such as Slurm. IMO, a "longer connection timeout" is not the perfect solution, but it could be a good workaround.

Thanks
-Thejaka Amila

On Thu, May 12, 2016 at 2:04 PM, Pierce, Marlon <marpi...@iu.edu> wrote:

> We have an occasional issue of connection timeouts when performing remote
> SSH operations. This has a potentially bad side effect of successfully
> launching a large job but not getting back the Job ID. One straightforward
> fix is to use longer than the default connection timeout in the Jsch
> clients.
>
> Looking through the code, I don’t see that we are doing this. Is this
> correct? Would there be some unintended consequences for using something
> longer, like 60 seconds? The default is 20 seconds.
>
> There is also a longer discussion about the right way to handle these
> events in the first place. We may not want to depend on the standard output
> at all. Increasing the timeouts would at least put a bandaid on the current
> issue.
>
> Marlon