Hi Gene,

I was sure Shawn had told me *not* to supply the ampersand when running the
coordinator, because --background was supposed to handle that. But I tried
adding the ampersand anyway (I couldn't think of anything else to try), and
it looks like the coordinator is finally running after startup:

/usr/sbin/lsof | grep dmtcp_coo
dmtcp_coo 21728 rwleach  cwd       DIR               0,22   176128 1465016433 /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS
dmtcp_coo 21728 rwleach  rtd       DIR                8,1     4096          2 /
dmtcp_coo 21728 rwleach  txt       REG               0,21  2388774 5097449955 /ifs/util/util64/dmtcp/1.2.7/bin/dmtcp_coordinator (isilon-nas.ccr.buffalo.edu:/ifs)
dmtcp_coo 21728 rwleach  mem       REG                8,1   156872     262151 /lib64/ld-2.12.so
dmtcp_coo 21728 rwleach  mem       REG                8,1   598800     246190 /lib64/libm-2.12.so
dmtcp_coo 21728 rwleach  mem       REG                8,1  1918016     262152 /lib64/libc-2.12.so
dmtcp_coo 21728 rwleach  mem       REG                8,1   145720     262160 /lib64/libpthread-2.12.so
dmtcp_coo 21728 rwleach  mem       REG                8,1    93224     246206 /lib64/libgcc_s-4.4.6-20120305.so.1
dmtcp_coo 21728 rwleach  mem       REG                8,1   989840     397663 /usr/lib64/libstdc++.so.6.0.13
dmtcp_coo 21728 rwleach  mem       REG                8,1    65928     245790 /lib64/libnss_files-2.12.so
dmtcp_coo 21728 rwleach  mem       REG                8,1    23792     262147 /lib64/libnss_sss.so.2
dmtcp_coo 21728 rwleach    0u      CHR                1,3      0t0       3772 /dev/null
dmtcp_coo 21728 rwleach    1w      CHR                1,3      0t0       3772 /dev/null
dmtcp_coo 21728 rwleach    2w      CHR                1,3      0t0       3772 /dev/null
dmtcp_coo 21728 rwleach    3u     unix 0xffff880c2ddf0980      0t0     187632 socket
dmtcp_coo 21728 rwleach    4u     IPv4             187640      0t0        TCP *:42692 (LISTEN)
dmtcp_coo 21728 rwleach    5u      CHR                1,3      0t0       3772 /dev/null
dmtcp_coo 21728 rwleach  825w      CHR                1,3      0t0       3772 /dev/null
dmtcp_coo 21728 rwleach  831r      DIR               0,22     4096 3822667854 /panfs/panfs.ccr.buffalo.edu/scratch/rwleach/tmp

ps
  PID TTY          TIME CMD
21622 ?        00:00:00 tcsh
21712 ?        00:00:00 4038989.d15n41.
21728 ?        00:00:00 dmtcp_coordinat
21735 ?        00:00:00 ps

So it looks like I'm past this issue, though I'm not entirely sure.
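
In case it helps anyone scripting the same sequence, here is a minimal sketch
of the check Gene suggested: wait for the port file to appear, read the port
from it, and only then start dmtcp_checkpoint. It is written for POSIX sh (the
ps output above shows tcsh, so this would need to live in its own sh script).
wait_for_port_file is a hypothetical helper name, and the sketch assumes the
file written by --port-file contains nothing but the port number.

```shell
#!/bin/sh
# Hypothetical helper (not part of DMTCP): wait until the coordinator's
# port file exists and is non-empty, then print the port number.
# Assumption: the file contains only the port number.
# Usage: wait_for_port_file FILE [TIMEOUT_SECONDS]
wait_for_port_file() {
    port_file=$1
    timeout=${2:-10}
    waited=0
    while [ ! -s "$port_file" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for $port_file" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # Keep digits only, dropping any trailing newline or whitespace.
    tr -cd '0-9' < "$port_file"
    echo
}

# Example use (PORT_FILE is a placeholder for the --port-file path):
#   port=$(wait_for_port_file "$PORT_FILE" 30) || exit 1
#   /usr/sbin/lsof -i TCP:"$port" | grep LISTEN || exit 1
#   dmtcp_checkpoint --join --port "$port" ... &
```

Reading the port back from the file, rather than hard-coding it, matters here
because --port 0 can give a different port on every run (34511 in my first
message, 42692 in the lsof output above).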

Rob

On Jun 12, 2013, at 5:31 PM, Gene Cooperman wrote:

> Hi Robert,
>    Thanks for writing.  It's not obvious to me what's happening.
> But here's a quick question, for diagnosing it.
> After starting the coordinator, could you run:
>   lsof | grep dmtcp_coo
> Alternatively, could you try:  lsof | grep <PORT_NUM>
> where PORT_NUM is the supposed port number of the coordinator?
> 
> Let's verify that the coordinator is truly listening on the port
> that it says it is.
> 
> Kapil,
>    Could you please check in your code with the --port-file option?
> Then we can make sure that we're all testing a common source, and there
> is no issue about different versions.
>    Also, I presume you've already tested something similar to what
> Robert is doing below.  Is that correct?
> 
> Thanks,
> - Gene
> 
> On Wed, Jun 12, 2013 at 05:17:26PM -0400, Robert William Leach wrote:
>> Hi,
>> 
>> For the life of me, I cannot figure out why, when I run dmtcp_checkpoint, I 
>> get an error about not being able to connect to the coordinator.  Here are 
>> snippets from my script - it's all in 1 script - and the output I get from 
>> each of these commands.  Help?
>> 
>> dmtcp_coordinator --port 0 --background --exit-on-last \
>>   --port-file /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_peaks.bed.pad150.formeme.memeout-8cores.port \
>>   --ckptdir /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_peaks.bed.pad150.formeme.memeout-8cores.ckpt1 \
>>   --tmpdir /panasas/scratch/rwleach/tmp
>> 
>> dmtcp_coordinator starting...
>>    Port: 34511
>>    Checkpoint Interval: disabled (checkpoint manually instead)
>>    Exit on last client: 1
>> The port number was written to file (/panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_peaks.bed.pad150.formeme.memeout-8cores.port)
>> Backgrounding...
>> 
>> dmtcp_checkpoint --no-gzip --join --port 34511 \
>>   --tmpdir /panasas/scratch/rwleach/tmp \
>>   --ckptdir /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_peaks.bed.pad150.formeme.memeout-8cores.ckpt1 \
>>   --quiet /util/meme/4.6.0/bin/meme.bin \
>>   LNCaP_control-LNCaP_input_peaks.bed.pad150.formeme -dna -mod zoops -minw 6 \
>>   -maxw 25 -revcomp -nostatus -p 8 \
>>   -o LNCaP_control-LNCaP_input_peaks.bed.pad150.formeme.memepeak150-8cores \
>>   -maxsize 30000000 \
>>   1> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_peaks.bed.pad150.formeme.memeout-8cores \
>>   2> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_peaks.bed.pad150.formeme.memeout-8cores.err &
>> 
>> [15030] ERROR at dmtcpcoordinatorapi.cpp:81 in createNewConnectionToCoordinator; REASON='JASSERT(fd.isValid()) failed'
>>     coordinatorAddr = d06n40b.ccr.buffalo.edu
>>     coordinatorPort = 34511
>> Message: Failed to connect to DMTCP coordinator
>> meme.bin (15030): Terminating...
>> 
>> env | grep DMTCP
>> 
>> DMTCP_HOST=d06n40b.ccr.buffalo.edu
>> DMTCP=/util/dmtcp/1.2.7
>> DMTCP_CHECKPOINT_DIR=/panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_peaks.bed.pad150.formeme.memeout-8cores.ckpt1
>> DMTCP_GZIP=0
>> DMTCP_TMPDIR=/panasas/scratch/rwleach/tmp
>> 
> 
>> _______________________________________________
>> Dmtcp-forum mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> 
> 

