Hi Gene,
I was sure Shawn had told me *not* to supply the ampersand when running the
coordinator, because --background was supposed to handle that. I tried adding
the ampersand anyway (I couldn't think of anything else), and it looks like the
coordinator is finally running after I start it up:
/usr/sbin/lsof | grep dmtcp_coo
dmtcp_coo 21728 rwleach cwd DIR 0,22 176128 1465016433 /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS
dmtcp_coo 21728 rwleach rtd DIR 8,1 4096 2 /
dmtcp_coo 21728 rwleach txt REG 0,21 2388774 5097449955 /ifs/util/util64/dmtcp/1.2.7/bin/dmtcp_coordinator (isilon-nas.ccr.buffalo.edu:/ifs)
dmtcp_coo 21728 rwleach mem REG 8,1 156872 262151 /lib64/ld-2.12.so
dmtcp_coo 21728 rwleach mem REG 8,1 598800 246190 /lib64/libm-2.12.so
dmtcp_coo 21728 rwleach mem REG 8,1 1918016 262152 /lib64/libc-2.12.so
dmtcp_coo 21728 rwleach mem REG 8,1 145720 262160 /lib64/libpthread-2.12.so
dmtcp_coo 21728 rwleach mem REG 8,1 93224 246206 /lib64/libgcc_s-4.4.6-20120305.so.1
dmtcp_coo 21728 rwleach mem REG 8,1 989840 397663 /usr/lib64/libstdc++.so.6.0.13
dmtcp_coo 21728 rwleach mem REG 8,1 65928 245790 /lib64/libnss_files-2.12.so
dmtcp_coo 21728 rwleach mem REG 8,1 23792 262147 /lib64/libnss_sss.so.2
dmtcp_coo 21728 rwleach 0u CHR 1,3 0t0 3772 /dev/null
dmtcp_coo 21728 rwleach 1w CHR 1,3 0t0 3772 /dev/null
dmtcp_coo 21728 rwleach 2w CHR 1,3 0t0 3772 /dev/null
dmtcp_coo 21728 rwleach 3u unix 0xffff880c2ddf0980 0t0 187632 socket
dmtcp_coo 21728 rwleach 4u IPv4 187640 0t0 TCP *:42692 (LISTEN)
dmtcp_coo 21728 rwleach 5u CHR 1,3 0t0 3772 /dev/null
dmtcp_coo 21728 rwleach 825w CHR 1,3 0t0 3772 /dev/null
dmtcp_coo 21728 rwleach 831r DIR 0,22 4096 3822667854 /panfs/panfs.ccr.buffalo.edu/scratch/rwleach/tmp
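
The entry that matters in that listing is the IPv4 one in LISTEN state, since
it shows the port the coordinator is actually bound to (42692 in this run).
As a rough sketch only, assuming lsof prints each entry on a single line and
assuming a POSIX sh rather than tcsh, something like the following could pull
that port out for comparison with what the coordinator reported. The function
name is made up for illustration:

```shell
# Hypothetical helper: read lsof output on stdin and print every TCP port
# that appears in the "*:PORT (LISTEN)" form (listening on all interfaces).
# Lines that do not match (files, /dev/null, unix sockets) print nothing.
listening_ports() {
    sed -n 's/.*TCP \*:\([0-9][0-9]*\) (LISTEN).*/\1/p'
}
```

It would be used as:

    /usr/sbin/lsof | grep dmtcp_coo | listening_ports

If that prints nothing, or a port other than the one passed to
dmtcp_checkpoint, the connection failure would at least be explained.
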
ps
PID TTY TIME CMD
21622 ? 00:00:00 tcsh
21712 ? 00:00:00 4038989.d15n41.
21728 ? 00:00:00 dmtcp_coordinat
21735 ? 00:00:00 ps
So it looks like I'm past this issue. Not entirely sure though.
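
To make the script less fragile, I'm thinking of having it wait for the port
file before calling dmtcp_checkpoint, instead of assuming the coordinator is
already up. A minimal sketch, assuming a POSIX sh (my real script is tcsh) and
assuming the file written by --port-file holds the port number on its first
line, which I haven't verified. The function name and the 10-second default
are my own invention:

```shell
# Hypothetical helper: wait up to $2 seconds (default 10) for the file $1
# to exist and be non-empty, then print its first line.
# ASSUMPTION: the first line of the port file is the coordinator's port.
read_port_file() {
    file=$1
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        if [ -s "$file" ]; then
            head -n 1 "$file"
            return 0
        fi
        sleep 1
        tries=$((tries - 1))
    done
    # Report the failure on stderr so the caller's captured output stays empty.
    echo "port file $file never appeared" >&2
    return 1
}
```

In the script that would look roughly like this, with the trailing ampersand
that seems to have done the trick (PORTFILE stands in for the long path):

    dmtcp_coordinator --port 0 --background --exit-on-last --port-file "$PORTFILE" &
    PORT=$(read_port_file "$PORTFILE") || exit 1
    dmtcp_checkpoint --join --port "$PORT" ...
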
Rob
On Jun 12, 2013, at 5:31 PM, Gene Cooperman wrote:
> Hi Robert,
> Thanks for writing. It's not obvious to me what's happening.
> But here's a quick question, for diagnosing it.
> After starting the coordinator, could you run:
> lsof | grep dmtcp_coo
> Alternatively, could you try: lsof | grep <PORT_NUM>
> where PORT_NUM is the supposed port number of the coordinator?
>
> Let's verify that the coordinator is truly listening on the port
> that it says it is.
>
> Kapil,
> Could you please check in your code with the --port-file option?
> Then we can make sure that we're all testing a common source, and there
> is no issue about different versions.
> Also, I presume you've already tested something similar to what
> Robert is doing below. Is that correct?
>
> Thanks,
> - Gene
>
> On Wed, Jun 12, 2013 at 05:17:26PM -0400, Robert William Leach wrote:
>> Hi,
>>
>> For the life of me, I cannot figure out why, when I run dmtcp_checkpoint, I
>> get an error about not being able to connect to the coordinator. Here are
>> snippets from my script - it's all in 1 script - and the output I get from
>> each of these commands. Help?
>>
>> dmtcp_coordinator --port 0 --background --exit-on-last --port-file
>> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_peaks.bed.pad150.formeme.memeout-8cores.port
>> --ckptdir
>> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_peaks.bed.pad150.formeme.memeout-8cores.ckpt1
>> --tmpdir /panasas/scratch/rwleach/tmp
>>
>> dmtcp_coordinator starting...
>> Port: 34511
>> Checkpoint Interval: disabled (checkpoint manually instead)
>> Exit on last client: 1
>> The port number was written to file
>> (/panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_peaks.bed.pad150.formeme.memeout-8cores.port)
>> Backgrounding...
>>
>> dmtcp_checkpoint --no-gzip --join --port 34511 --tmpdir
>> /panasas/scratch/rwleach/tmp --ckptdir
>> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_peaks.bed.pad150.formeme.memeout-8cores.ckpt1
>> --quiet /util/meme/4.6.0/bin/meme.bin
>> LNCaP_control-LNCaP_input_peaks.bed.pad150.formeme -dna -mod zoops -minw 6
>> -maxw 25 -revcomp -nostatus -p 8 -o
>> LNCaP_control-LNCaP_input_peaks.bed.pad150.formeme.memepeak150-8cores
>> -maxsize 30000000 1>
>> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_peaks.bed.pad150.formeme.memeout-8cores
>> 2>
>> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_peaks.bed.pad150.formeme.memeout-8cores.err
>> &
>>
>> [15030] ERROR at dmtcpcoordinatorapi.cpp:81 in
>> createNewConnectionToCoordinator; REASON='JASSERT(fd.isValid()) failed'
>> coordinatorAddr = d06n40b.ccr.buffalo.edu
>> coordinatorPort = 34511
>> Message: Failed to connect to DMTCP coordinator
>> meme.bin (15030): Terminating...
>>
>> env | grep DMTCP
>>
>> DMTCP_HOST=d06n40b.ccr.buffalo.edu
>> DMTCP=/util/dmtcp/1.2.7
>> DMTCP_CHECKPOINT_DIR=/panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_peaks.bed.pad150.formeme.memeout-8cores.ckpt1
>> DMTCP_GZIP=0
>> DMTCP_TMPDIR=/panasas/scratch/rwleach/tmp
>>
>>
>> http://SwingBuffalo.com/
>> - Phone Swing Buffalo or sign up for our email list via the contact page on
>> our website!
>> http://RhythmShuffle.com/
>> http://LindyFix.com/
>>
>>
>
>> ------------------------------------------------------------------------------
>> This SF.net email is sponsored by Windows:
>>
>> Build for Windows Store.
>>
>> http://p.sf.net/sfu/windows-dev2dev
>
>> _______________________________________________
>> Dmtcp-forum mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>
>