OK, so I hit the walltime again and restarted my job after changing the
output directory.
It rewrote a few files, and after the contig.fasta file it stopped
writing anything.
No file has been modified for 2 days now, and the processes are
still running.
I logged in to an exec node and I can see my Ray processes. The CPU
is running at 100% for each process, but the CPU time is split half and
half between user and system. This doesn't seem right; system time
shouldn't be that high.
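
For anyone who wants to check the same user/system split on their own nodes, here is a minimal sketch. It assumes a Linux /proc filesystem, where fields 14 (utime) and 15 (stime) of /proc/<pid>/stat hold the CPU time in clock ticks. The name cpu_split and the PID in the usage note are hypothetical and not part of Ray.

```shell
#!/bin/sh
# Hypothetical helper: read one /proc/<pid>/stat line on stdin and print
# how the process's CPU time divides between user and system mode.
cpu_split() {
    # Drop everything up to the last ") " so that a command name containing
    # spaces cannot shift the field positions. After that, utime and stime
    # (fields 14 and 15 of the full line) are the 12th and 13th fields.
    sed 's/^.*) //' | awk '{
        total = $12 + $13
        if (total == 0) { print "user=0% system=0%"; exit }
        printf "user=%d%% system=%d%%\n", 100 * $12 / total, 100 * $13 / total
    }'
}

# Usage on a compute node (12345 is a placeholder PID):
#   cpu_split < /proc/12345/stat
```

The percentages cover the whole lifetime of the process, so a run that only recently started spinning will show a smaller system share than a live view such as top.
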
If I strace a process I see that they are all looping on poll:
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
events=POLLIN}, {fd=7, events=POLLIN}, {fd=10, events=POLLIN}, {fd=22,
events=POLLIN}, {fd=23, events=POLLIN}], 7, 0) = 0 (Timeout)
My Ray command:
Ray -read-write-checkpoints -route-messages -connection-type debruijn
-routing-graph-degree 32 -k 23 -p ... -o combined_23_2
This time I ran Ray with 50 nodes, 10 processes per node (I lowered the
number of nodes since last time).
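
Since every restart from checkpoints needs a new -o value (Ray refuses to start if the output directory already exists, as in the error quoted further down), a job script can pick the next unused name by itself. This is only a sketch: next_output_dir is a hypothetical helper, not something Ray provides, and the "_N" suffix simply follows the combined_23 / combined_23_2 naming used here.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ray): print the first output directory
# name that does not exist yet: base, base_2, base_3, ...
next_output_dir() {
    base=$1
    candidate=$base
    n=2
    while [ -e "$candidate" ]; do
        candidate="${base}_$n"
        n=$((n + 1))
    done
    printf '%s\n' "$candidate"
}

# Usage in a job script (remaining options elided as in the command above):
#   out=$(next_output_dir combined_23)
#   Ray -read-write-checkpoints ... -k 23 -p ... -o "$out"
```

The helper only chooses the name. The rest of the command line should stay the same between runs, since the checkpoints expect the same arguments in the same order.
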
Before I hit the walltime, Ray was scaffolding and everything seemed to
be going well.
I've used Ray quite a few times on smaller sets, I don't understand why
I'm having so much trouble this time around :-)
Louis
On 12-06-20 03:35 PM, Sébastien Boisvert wrote:
> When restarting Ray from checkpoints, you have to provide a different
> output directory.
>
> First job:
>
> mpiexec -n 4 Ray -o Sample_X.Ray -p file1.fastq file2.fastq \
> -read-write-checkpoints Sample_X.Checkpoints
>
> Second job:
>
> mpiexec -n 4 Ray -o Sample_X.Ray2 -p file1.fastq file2.fastq \
> -read-write-checkpoints Sample_X.Checkpoints
>
>
> If you are using v2.0.0-rc8, you don't have to provide
> the checkpoint directory because this option was added recently.
>
>
>
> Did you observe an improvement in latency for your jobs with message
> routing, compared to without it?
>
>
>
> Sébastien
>
> Louis Letourneau wrote:
>> I guess I don't get it :-)
>>
>> I had set the option:
>> -read-write-checkpoints
>>
>> The job died, so I restarted it with the exact same parameters (simple
>> since it's in a .sh script)
>>
>> It crashed and I got this in the logs:
>> Error, combined_23/ already exists, change the -o parameter to another
>> value.
>>
>>
>> What setting do I give Ray to resume the assembly from checkpoints?
>>
>> Louis
>>
>> On 12-06-12 02:47 PM, Sébastien Boisvert wrote:
>>> Yes, it does that.
>>>
>>> There will be binary files with the ".ray" extension in the
>>> directory where you launched Ray.
>>>
>>>
>>> You cannot change the k-mer length when starting from old checkpoints.
>>>
>>> The command needs to have the same number of arguments in the same order.
>>>
>>>
>>> On what kind of dataset are you exceeding time limits?
>>>
>>>
>>> Louis Letourneau wrote:
>>>> I saw these options on Ray
>>>> Checkpointing
>>>> -write-checkpoints
>>>> Write checkpoint files
>>>> -read-checkpoints
>>>> Read checkpoint files
>>>> -read-write-checkpoints
>>>> Read and write checkpoint files
>>>>
>>>>
>>>>
>>>> I'm hitting walltimes on the cluster I'm using and I'm wondering if by
>>>> setting:
>>>> -read-write-checkpoints
>>>>
>>>> I can resume where Ray got killed because of walltime?
>>>>
>>>> If that's the purpose, what a great feature! :-)
>>>>
>>>> Louis
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Live Security Virtual Conference
>>>> Exclusive live event will cover all the ways today's security and
>>>> threat landscape has changed and how IT managers can respond. Discussions
>>>> will include endpoint security, mobile security and the latest in malware
>>>> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
>>>> _______________________________________________
>>>> Denovoassembler-users mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/denovoassembler-users