Thank you very much for the complete response.
I already started to play with -route-messages; it really makes a difference!
I thought the default was -connection-type debruijn already, so I didn't
set it.

As for the -one-color-per-file option, I was just trying stuff out.

You are right about Guillimin; it isn't fixed yet. I try once in a while.

The only part I don't quite get, and I'll need to read up on this, is
-routing-graph-degree. I'm not an MPI developer yet, but I do program
OpenMP jobs and the like, so I hope the learning curve isn't too steep :-)

Louis

On 12-06-18 10:42 AM, Sébastien Boisvert wrote:
> Hi Louis,
> 
> I can see that you are running large jobs.
> 
> I think the problem is about message transit latency.
> 
> Some definitions:
> 
> *latency:* the time required to receive a response from an MPI rank to
> which a message was sent. Usually measured in microseconds.
> 
> *route:* the path used to transit a message from MPI rank A to MPI rank B.
> 
> *degree:* the number of direct connections of an MPI rank.
> 
> *diameter:* the maximum number of arcs over all shortest routes in a
> communication graph.
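> 
> As a worked example (my own illustration, not taken from the Ray
> documentation): a de Bruijn graph over an alphabet of size k with words
> of length d has k^d vertices, each with degree k, and its diameter is d.
> So 1000 MPI ranks can be laid out as the de Bruijn graph B(10, 3), since
> 10^3 = 1000: each rank is directly connected to only 10 others, and any
> message reaches its destination in at most 3 hops.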
> 
> 
> *Possible solution*
> 
> #PBS -N LargeJob3
> #PBS -l nodes=100
> #PBS -l walltime=120:00:00
> 
> 
> -route-messages -connection-type debruijn -routing-graph-degree 8
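> 
> Putting the pieces together, a complete submission script might look
> like the sketch below. This is just my own assembly of your command
> with the routing options added; the -p arguments are abbreviated as in
> your message and stand in for your real file names.
> 
> #!/bin/bash
> #PBS -N LargeJob3
> #PBS -l nodes=100
> #PBS -l walltime=120:00:00
> export ppn=10
> 
> # Same command as before, with software message routing turned on and
> # without -one-color-per-file (that option is only for colored graphs).
> mpiexec -n $[PBS_NUM_NODES*ppn] -npernode $ppn Ray-v2.0.0-rc8/Ray -k 23 \
>  -p pairs -p pairs -p pairs -p mates -p mates \
>  -route-messages -connection-type debruijn -routing-graph-degree 8 \
>  -o combined_23 -write-checkpoints > combined_23.out 2>&1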
> 
> 
> *Large job latency and its solution*
> 
> Running large jobs can be tricky if the hardware was not designed for
> them in the first place, but there is already a solution to that.
> 
> The take-home message is that you need to turn on the software message
> routing implemented in RayPlatform, which only requires adding an
> option (see below).
> 
> See the Ray documentation
> https://github.com/sebhtml/ray/blob/master/Documentation/Very-Large-Jobs.txt
> 
> See this document for acceptable communication graphs for routing your
> messages:
> 
> https://github.com/sebhtml/ray/blob/master/Documentation/Routing.txt
> 
> 
> Below, I first describe what I think is happening with your large jobs and
> then I provide a possible solution.
> 
> 
> Louis Letourneau wrote:
>> ~300MB genome
>>
>> 2 libraries of pairs (HiSeq 2000) (2nd library on 2 lanes)
>> 96499384 x 2
>> 47407253 x 2
>> 13776832 x 2
>>
>> 2 libraries of mates 5kb (done with GAIIx)
>> 26300454 x 2
>> 25851346 x 2
>>
>> Total 419670538 reads, 41967053800 bases, 139x (I know it's a bit high)
>>
>> On Mammouth, if you know the cluster,
> 
> *Comparing super computers*
> 
> I tried Mammouth in the past, but our group only has a special
> allocation on colosse.
> I did some tests and there were no differences between the Opteron(R)
> processors on Mammouth and the Xeon(R) Nehalem processors on colosse,
> because the Ray code does not perform any floating-point operations.
> Both Mammouth and colosse have the Infiniband(R) ConnectX(R) from
> Mellanox, Inc. for the network interconnect.
> 
> We also tried guillimin, but there was (and still is) a bug in the
> middleware provided by Intel, Inc. that caused (and still causes) some
> messages to be corrupted.
> This makes Open-MPI crash, so Ray jobs were still unable to complete on
> guillimin the last time I checked.
> We never got any update about it.
> 
> In this case, the network latency on Mammouth will be awful because
> the hardware is Mellanox ConnectX. This hardware is good for
> transferring large messages (in the megabyte range), but otherwise its
> latency is not very good. Colosse has the same hardware. And I think
> there are other issues as well, such as not running the PCI Express
> link at the appropriate rate.
> 
> 
>>  using 75 nodes, 10 of 24 cores
>> (because of ram, I cannot use all 24 cores).
> 
> *Large jobs and latency*
> 
> My personal record is higher.
> I simulated a metagenome with 3 000 000 000 reads and 1000 bacterial
> genomes.
> And the Assemblathon bird dataset had 3 072 136 293 reads.
> 
> For large jobs, the problem is about latency, not about anything else,
> unless your data contain a lot of sequencing errors.
> 
> The Illumina(R) HiSeq(R) 2000 from Illumina, Inc. has an error rate of
> about 0.2-0.5% in my experience.
> 
> The Illumina(R) Genome Analyzer(R) IIx from Illumina, Inc. has an error
> rate of about 0.5-1.5% in my experience.
> 
> 
> *Why latency matters*
> 
> The latency increases for the following reasons:
> 
> In Ray, each MPI rank sends about 5000-8000 messages per second to
> other ranks.
> Message reception is done with a round-robin policy. Each MPI rank is
> clocked to run at the highest frequency allowed.
> 
> For example, on colosse with the fancy Xeon(R) Nehalem from Intel,
> Inc., the granularity is around 500 nanoseconds. In other words, about
> 1.7 million cycles are completed each second for each MPI rank.
> On colosse, the latency is between 15 and 120 microseconds, depending
> on the number of processor cores.
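> 
> As a quick sanity check on those numbers (my own arithmetic): at 500
> nanoseconds per cycle, a rank completes 1 s / 500 ns = 2 000 000 cycles
> per second, so the ~1.7 million figure is plausible once overhead is
> accounted for. A single 120-microsecond round trip then spans roughly
> 120 000 / 500 = 240 cycles of the main loop.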
> 
> You can obtain runtime information about scheduling in the Scheduling/
> directory.
> The network latency is reported in NetworkTest.txt
> 
> When the latency increases, the number of messages sent per second
> decreases and the
> wall-clock time increases too.
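> 
> A back-of-the-envelope bound (my own arithmetic, assuming roughly one
> outstanding request per rank at a time): messages per second is at
> most about 1 / latency. At 15 microseconds that is ~66 000 messages
> per second; at 120 microseconds it drops to ~8 300, which lines up
> with the 5000-8000 messages per second figure above.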
> 
> So basically your processor cores are just waiting for interrupts from the
> Infiniband cards.
> 
> *Large jobs and memory usage*
> 
> Also, the memory usage with this number of processor cores will be
> ridiculous because of the number of connections implied: with 75 nodes
> and 10 processes per node, this is 750 MPI ranks, and 750 x 750 =
> 562500 communication links.
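> 
> This is where routing helps (my own arithmetic, following the
> definitions above): with -routing-graph-degree 8, each of the 750
> ranks keeps direct connections to only about 8 other ranks, on the
> order of 750 x 8 = 6000 links instead of 562500, at the cost of a few
> extra hops per message.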
> 
> 
> So you are using 75 nodes with 10 Ray processes per node, for a total
> of 750 Ray processes. With a large number of cores like this, you
> either need to run your job on hardware that supports transparent
> message routing, or you can simply enable the software message routing
> already available in Ray. This capability is implemented in
> RayPlatform, the parallel engine on which Ray runs.
> 
> 
>> 120hours walltime is not enough
>>
>> command (I removed file names and some PBS settings for brevity):
>> #PBS -l nodes=100
>> #PBS -l walltime=120:00:00
>> export ppn=10
> 
> With 100 nodes and 10 processes per node, you have 1000 MPI ranks and
> 1000 x 1000 = 1000000 communication links.
> 
>> mpiexec -n $[PBS_NUM_NODES*ppn] -npernode $ppn Ray-v2.0.0-rc8/Ray -k 23
>> -p pairs -p pairs -p pairs -p mates -p mates -o combined_23
>> -one-color-per-file -write-checkpoints > combined_23.out 2>&1
> 
> In the development version, checkpoints are written to a dedicated
> directory instead of the current working directory. It is more
> user-friendly that way.
> 
> 
> First, -one-color-per-file is for colored de Bruijn graphs, which I
> think you are not using. These are helpful for profiling, for example,
> genera in a microbiome. Our paper about that is under review.
> 
> 
> It seems that you are using a lot of processor cores.
> 
> 
>> Louis
>>
>> On 6/12/2012 2:47 PM, Sébastien Boisvert wrote:
>>> Yes, it does that.
>>>
>>> There will be binary files with the ".ray" extension in the
>>> directory where you launched Ray.
>>>
>>>
>>> You cannot change the k-mer length when starting from old checkpoints.
>>>
>>> The command needs to have the same number of arguments in the same order.
>>>
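>>> For example (a sketch on my part; the file names are placeholders,
>>> and this assumes the killed job left its .ray checkpoint files in
>>> place), you can run with -read-write-checkpoints from the start, and
>>> then resubmit the identical command after a walltime kill:
>>>
>>> mpiexec -n 64 Ray -k 31 -p left.fastq right.fastq -o assembly \
>>>   -read-write-checkpoints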
>>>
>>> On what kind of dataset are you exceeding time limits ?
>>>
>>>
>>> Louis Letourneau wrote:
>>>> I saw these options on Ray
>>>> Checkpointing
>>>>         -write-checkpoints
>>>>                Write checkpoint files
>>>>         -read-checkpoints
>>>>                Read checkpoint files
>>>>         -read-write-checkpoints
>>>>                Read and write checkpoint files
>>>>
>>>>
>>>>
>>>> I'm hitting walltimes on the cluster I'm using and I'm wondering if by
>>>> setting:
>>>> -read-write-checkpoints
>>>>
>>>> I can resume where Ray got killed because of walltime?
>>>>
>>>> If that's the purpose, what a great feature! :-)
>>>>
>>>> Louis
>>>>
>>>
> 
