Re: [Denovoassembler-users] about Ray Using

Sébastien Boisvert Thu, 22 Mar 2012 07:16:28 -0700

Hello,

On Wed, 2012-03-21 at 12:15 +0800, Tiger-Fuliang Xie wrote:
> HI, Dr.Boisvert

I am no doctor. ;)

> Thanks for your software Ray. 

> It looks very powerful for assembling parallel genome for parallel DNA
> sequencing.

Yes, parallel software for parallel sequencers.

> I have a couple of questions  to ask you.

ok.

> 1. I have a linux server with 12 CPUs and 16 G RAM. Can I set up it on
> my platform? I don't quite understand MPI technology. 
>     if not so, should I install the program on a HPC.

Yes, you can install Ray on your server.

Let me explain the message-passing interface (MPI) technology.

Parallel software, in general, can run either on a single machine or
on many machines connected by a network.

1. single machine

On the single machine, threads (usually one per compute core) can
communicate using data structures stored in the same virtual memory
address space -- there is no inter-process communication when running a
program with, let's say, 8 threads.

Also, on a single machine, processes (usually one per compute core) can
be used instead of threads. However, processes cannot share their
virtual memory address space, so inter-process communication is needed.

MPI is just the easiest way of doing inter-process communication while
also being portable. And the same code will work on various hardware
too.
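
To make the difference between threads and processes concrete, here is a
minimal Python sketch. It is not MPI and it is not how Ray is written: the
function names are made up for illustration, and plain pipes stand in for
the messages that an MPI library would carry. The threads write into one
shared list, while each process has to receive its data as a message and
send its answer back as another message.

```python
import subprocess
import sys
import threading


def sum_with_threads(values, n_threads=4):
    """Threads store partial sums in one list that they all share."""
    partial = [0] * n_threads

    def work(t):
        # Every thread writes straight into the shared list.
        partial[t] = sum(values[t::n_threads])

    threads = [threading.Thread(target=work, args=(t,)) for t in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(partial)


# Program run by each child process: read integers from stdin, print their sum.
CHILD = "import sys; print(sum(int(x) for x in sys.stdin.read().split()))"


def sum_with_processes(values, n_procs=4):
    """Processes share no memory, so data travels as explicit messages."""
    children = [
        subprocess.Popen([sys.executable, "-c", CHILD],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
        for _ in range(n_procs)
    ]
    total = 0
    for p, child in enumerate(children):
        message = " ".join(str(v) for v in values[p::n_procs])
        reply, _ = child.communicate(message)  # send the slice, wait for the answer
        total += int(reply)
    return total
```

Both functions return the same sum. The second one only shows why some
form of message passing is needed once the workers are separate processes.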

2. many machines

With many machines, you need distributed processes. With a single
machine, all your processes are on the same machine. With many machines,
they are spread across the machines.
That is where MPI helps too: it handles inter-process communication
between machines.

> 2. Basically, when i download a parallel genome sequencing datasets,
> each dataset is huge up to 36.2 G (in SRA format, might be more size
> when it transferred to fastq).

You can also download your data from the European Nucleotide Archive,
which provides the FASTQ format directly. In my experience, conversion from
SRA to FASTQ is time-consuming.

>    so, can I use your example command" mpiexec -np NUMBER_OF_RANKS Ray
> -k KMERLENGTH -p l1_1.fastq l1_2.fastq -p l2_1.fastq l2_2.fastq -o
> test"?

Yes, that is how you run Ray.

mpiexec -n 12 Ray -k 31 -p l1_1.fastq l1_2.fastq -p l2_1.fastq l2_2.fastq -o test

>  that also means if i can add all parallel dataset together, like " -p
> l1_1.fastq l1_2.fastq -p l2_1.fastq l2_2.fastq". 

Not sure I understand that question, but you should not add reads from
different samples unless this is what you want to do.

You can have at most around 200 -p arguments, I think.

> also, there is just one file including pair-end sequences. what will
> the command be like, such as " -p11.fastq -p 12.fastq ?

If you have, let's say, 1000000 pairs of sequences, then you should have
two files with 1000000 sequences each. The argument -p takes these two
files.

If you have 4 sets of pairs of sequences, then you need 4 -p arguments:

 -p 1_1.fastq 1_2.fastq -p 2_1.fastq 2_2.fastq -p 3_1.fastq 3_2.fastq -p 4_1.fastq 4_2.fastq
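
If you have many libraries, typing the -p arguments by hand gets
error-prone. Here is a small Python sketch that builds the command from a
list of file pairs. The helper name build_ray_command is hypothetical (it
is not part of Ray), and it only uses the options shown in this message:
-n, -k, -p and -o.

```python
def build_ray_command(ranks, kmer_length, pairs, output_dir):
    """Return the mpiexec argument list, with one -p per pair of files."""
    command = ["mpiexec", "-n", str(ranks), "Ray", "-k", str(kmer_length)]
    for first, second in pairs:
        # Each -p takes exactly two files: first mates, then second mates.
        command += ["-p", first, second]
    command += ["-o", output_dir]
    return command


if __name__ == "__main__":
    pairs = [("l1_1.fastq", "l1_2.fastq"), ("l2_1.fastq", "l2_2.fastq")]
    print(" ".join(build_ray_command(12, 31, pairs, "test")))
```

The list can be passed to subprocess.run as-is, or joined with spaces and
pasted into a shell.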

>  does Ray will recognize the  pair-end sequences.
> 

Yes. Basically, sequence i in the first file is paired with sequence i
in the second file.

You can also provide interleaved sequences with argument -i.
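
To illustrate the pairing rule, here is a minimal Python sketch that walks
two FASTQ files in lockstep. It is not Ray's own parser. It assumes plain
4-line FASTQ records, and the function names are made up for this example.
The interleave helper shows the usual meaning of "interleaved" (first mate,
then second mate, alternating); I am assuming that ordering here, so check
the Ray documentation before relying on it with -i.

```python
from itertools import zip_longest


def read_fastq(path):
    """Yield (header, sequence, quality) from a plain 4-line FASTQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline()
            if not header.strip():
                return  # end of file (or a trailing blank line)
            sequence = handle.readline()
            handle.readline()  # the '+' separator line is skipped
            quality = handle.readline()
            yield header.strip(), sequence.strip(), quality.strip()


def paired_reads(path1, path2):
    """Pair sequence i of the first file with sequence i of the second."""
    for left, right in zip_longest(read_fastq(path1), read_fastq(path2)):
        if left is None or right is None:
            raise ValueError("the two files hold different numbers of sequences")
        yield left, right


def interleave(path1, path2):
    """Yield the same records in alternating order: mate 1, mate 2, ..."""
    for left, right in paired_reads(path1, path2):
        yield left
        yield right
```

The ValueError is there because the rule only makes sense when both files
contain the same number of sequences.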

> 3. as the size of inputting sequencing dataset grows, it is hard to
> satisfy the RAM requirement.

Indeed. That is where scaling across many machines comes in handy. In the US,
several supercomputers are accessible to academic researchers.

> Do you think   I use my linux server to do the job.
> 

It really depends on what you work with.

A single bacterial genome can take 3 GB to 10 GB, depending on the
setup.

A metagenome easily takes 30 GB, or 60 GB with taxonomic profiling in the
current code.

A mammalian genome takes > 200 GB, depending on the size.

> 4.what 's the least RAM for parallel DNA sequencing
> 
> 

See above.

Keep in mind that Ray distributes everything.

So if you globally need 30 GB, having 30 distributed compute cores with
1 GB each will do the job.

Each MPI rank initially allocates about 30-40 MB for communication
buffers, and 60 MB for a Bloom filter.
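
The arithmetic above can be sketched in a few lines of Python. This is a
rough planning aid only, with made-up function names. It assumes the
global requirement divides evenly between ranks, and it keeps the startup
allocation (taken here as 40 MB of buffers plus 60 MB for the Bloom filter)
as a separate figure, because it is not stated above whether the global
numbers already include it.

```python
def share_per_rank_gb(total_gb, ranks):
    """Memory each rank holds if the global need is split evenly."""
    return total_gb / ranks


def startup_allocation_mb(ranks, buffers_mb=40, bloom_mb=60):
    """Memory allocated by all ranks at startup, before any data is loaded."""
    return ranks * (buffers_mb + bloom_mb)


def fits_on_machine(total_gb, available_gb):
    """True if the global requirement fits in one machine's RAM."""
    return total_gb <= available_gb


if __name__ == "__main__":
    print(share_per_rank_gb(30, 30))    # the 30 GB on 30 cores case
    print(startup_allocation_mb(12))    # 12 ranks on a 12-CPU server
    print(fits_on_machine(10, 16))      # bacterial genome, upper estimate
    print(fits_on_machine(30, 16))      # metagenome
```

With the figures from this message, a bacterial genome fits on a 16 GB
server and a metagenome does not.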

> many thanks
> -- 
> Fuliang Xie 
> Department of Biology,
> East Carolina University,
> Greenville, NC, 27858
> 
> 

