From: Sébastien Boisvert [mailto:[email protected]]
>> You mentioned yourself that there's no need to store the reverse complement
>> when in colour-space. To get the reverse complement in base space, you
>> reverse complement the first base, convert to base space, then reverse the
>> sequence.
> Complement the first base and reverse the color -- this is the recipe to
> "reverse-complement" a color-space read.
> I think I am starting to get it.

Be careful with this. Order of processing matters a lot with colour space.
You need to reverse the resulting *base-space* sequence, rather than
reversing the colour-space sequence and then working it out in base space
(e.g. the reverse complement of A3200233 is the reversed base-space decoding
of T3200233, i.e. ATAGGGAT, not the decoding of T3320023).
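
A minimal Python sketch of that ordering, assuming the usual 2-bit coding
(A=0, C=1, G=2, T=3) where each colour is the XOR of two adjacent bases. The
function names are made up for illustration; this is not Ray's code.

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def decode(read):
    """Convert a read like 'A3200233' (first base + colours) to base space."""
    bases = [read[0]]
    for colour in read[1:]:
        bases.append(BASES[BASES.index(bases[-1]) ^ int(colour)])
    return "".join(bases)

def reverse_complement_base_space(read):
    """Complement the first base, decode, then reverse the base-space result."""
    flipped = COMPLEMENT[read[0]] + read[1:]
    return decode(flipped)[::-1]

read = "A3200233"
print(decode(read))                         # ATCCCTAT
print(reverse_complement_base_space(read))  # ATAGGGAT
print(decode("T3320023"))                   # TATCCCTA (the wrong recipe)
```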

>> If there is a good chance of a match between two reads, and one read has an
>> unknown first base, then you can infer that base from the other read.
> Yes, but keep in mind that Ray never computes pairwise similarity.

Sure. In the scenario I described, both sequences would have exactly the same
colour-space representation (excluding first base) -- no pairwise differences
necessary. The only difference is that one can be converted unambiguously to
a base-space sequence (known first base), and the other has up to 4
base-space representations (unknown first base).
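
To make the "up to 4 representations" point concrete, here is a small Python
sketch. It assumes the same XOR colour coding (A=0, C=1, G=2, T=3); the
helper names are hypothetical and only illustrate the idea of borrowing the
first base from a read with an identical colour string.

```python
BASES = "ACGT"

def decode(first_base, colours):
    """Convert colours to base space, given a first base."""
    bases = [first_base]
    for colour in colours:
        bases.append(BASES[BASES.index(bases[-1]) ^ int(colour)])
    return "".join(bases)

def candidates(colours):
    """All four base-space readings of a read whose first base is unknown."""
    return [decode(base, colours) for base in BASES]

def infer_first_base(unknown_colours, known_read):
    """Borrow the first base only if the colour strings are identical."""
    if unknown_colours == known_read[1:]:
        return known_read[0]
    return None

print(candidates("3200233"))
# ['ATCCCTAT', 'CGAAAGCG', 'GCTTTCGC', 'TAGGGATA']
print(infer_first_base("3200233", "A3200233"))  # A
```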

> Like in Velvet, Ray uses 2 bits per symbol.

And also a flag for whether or not the kmer is in colour-space (or all kmers
in colour space), I presume. For each kmer (assuming you want to be able to
output in base-space), Ray will also need to record a first base, preferably
in a separate structure, but it could just be the first 2-bit symbol in the
sequence.
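
As a rough illustration of that layout (not Ray's actual data structure,
which I haven't checked), a kmer could carry its colours packed at 2 bits per
symbol, with the first base and the colour-space flag held alongside. All
names below are invented for the sketch.

```python
from typing import NamedTuple

class PackedKmer(NamedTuple):
    first_base: str     # 'A', 'C', 'G', 'T', or 'N' if unknown
    colours: int        # k colours, 2 bits each, first colour in the high bits
    length: int         # number of colours (k)
    colour_space: bool  # flag: the symbols are colours, not bases

def pack(first_base, colours):
    """Pack a string of colours (digits 0-3) into an integer."""
    value = 0
    for colour in colours:
        value = (value << 2) | int(colour)
    return PackedKmer(first_base, value, len(colours), True)

def unpack(kmer):
    """Recover the colour string from a packed kmer."""
    shifts = range(2 * (kmer.length - 1), -1, -2)
    return "".join(str((kmer.colours >> shift) & 3) for shift in shifts)

kmer = pack("A", "2112322311")
print(unpack(kmer))  # 2112322311
```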

> a path can obviously start in the middle of a read -- thus in that case
> the first base would remain unknown. (right?)

From each read, you can generate putative first bases for any subsequence of
an uninterrupted <first base>[0123]+ sequence. This requires converting the
sequence to base space, and inserting the converted base at the appropriate
position. I'll try to demonstrate this starting with a colour-space sequence:

A2112322311010133121320003202203201302321

This has starting base A. Colour 0 means the base is unchanged, complementary
transitions (A<->T, C<->G) have colour 3, and non-complementary transitions
are 1 or 2 depending on how far apart the bases are in the alphabet
[just FYI, that's how I remember it]:

AGTGATCTACAACCATACTGCTTTTAGGAGGCTTGCCTAGT [or something like that --
hopefully I converted it correctly]
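
The conversion can be checked with a few lines of Python. This sketch assumes
the standard coding A=0, C=1, G=2, T=3, under which a colour is the XOR of
the two adjacent base codes; the function name is just for illustration.

```python
BASES = "ACGT"

def colour_to_base(read):
    """Decode a read given as first base + colours, e.g. 'A2112'."""
    bases = [read[0]]
    for colour in read[1:]:
        previous = BASES.index(bases[-1])
        bases.append(BASES[previous ^ int(colour)])
    return "".join(bases)

print(colour_to_base("A2112322311010133121320003202203201302321"))
# AGTGATCTACAACCATACTGCTTTTAGGAGGCTTGCCTAGT
```

Note that 40 colours plus the first base decode to 41 bases.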

If I start with the colour-space sequence, I can work out the 'starting base'
at any position by converting to base-space. For example, before the run of
three 0s, you can insert a T:
<A>211232231101013312132<T>0003202203201302321
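
A short Python sketch of working out the base at an arbitrary position,
assuming the XOR coding A=0, C=1, G=2, T=3. Only the colours before the
position matter, so there is no need to build the whole base-space string.
The function name is hypothetical.

```python
BASES = "ACGT"

def base_at(read, position):
    """Base reached after `position` colours of a read like 'A2112...'."""
    code = BASES.index(read[0])
    for colour in read[1:1 + position]:
        code ^= int(colour)
    return BASES[code]

read = "A2112322311010133121320003202203201302321"
print(base_at(read, 21))  # T
print(read[22:])          # 0003202203201302321
```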

I'll try working through a scenario. Let's say I want the sequence split up
into groups of 10-mers:

2112322311  0101331213  2000320220  3201302321

I know the first base for the first group:

<A>2112322311  0101331213  2000320220  3201302321

I can convert that first group to base space (ten colours plus the first base
give eleven bases), and the last base of that converted group is the first
base for the next group:

(<A>2112322311 / AGTGATCTACA) <A>0101331213  2000320220  3201302321

and so on:

<A>2112322311 <A>0101331213 <C>2000320220 <G>3201302321

If there's a misread somewhere, any sequences past the misread will have
ambiguous colour-space -> base-space translations:

<A>2112322311 <A>01013X1213 <N>2000320220 <N>3201302321
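
Here is a Python sketch of that grouping, assuming the XOR coding A=0, C=1,
G=2, T=3 and treating any symbol outside 0-3 as a misread. The function name
and the use of 'N' for an unknown base are choices made for the example.

```python
BASES = "ACGT"

def split_with_first_bases(first_base, colours, k):
    """Split colours into groups of k, each tagged with its own first base.

    After a misread (any symbol outside 0-3), later first bases become 'N'.
    """
    groups = []
    current = first_base
    for start in range(0, len(colours), k):
        chunk = colours[start:start + k]
        groups.append((current, chunk))
        for colour in chunk:
            if current == "N" or colour not in "0123":
                current = "N"
            else:
                current = BASES[BASES.index(current) ^ int(colour)]
    return groups

clean = "2112322311010133121320003202203201302321"
misread = "211232231101013X121320003202203201302321"
for colours in (clean, misread):
    print(" ".join(f"<{base}>{chunk}"
                   for base, chunk in split_with_first_bases("A", colours, 10)))
# <A>2112322311 <A>0101331213 <C>2000320220 <G>3201302321
# <A>2112322311 <A>01013X1213 <N>2000320220 <N>3201302321
```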

The problem is that for a sufficiently large (or error-containing) dataset,
you'll get disagreements about the starting base for a given sequence. If Ray
were to record the counts for each observed starting base, it might be
possible to reduce this error (e.g. pick the most frequently occurring
starting base), bearing in mind that a starting base calculated near the
start of a read will be more reliable than one calculated near the end,
because fewer colours (and so fewer potential misreads) precede it.
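
One way the counting could look, as a hedged Python sketch. The table, the
function names and the optional `weight` argument (e.g. to down-weight
starting bases calculated far from the start of a read) are all assumptions
for illustration, not anything Ray currently does.

```python
from collections import Counter, defaultdict

# Hypothetical vote table: colour-space kmer -> counts of observed first bases.
votes = defaultdict(Counter)

def record(kmer_colours, first_base, weight=1):
    """Record one observation; unknown ('N') first bases carry no vote."""
    if first_base != "N":
        votes[kmer_colours][first_base] += weight

def consensus(kmer_colours):
    """Most frequently observed first base, or 'N' if none was recorded."""
    counts = votes.get(kmer_colours)
    if not counts:
        return "N"
    return counts.most_common(1)[0][0]

record("0101331213", "A")
record("0101331213", "A")
record("0101331213", "C")  # e.g. derived after an undetected misread
record("2000320220", "N")
print(consensus("0101331213"))  # A
print(consensus("2000320220"))  # N
```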

Hope this helps,

David Eccles (gringer)

_______________________________________________
Denovoassembler-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users