On Tue, 28 Jun 2011 19:49:42 David Eccles (gringer) wrote:
> I need to go a bit deeper and replicate this kind of format for Kmers
> for it to be really useful, but I was able to fit things in this far
> without changing too much other code.
This will probably be quite a far-reaching change. To conserve
memory/space, I think it would be reasonable to put the colour-space and
first base flags into the m_u64 of the Kmers, something like this:
01 23 45 67 89 01 23 45 67 89 01 ...
CK FB  1  2  3  4  5  6  7  8  9 ...
I suggest keeping things at a 2-bit boundary because it makes working
out base-space / colour-space positions a little easier.
Bit 0: Colour-space flag (1 -- kmer in colour space, 0 otherwise)
Bit 1: First-base known flag (1 -- first base is known, 0 otherwise)
       [always set to 1 if bit 0 is 0 -- i.e. in base-space]
Bit 2-3:
    bit 1 == 1:
        First-base 00/01/10/11 -> A/C/G/T [as usual]
    otherwise:
        ??? possibly checksum (modulo 4 sum of 2-bit chunks)
Bit 4 onwards:
    sequence in colour-space, or remaining sequence in base-space
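To make the layout above concrete, here is a minimal sketch of accessors
for one packed 64-bit word. It assumes "bit 0" is the least significant
bit, and all names (packHeader, symbolAt, withSymbol, ...) are
hypothetical; none of them come from the existing Kmer class.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: "bit 0" is the least significant bit of the 64-bit word.
constexpr uint64_t COLOUR_SPACE_FLAG     = 1ULL << 0; // bit 0
constexpr uint64_t FIRST_BASE_KNOWN_FLAG = 1ULL << 1; // bit 1
constexpr int FIRST_BASE_SHIFT = 2;                   // bits 2-3
constexpr int SEQUENCE_SHIFT   = 4;                   // bit 4 onwards
constexpr int MAX_SYMBOLS      = 30;                  // (64 - 4) / 2

inline bool isColourSpace(uint64_t w)    { return (w & COLOUR_SPACE_FLAG) != 0; }
inline bool isFirstBaseKnown(uint64_t w) { return (w & FIRST_BASE_KNOWN_FLAG) != 0; }

// 2-bit code held in bits 2-3 (00/01/10/11 -> A/C/G/T when bit 1 is set).
inline unsigned firstBaseCode(uint64_t w) {
    return static_cast<unsigned>((w >> FIRST_BASE_SHIFT) & 3);
}

// Build a word holding only the flags and the first-base field.
inline uint64_t packHeader(bool colourSpace, bool firstBaseKnown, unsigned firstBase) {
    assert(firstBase < 4);
    assert(colourSpace || firstBaseKnown); // base-space always knows its first base
    uint64_t w = 0;
    if (colourSpace)    w |= COLOUR_SPACE_FLAG;
    if (firstBaseKnown) w |= FIRST_BASE_KNOWN_FLAG;
    return w | (static_cast<uint64_t>(firstBase) << FIRST_BASE_SHIFT);
}

// Read the i-th 2-bit symbol (colour, or remaining base) stored from bit 4.
inline unsigned symbolAt(uint64_t w, int i) {
    assert(i >= 0 && i < MAX_SYMBOLS);
    return static_cast<unsigned>((w >> (SEQUENCE_SHIFT + 2 * i)) & 3);
}

// Return a copy of w with the i-th 2-bit symbol replaced by code.
inline uint64_t withSymbol(uint64_t w, int i, unsigned code) {
    assert(i >= 0 && i < MAX_SYMBOLS);
    assert(code < 4);
    const int shift = SEQUENCE_SHIFT + 2 * i;
    return (w & ~(3ULL << shift)) | (static_cast<uint64_t>(code) << shift);
}
```

Keeping every field on a 2-bit boundary means symbol i always sits at
bit 4 + 2*i, in both base-space and colour-space.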
Adding this will use 2 extra bits, making the max kmer length for one
64-bit value 31 bases:
#define KMER_REQUIRED_BITS (2*MAXKMERLENGTH+2)
Note that the first base is stored in bits 2-3, so that increases the
kmer size by 1 for base-space sequences, making the effective length for
a given Kmer the same in both base-space and colour-space.
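As a quick check of the arithmetic: 2 flag bits, 2 bits for the first
base and 2 bits for each of the remaining MAXKMERLENGTH-1 positions add
up to 2*MAXKMERLENGTH+2. The sketch below restates that as compile-time
functions; requiredBits and requiredWords are hypothetical names, not
part of the existing code.

```cpp
#include <cstdint>

// Same formula as the KMER_REQUIRED_BITS macro, as a constexpr function.
constexpr int requiredBits(int maxKmerLength) {
    return 2 * maxKmerLength + 2;
}

// Number of 64-bit words needed to hold that many bits (rounded up).
constexpr int requiredWords(int maxKmerLength) {
    return (requiredBits(maxKmerLength) + 63) / 64;
}

static_assert(requiredBits(31) == 64, "31 bases exactly fill one 64-bit word");
static_assert(requiredWords(31) == 1, "31 bases fit in one word");
static_assert(requiredWords(32) == 2, "32 bases spill into a second word");
```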
When hashing, bits 1,2,3 should not be considered, because they could
change over the course of the search / assembly process.
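One way to do this is to clear bits 1-3 before the word goes into the
hash function. The sketch below assumes bit 0 is the least significant
bit and uses std::hash purely as a stand-in for whatever hash the
assembler actually uses; hashableBits and hashKmerWord are hypothetical
names.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Bits 1, 2 and 3 (first-base known flag and first-base field).
constexpr uint64_t HASH_IGNORED_BITS = 0xEULL;

// Clear the bits that may change during the search / assembly process.
inline uint64_t hashableBits(uint64_t w) {
    return w & ~HASH_IGNORED_BITS;
}

// Stand-in hash: any 64-bit hash could be applied to the masked word.
inline std::size_t hashKmerWord(uint64_t w) {
    return std::hash<uint64_t>{}(hashableBits(w));
}
```

With this, a kmer keeps its hash value when its first base is filled in
later.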
For storing/unpacking the kmer in base-space, just increment the
initial bit location by 1 (i.e. i=0 -> i=1).
For checking forward sequences, equality for two sequences in base-space
compares bits 2 onwards. Equality for two sequences in colour-space
compares bits 4 onwards and then, if those match, checks bit 1. If bit 1
is the same in both sequences, declare a mismatch if bits 2-3 differ,
otherwise a match. If bit 1 differs, declare a match, then copy the
first base over from the sequence that has bit 1 set.
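The equality rules above could be sketched roughly as follows for
single-word kmers. This assumes bit 0 is the least significant bit and
that unused high bits are zero; equalBaseSpace and equalColourSpace are
hypothetical names.

```cpp
#include <cstdint>

constexpr uint64_t FIRST_BASE_KNOWN = 1ULL << 1; // bit 1
constexpr uint64_t FIRST_BASE_BITS  = 0xEULL;    // bits 1-3: flag + first base

// Base-space: compare bits 2 onwards (first base plus remaining bases).
inline bool equalBaseSpace(uint64_t a, uint64_t b) {
    return (a >> 2) == (b >> 2);
}

// Colour-space: compare bits 4 onwards, then reconcile the first base.
// On a match where only one side knows its first base, the flag and the
// first base are copied to the other side (hence the references).
inline bool equalColourSpace(uint64_t& a, uint64_t& b) {
    if ((a >> 4) != (b >> 4)) {
        return false; // colour sequences differ
    }
    const bool aKnown = (a & FIRST_BASE_KNOWN) != 0;
    const bool bKnown = (b & FIRST_BASE_KNOWN) != 0;
    if (aKnown == bKnown) {
        // Same bit 1: match only if bits 2-3 agree.
        return ((a >> 2) & 3) == ((b >> 2) & 3);
    }
    if (aKnown) {
        b = (b & ~FIRST_BASE_BITS) | (a & FIRST_BASE_BITS);
    } else {
        a = (a & ~FIRST_BASE_BITS) | (b & FIRST_BASE_BITS);
    }
    return true;
}
```

Because the copy only touches bits 1-3, it does not disturb a hash that
ignores those bits.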
For comparing base-space against colour-space, first check whether the
colour-space sequence has a known first base, and report a mismatch if
that base is known and differs from the first base of the base-space
sequence.
Otherwise, the base-space sequence (including first base) needs to be
converted to colour-space, then compared. If you're doing that anyway,
it might make sense to store the converted sequence to make subsequent
comparisons a bit quicker (either as well as the original, or by
deleting the original base-space sequence). There could be a compare()
function that returns the converted (and matching) colour-space
sequence, otherwise returns an invalid packed sequence (e.g. 00...00, or
some other number with a checksum mismatch).
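A rough sketch of such a compare() for single-word kmers follows. It
relies on the usual 2-bit encoding (A/C/G/T = 00/01/10/11), under which
the colour between two adjacent bases is the XOR of their codes. The
names toColourSpace and compareBaseToColour, the explicit kmer length k
and the choice of 0 as the invalid value are assumptions for
illustration; bit 0 is taken to be the least significant bit.

```cpp
#include <cassert>
#include <cstdint>

// 00...00 is invalid: bit 0 == 0 (base-space) requires bit 1 == 1.
constexpr uint64_t INVALID_WORD = 0;

// Convert a base-space word holding k bases into a colour-space word
// with k-1 colours and a known first base.
inline uint64_t toColourSpace(uint64_t baseWord, int k) {
    assert((baseWord & 1) == 0);                       // must be base-space
    assert(k >= 1 && k <= 31);
    const uint64_t bases = baseWord >> 2;              // base j at bits 2j..2j+1
    const uint64_t mask = (1ULL << (2 * (k - 1))) - 1; // keep k-1 colours
    const uint64_t colours = (bases ^ (bases >> 2)) & mask;
    const uint64_t firstBase = bases & 3;
    return 1ULL | 2ULL | (firstBase << 2) | (colours << 4);
}

// Returns the converted colour-space word on a match, INVALID_WORD otherwise.
// Assumes unused high bits of colourWord are zero.
inline uint64_t compareBaseToColour(uint64_t baseWord, uint64_t colourWord, int k) {
    assert((colourWord & 1) == 1);                     // must be colour-space
    const bool known = (colourWord & 2) != 0;
    if (known && ((colourWord >> 2) & 3) != ((baseWord >> 2) & 3)) {
        return INVALID_WORD;                           // first bases disagree
    }
    const uint64_t converted = toColourSpace(baseWord, k);
    return ((converted >> 4) == (colourWord >> 4)) ? converted : INVALID_WORD;
}
```

For example, ACGT (k=4) converts to the colours 1, 3, 1 with first base
A; the caller can keep the returned word to skip the conversion next
time.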
Given that kmer.cpp is fairly small, I should be able to manage
implementing these changes somewhat quickly. However, if any other
classes assume a particular format for the Kmers (e.g. the Read class),
then those will need to be changed as well (ideally so that they no
longer expect a particular format). If a checksum is used, and it is
enforced that ^00 can't happen in a true structure, it should be
possible to identify if any Kmer-modifying classes have this assumption.
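As an illustration only, a validity check along these lines might look
like the sketch below. It assumes the checksum is the modulo-4 sum of
the 2-bit chunks from bit 4 onwards and is stored in bits 2-3 when the
first base is unknown; that detail is still undecided above, and
checksumOf and isValidKmerWord are hypothetical names.

```cpp
#include <cstdint>

// Modulo-4 sum of the thirty 2-bit chunks stored from bit 4 onwards.
// Assumption: bit 0 is the least significant bit.
inline unsigned checksumOf(uint64_t w) {
    unsigned sum = 0;
    for (int shift = 4; shift < 64; shift += 2) {
        sum += static_cast<unsigned>((w >> shift) & 3);
    }
    return sum % 4;
}

// Reject words that break the format rules:
//  - bits 0-1 both zero (base-space must have its first base known);
//  - colour-space with unknown first base and a wrong checksum in bits 2-3.
inline bool isValidKmerWord(uint64_t w) {
    const bool colourSpace    = (w & 1) != 0;
    const bool firstBaseKnown = (w & 2) != 0;
    if (!colourSpace && !firstBaseKnown) {
        return false;
    }
    if (colourSpace && !firstBaseKnown) {
        return ((w >> 2) & 3) == checksumOf(w);
    }
    return true;
}
```

Calling a check like this after each Kmer-modifying operation would
flag code that still writes the old format.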
Hope this helps,
- David Eccles (gringer)
_______________________________________________
Denovoassembler-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users