Re: [Biohaskell] Fwd: Trouble reading .phd files that have extra tags

Dan Fornika Thu, 02 Jun 2011 11:14:15 -0700

Aha!  I think I've got it! Look at how the 'break' function works; here
is a ghci session I just worked on (just shortened
"Data.Bytestring.Lazy.Char8" to "B" for brevity and clarity):


-- first create a list of bytestrings ("lobs") to work with:

let lobs = map B.pack $ ["HELLO", "My name is Monster Brains",
"BEGIN_DNA", "HELLO_AGAIN", "Don't be afraid"]

break (==B.pack "BEGIN_DNA") lobs

-- evaluates to:

([Chunk "HELLO" Empty,Chunk "My name is Monster Brains" Empty],[Chunk
"BEGIN_DNA" Empty,Chunk "HELLO_AGAIN" Empty,Chunk "Don't be afraid" Empty])

Note that the "BEGIN_DNA" bytestring is the first element of the second
list.  When this happens in the Bio.Sequence.Phd code, it creates a list
of bytestrings that is like:

(comment,sd) = ["BLAH BLAH COMMENT STUFF", "BEGIN_DNA", "a 20 100", "t
25 103", ...]

And since "BEGIN_DNA" has length == 1, takeWhile ((==3).length) stops there.

I fixed this by switching line 32  of Phd.hs from:

(comment,sd) = break (==B.pack "BEGIN_DNA") fs

to this:

(comment,bdna:sd) = break (==B.pack "BEGIN_DNA") fs

where "bdna" is just meant to capture that "BEGIN_DNA" bytestring.

When I compile, ghc complains that bdna is defined and not used, but
that is sort of what I intend for now.  Other than that, it works well
enough for my purposes.

I know this is a really minor fix, but would you mind taking me through
the normal process of submitting a patch?  I just want to learn how this
works. Do people normally just suggest things for you to change via
email, and then you integrate it yourself?

Dan

On 11-06-02 05:23 AM, Ketil Malde wrote:
> Dan Fornika <[email protected]> writes:
> 
>> BUT... When I run readPhd from ghci now, I get a IO (Sequence Nuc) with
>> the ID and Comment intact, but the sequence info empty.
> 
> Okay.
> 
>> I'm a bit surprised, because the type signatures of takeWhile and filter
>> are the same.  I've tried replacing (==3).length with something a little
>> more generous, like (/=0).length but there is no change in the output.
> 
> Hm, so could there be that there is an empty line immediately after
> BEGIN_DNA?  Perhaps the safe alternative would be something like
> 
>    (dna,rest) = break (==B.pack "BEGIN_DNA") sd
>    sdata = filter ((==3).length) . map B.words $ dna
> 
>> Would someone mind taking another look at the next couple of lines:
> 
> Not at all.
> 
>> qual = BB.fromChunks [BBB.pack . map (maybe err (fromIntegral . fst) . 
>> B.readInt . (!!1)) $ sdata]
> 
> So - this generates the quality values (which is a bytestring of Word8s)
> from reading (readInt) the second word (!!1) on each line, and applying
> fromIntegral to convert each of them to Word8.  Since readInt returns
> Maybe (number,rest_of_bytestring), 'maybe err' is used to either extract
> the number, or call 'err'.
> 
> It looks kind of noisy, but not too complicated, really.
> 
>> in  if more_magic then qual `seq` (Seq (compact $ B.unwords (label:fields) )
> 
> Checks the magic number, evaluates the quality values (don't want thunks
> pointing into the file to hang around), and builds a Seq structure where
> the header is the read label and additional fields from the PHD file
> 
>>                                   (compact $ B.concat $ map head sdata)
> 
> Concatenates the sequence itself, which comes as the (single letter)
> words that make up the first column (map head), compact makes sure this
> is stored as a contigous lazy bytestring, not a list of single-letter
> chunks. 
> 
>>                                   (Just qual))
> 
> And, yes, don't forget the quality values.
> 
>> (unmodified from the original source) and see if there is some reason
>> that takeWhile will cause the sequence info not to be passed in properly?
> 
> No, this stuff works as advertised, the problem is clipping out the
> region that contains the sequence and quality data (and, IIRC, exact
> position in the chromatogram?  Which we ignore anyway :-)
> 
> Thanks for working on fixing this - let me know if there's anything else
> I can do to help you along.
> 
> -k

_______________________________________________
Biohaskell mailing list
[email protected]
http://malde.org/cgi-bin/mailman/listinfo/biohaskell

Re: [Biohaskell] Fwd: Trouble reading .phd files that have extra tags

Reply via email to