Aha! I think I've got it! Look at how the 'break' function works; here
is a ghci session I just worked on (just shortened
"Data.Bytestring.Lazy.Char8" to "B" for brevity and clarity):
-- first create a list of bytestrings ("lobs") to work with:
let lobs = map B.pack $ ["HELLO", "My name is Monster Brains",
"BEGIN_DNA", "HELLO_AGAIN", "Don't be afraid"]
break (==B.pack "BEGIN_DNA") lobs
-- evaluates to:
([Chunk "HELLO" Empty,Chunk "My name is Monster Brains" Empty],[Chunk
"BEGIN_DNA" Empty,Chunk "HELLO_AGAIN" Empty,Chunk "Don't be afraid" Empty])
Note that the "BEGIN_DNA" bytestring is the first element of the second
list. When this happens in the Bio.Sequence.Phd code, it creates a list
of bytestrings that is like:
(comment,sd) = ["BLAH BLAH COMMENT STUFF", "BEGIN_DNA", "a 20 100", "t
25 103", ...]
And since "BEGIN_DNA" has length == 1, takeWhile ((==3).length) stops there.
I fixed this by switching line 32 of Phd.hs from:
(comment,sd) = break (==B.pack "BEGIN_DNA") fs
to this:
(comment,bdna:sd) = break (==B.pack "BEGIN_DNA") fs
where "bdna" is just meant to capture that "BEGIN_DNA" bytestring.
When I compile, ghc complains that bdna is defined and not used, but
that is sort of what I intend for now. Other than that, it works well
enough for my purposes.
I know this is a really minor fix, but would you mind taking me through
the normal process of submitting a patch? I just want to learn how this
works. Do people normally just suggest things for you to change via
email, and then you integrate it yourself?
Dan
On 11-06-02 05:23 AM, Ketil Malde wrote:
> Dan Fornika <[email protected]> writes:
>
>> BUT... When I run readPhd from ghci now, I get a IO (Sequence Nuc) with
>> the ID and Comment intact, but the sequence info empty.
>
> Okay.
>
>> I'm a bit surprised, because the type signatures of takeWhile and filter
>> are the same. I've tried replacing (==3).length with something a little
>> more generous, like (/=0).length but there is no change in the output.
>
> Hm, so could there be that there is an empty line immediately after
> BEGIN_DNA? Perhaps the safe alternative would be something like
>
> (dna,rest) = break (==B.pack "BEGIN_DNA") sd
> sdata = filter ((==3).length) . map B.words $ dna
>
>> Would someone mind taking another look at the next couple of lines:
>
> Not at all.
>
>> qual = BB.fromChunks [BBB.pack . map (maybe err (fromIntegral . fst) .
>> B.readInt . (!!1)) $ sdata]
>
> So - this generates the quality values (which is a bytestring of Word8s)
> from reading (readInt) the second word (!!1) on each line, and applying
> fromIntegral to convert each of them to Word8. Since readInt returns
> Maybe (number,rest_of_bytestring), 'maybe err' is used to either extract
> the number, or call 'err'.
>
> It looks kind of noisy, but not too complicated, really.
>
>> in if more_magic then qual `seq` (Seq (compact $ B.unwords (label:fields) )
>
> Checks the magic number, evaluates the quality values (don't want thunks
> pointing into the file to hang around), and builds a Seq structure where
> the header is the read label and additional fields from the PHD file
>
>> (compact $ B.concat $ map head sdata)
>
> Concatenates the sequence itself, which comes as the (single letter)
> words that make up the first column (map head), compact makes sure this
> is stored as a contigous lazy bytestring, not a list of single-letter
> chunks.
>
>> (Just qual))
>
> And, yes, don't forget the quality values.
>
>> (unmodified from the original source) and see if there is some reason
>> that takeWhile will cause the sequence info not to be passed in properly?
>
> No, this stuff works as advertised, the problem is clipping out the
> region that contains the sequence and quality data (and, IIRC, exact
> position in the chromatogram? Which we ignore anyway :-)
>
> Thanks for working on fixing this - let me know if there's anything else
> I can do to help you along.
>
> -k
_______________________________________________
Biohaskell mailing list
[email protected]
http://malde.org/cgi-bin/mailman/listinfo/biohaskell