In article <[EMAIL PROTECTED]>, "Chris Lasher" <[EMAIL PROTECTED]> wrote:
> Hello,
> I have a rather large (100+ MB) FASTA file from which I need to
> access records in a random order. The FASTA format is a standard format
> for storing molecular biological sequences. Each record contains a
> header line for describing the sequence that begins with a '>'
> (right-angle bracket) followed by lines that contain the actual
> sequence data. Three example FASTA records are below:
>
> >CW127_A01
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
> GCATTAAACAT
> >CW127_A02
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
> GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATAGACGG
> >CW127_A03
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
> GCATTAAACATTCCGCCTGGG
> ...
>
> Since the file I'm working with contains tens of thousands of these
> records, I believe I need to find a way to hash this file such that I
> can retrieve the respective sequence more quickly than I could by
> parsing through the file request-by-request.

First, before embarking on any major project, take a look at
http://www.biopython.org/ to at least familiarize yourself with what
other people have done in the field.

The easiest thing, I think, would be to use the gdbm module. You can
write a simple parser for the FASTA file (or, I would imagine, find one
already written in Biopython), and then store the data in a gdbm map,
using the tag lines as the keys and the sequences as the values.

Even for a Python neophyte, this should be a pretty simple project. The
most complex part might be getting the gdbm module built if your copy of
Python doesn't already have it, but gdbm is so convenient that it's
worth the effort.
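To make the suggestion concrete, here is a rough sketch of what I have in mind, not a polished implementation. It assumes Python 3, where the old gdbm module lives in the standard library as dbm.gnu; I use the generic dbm.open() so the code falls back to whichever dbm backend your build has. The function names (parse_fasta, build_index, fetch) are made up for illustration, and I am assuming the header lines are unique and the file is plain ASCII.

```python
import dbm


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


def build_index(fasta_path, db_path):
    """Store every record in a dbm file, keyed by its header line."""
    # Assumption: headers are unique; a repeated header overwrites the earlier one.
    with open(fasta_path) as fasta, dbm.open(db_path, "c") as db:
        for header, sequence in parse_fasta(fasta):
            db[header.encode("ascii")] = sequence.encode("ascii")


def fetch(db_path, header):
    """Look up one sequence by header without re-reading the FASTA file."""
    with dbm.open(db_path, "r") as db:
        return db[header.encode("ascii")].decode("ascii")
```

You would run build_index("seqs.fasta", "seqs.db") once (both file names are placeholders), and after that fetch("seqs.db", "CW127_A02") returns the sequence with its lines joined together. If you specifically want gdbm and your Python has it built, swapping dbm.open for dbm.gnu.open should be the only change needed.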