Re: What strategy for random accession of records in massive FASTA file?

2005-01-15 Thread Bulba!
On 14 Jan 2005 12:30:57 -0800, Paul Rubin [EMAIL PROTECTED] wrote: Mmap lets you treat a disk file as an array, so you can randomly access the bytes in the file without having to do seek operations. Cool! Just say a[234]='x' and you've changed byte 234 of the file to the letter x.
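
For anyone who wants to try the mmap idea from this post, here is a minimal sketch in current Python 3. The helper name `overwrite_byte` and the scratch file are made up for illustration; in Python 3 a single mmap element is assigned as an integer rather than a one-character string, so `a[234]='x'` becomes `m[234] = ord("x")`.

```python
import mmap
import os
import tempfile

def overwrite_byte(path, index, value):
    """Overwrite one byte of the file at `path` in place via mmap."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0) as m:
            m[index] = value  # an int in range(256) in Python 3
            m.flush()

# Usage on a scratch file (hypothetical content, not a real FASTA file).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"A" * 1000)
overwrite_byte(demo_path, 234, ord("x"))
with open(demo_path, "rb") as f:
    patched = f.read()
os.remove(demo_path)
```

The file size cannot change through this kind of assignment; only existing bytes are rewritten.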

Re: What strategy for random accession of records in massive FASTA file?

2005-01-14 Thread Neil Benn
Jeff Shannon wrote: Chris Lasher wrote: And besides, for long-term archiving purposes, I'd expect that zip et al on a character-stream would provide significantly better compression than a 4:1 packed format, and that zipping the packed format wouldn't be all that much more efficient than zipping

Re: What strategy for random accession of records in massive FASTA file?

2005-01-14 Thread Steve Holden
Bengt Richter wrote: On 12 Jan 2005 14:46:07 -0800, Chris Lasher [EMAIL PROTECTED] wrote: [...] Others have probably solved your basic problem, or pointed the way. I'm just curious. Given that the information content is 2 bits per character that is taking up 8 bits of storage, there must be a good

Re: What strategy for random accession of records in massive FASTA file?

2005-01-14 Thread Steve Holden
Jeff Shannon wrote: Chris Lasher wrote: And besides, for long-term archiving purposes, I'd expect that zip et al on a character-stream would provide significantly better compression than a 4:1 packed format, and that zipping the packed format wouldn't be all that much more efficient than zipping

Re: What strategy for random accession of records in massive FASTA file?

2005-01-14 Thread Roy Smith
In article [EMAIL PROTECTED], Chris Lasher [EMAIL PROTECTED] wrote: Hello, I have a rather large (100+ MB) FASTA file from which I need to access records in a random order. The FASTA format is a standard format for storing molecular biological sequences. Each record contains a header line

Re: What strategy for random accession of records in massive FASTA file?

2005-01-13 Thread Chris Lasher
Before you get too carried away, how often do you want to do this and how grunty is the box you will be running on? Oops, I should have specified this. The script will only need to be run once every three or four months, when the sequences are updated. I'll be running it on boxes that are

Re: What strategy for random accession of records in massive FASTA file?

2005-01-13 Thread Jeff Shannon
Chris Lasher wrote: Given that the information content is 2 bits per character that is taking up 8 bits of storage, there must be a good reason for storing and/or transmitting them this way? I.e., is it easy to think up a count-prefixed compressed format packing 4:1 in subsequent data bytes
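
To make the "count-prefixed, 4:1 packed" idea concrete, here is a rough sketch. It is only one possible layout, not a format anyone in the thread specified: it assumes a 4-byte big-endian base count followed by the bases packed two bits each, and it assumes sequences contain only A, C, G and T (no N, no RNA bases, no ambiguity codes). The names `pack` and `unpack` are hypothetical.

```python
import struct

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    """Pack an A/C/G/T string: 4-byte count, then 4 bases per byte."""
    out = bytearray(struct.pack(">I", len(seq)))
    for i in range(0, len(seq), 4):
        byte = 0
        # The first base of each group goes in the two highest bits.
        for j, base in enumerate(seq[i:i + 4]):
            byte |= CODE[base] << (6 - 2 * j)
        out.append(byte)
    return bytes(out)

def unpack(data):
    """Inverse of pack(); the count prefix trims padding in the last byte."""
    (count,) = struct.unpack(">I", data[:4])
    bases = []
    for byte in data[4:]:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:count])
```

The count prefix is what makes the final, partly filled byte unambiguous: without it, trailing zero bits would decode as extra A's.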

Re: What strategy for random accession of records in massive FASTA file?

2005-01-13 Thread Jeff Shannon
Chris Lasher wrote: And besides, for long-term archiving purposes, I'd expect that zip et al on a character-stream would provide significantly better compression than a 4:1 packed format, and that zipping the packed format wouldn't be all that much more efficient than zipping the character stream.

Re: What strategy for random accession of records in massive FASTA file?

2005-01-13 Thread Robert Kern
Jeff Shannon wrote: (Plus, if this format might be used for RNA sequences as well as DNA sequences, you've got at least a fifth base to represent, which means you need at least three bits per base, which means only two bases per byte (or else base-encodings split across byte-boundaries)

Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread Fredrik Lundh
Chris Lasher wrote: Since the file I'm working with contains tens of thousands of these records, I believe I need to find a way to hash this file such that I can retrieve the respective sequence more quickly than I could by parsing through the file request-by-request. However, I'm very new to

Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread James Stroud
Don't fight it, lite it! You should parse the fasta and put it into a database: http://www.sqlite.org/index.html Then index by name and it will be superfast. James
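
A minimal sketch of the "parse it and put it into a database" suggestion, using the `sqlite3` module from the Python standard library. The table name `seqs`, the column names, and the helper functions are assumptions for illustration, and the whole header line (minus the leading '>') is used as the record name.

```python
import sqlite3

def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA text lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def load_into_sqlite(conn, lines):
    """Store every record; the PRIMARY KEY gives an index on name."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seqs (name TEXT PRIMARY KEY, seq TEXT)"
    )
    with conn:  # commits on success
        conn.executemany(
            "INSERT OR REPLACE INTO seqs VALUES (?, ?)", parse_fasta(lines)
        )

def lookup(conn, name):
    """Return the sequence stored under name, or None if absent."""
    row = conn.execute(
        "SELECT seq FROM seqs WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None

# Usage with an in-memory database and made-up records.
conn = sqlite3.connect(":memory:")
load_into_sqlite(conn, [">seq1 demo\n", "ACGT\n", "TTGA\n", ">seq2\n", "GGCC\n"])
```

For a real file, pass the open file object as `lines` and connect to a file path instead of `:memory:` so the database survives between runs.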

What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread Chris Lasher
Hello, I have a rather large (100+ MB) FASTA file from which I need to access records in a random order. The FASTA format is a standard format for storing molecular biological sequences. Each record contains a header line for describing the sequence that begins with a '>' (right-angle bracket)

Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread John Lenton
If you could help me figure out how to code a solution that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to keep it in Python only, even though I know interaction with a relational database would provide the fastest method--the group I'm trying to write this for does not

Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread Larry Bates
You don't say how this will be used, but here goes: 1) Read the records and put into dictionary with key of sequence (from header) and data being the sequence data. Use shelve to store the dictionary for subsequent runs (if load time is excessive). 2) Take a look at Gadfly
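
Option 1 above could look roughly like the following sketch. The function names and the scratch directory are hypothetical, and the full header text after '>' is used as the dictionary key; note that the in-memory dictionary holds every sequence at once, so this trades memory for simplicity.

```python
import os
import shelve
import shutil
import tempfile

def fasta_to_dict(lines):
    """Build {header: sequence} from an iterable of FASTA text lines."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None and line:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}

def save_to_shelf(records, shelf_path):
    """Persist the dictionary so later runs can skip the parsing step."""
    with shelve.open(shelf_path) as db:
        db.update(records)

def load_from_shelf(shelf_path, name):
    """Fetch one sequence by header without loading the others."""
    with shelve.open(shelf_path) as db:
        return db[name]

# Usage with made-up records in a scratch directory.
demo_dir = tempfile.mkdtemp()
demo_shelf = os.path.join(demo_dir, "seqs")
save_to_shelf(fasta_to_dict([">seq1", "ACGT", "TT", ">seq2", "GGCC"]), demo_shelf)
fetched = load_from_shelf(demo_shelf, "seq2")
shutil.rmtree(demo_dir)
```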

Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread Terry Reedy
Batista, Facundo [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]: If you want to keep the memory usage low, you can parse the file once and store in a list the byte position where the record starts and ends. Then
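
The byte-position approach quoted here might be sketched as follows. This version keeps the offsets in a dict keyed by record ID rather than a plain list, which is a choice made for this example; it also assumes the ID is the first whitespace-delimited token of the header line. The function names and the demo file are hypothetical.

```python
import os
import tempfile

def build_offset_index(path):
    """Scan the file once; map each record ID to (start, end) byte offsets."""
    index = {}
    current_id, start, pos = None, 0, 0
    with open(path, "rb") as f:  # binary mode so offsets are true byte counts
        for line in f:
            if line.startswith(b">"):
                if current_id is not None:
                    index[current_id] = (start, pos)
                tokens = line[1:].split()
                current_id = tokens[0].decode("ascii") if tokens else ""
                start = pos
            pos += len(line)
    if current_id is not None:
        index[current_id] = (start, pos)
    return index

def fetch_record(path, index, record_id):
    """Seek straight to one record and read only its bytes."""
    start, end = index[record_id]
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).decode("ascii")

# Usage on a tiny made-up FASTA file.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b">seq1 first\nACGT\nTTGA\n>seq2 second\nGGCC\n")
demo_index = build_offset_index(demo_path)
demo_record = fetch_record(demo_path, demo_index, "seq2")
os.remove(demo_path)
```

Only the index lives in memory, so the cost scales with the number of records rather than the size of the file.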

Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread David E. Konerding DSD staff
In article [EMAIL PROTECTED], Chris Lasher wrote: Hello, I have a rather large (100+ MB) FASTA file from which I need to access records in a random order. The FASTA format is a standard format for storing molecular biological sequences. Each record contains a header line for describing the

Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread John Machin
Chris Lasher wrote: Hello, I have a rather large (100+ MB) FASTA file from which I need to access records in a random order. The FASTA format is a standard format for storing molecular biological sequences. Each record contains a header line for describing the sequence that begins with a '>'

Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread Bengt Richter
On 12 Jan 2005 14:46:07 -0800, Chris Lasher [EMAIL PROTECTED] wrote: Hello, I have a rather large (100+ MB) FASTA file from which I need to access records in a random order. The FASTA format is a standard format for storing molecular biological sequences. Each record contains a header line for