On 14 Jan 2005 12:30:57 -0800, Paul Rubin
http://[EMAIL PROTECTED] wrote:
Mmap lets you treat a disk file as an array, so you can randomly
access the bytes in the file without having to do seek operations
Cool!
Just say a[234]='x' and you've changed byte 234 of the file to the
letter x.
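For anyone following along, here is a minimal sketch of that idea in present-day Python 3 (the thread predates it). The helper name `set_byte` and the throwaway demo file are made up for illustration. Note that in Python 3 a single-index assignment on an mmap takes an integer, so the 2005-era `a[234]='x'` becomes `a[234] = ord('x')`.

```python
import mmap
import os
import tempfile


def set_byte(path, index, value):
    """Overwrite one byte of the file at `path` in place via mmap."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0) as m:
            m[index] = value  # an int in range 0..255
            m.flush()


# Demo on a throwaway file of 1000 'a' bytes (hypothetical data).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"a" * 1000)

set_byte(demo_path, 234, ord("x"))

with open(demo_path, "rb") as f:
    demo_data = f.read()
os.remove(demo_path)
```

Only byte 234 changes; the file length stays the same, since an mmap cannot grow the file by assignment.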
Jeff Shannon wrote:
Chris Lasher wrote:
And besides, for long-term archiving purposes, I'd expect that zip et
al on a character-stream would provide significantly better
compression than a 4:1 packed format, and that zipping the packed
format wouldn't be all that much more efficient than zipping
Bengt Richter wrote:
On 12 Jan 2005 14:46:07 -0800, Chris Lasher [EMAIL PROTECTED] wrote:
[...]
Others have probably solved your basic problem, or pointed
the way. I'm just curious.
Given that the information content is 2 bits per character
that is taking up 8 bits of storage, there must be a good
In article [EMAIL PROTECTED],
Chris Lasher [EMAIL PROTECTED] wrote:
Hello,
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order. The FASTA format is a standard format
for storing molecular biological sequences. Each record contains a
header line
Before you get too carried away, how often do you want to do this and
how grunty is the box you will be running on?
Oops, I should have specified this. The script will only need to be run
once every three or four months, when the sequences are updated. I'll
be running it on boxes that are
Chris Lasher wrote:
Given that the information content is 2 bits per character
that is taking up 8 bits of storage, there must be a good reason
for storing and/or transmitting them this way? I.e., is it easy
to think up a count-prefixed compressed format packing 4:1 in
subsequent data bytes
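To make the 4:1 idea concrete, here is one possible sketch of a count-prefixed packed format. Everything about the layout is an assumption for illustration: a 4-byte big-endian base count, then four bases per byte with A, C, G, T mapped to 0..3. It only handles those four letters, so ambiguity codes such as N would need a different scheme.

```python
import struct

# Assumed 2-bit code for each base (illustrative choice, not a standard).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack an ACGT string as: 4-byte count, then 4 bases per byte."""
    out = bytearray(struct.pack(">I", len(seq)))
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a final partial chunk so unpacking is uniform.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data):
    """Inverse of pack(); the count prefix trims the padding bits."""
    (count,) = struct.unpack(">I", data[:4])
    bases = []
    for byte in data[4:]:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:count])
```

The count prefix is what lets the reader tell real bases from padding in the last byte.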
Chris Lasher wrote:
And besides, for long-term archiving purposes, I'd expect that zip et
al on a character-stream would provide significantly better
compression than a 4:1 packed format, and that zipping the packed
format wouldn't be all that much more efficient than zipping the
character stream.
Jeff Shannon wrote:
(Plus, if this format might be used for RNA sequences as well as DNA
sequences, you've got at least a fifth base to represent, which means
you need at least three bits per base, which means only two bases per
byte (or else base-encodings split across byte-boundaries)
Chris Lasher wrote:
Since the file I'm working with contains tens of thousands of these
records, I believe I need to find a way to hash this file such that I
can retrieve the respective sequence more quickly than I could by
parsing through the file request-by-request. However, I'm very new to
Don't fight it, lite it!
You should parse the fasta and put it into a database:
http://www.sqlite.org/index.html
Then index by name and it will be superfast.
James
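A rough sketch of what that could look like with the `sqlite3` module from today's standard library (it was not bundled with Python when this thread was written). The table name `seqs`, the helper names, and the choice of the first word of the header as the record name are all assumptions for illustration.

```python
import sqlite3


def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            # Assumption: the record name is the first word of the header.
            fields = line[1:].split()
            name, parts = (fields[0] if fields else ""), []
        elif line and name is not None:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def build_db(lines, db_path=":memory:"):
    """Load all records into a table keyed (and so indexed) by name."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seqs (name TEXT PRIMARY KEY, seq TEXT)"
    )
    with conn:  # commits on success
        conn.executemany(
            "INSERT OR REPLACE INTO seqs VALUES (?, ?)", parse_fasta(lines)
        )
    return conn


def fetch(conn, name):
    """Return the sequence stored under `name`, or None if absent."""
    row = conn.execute(
        "SELECT seq FROM seqs WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


# Tiny made-up input standing in for the real 100+ MB file.
demo = [">seq1 first record\n", "ACGT\n", "TTGA\n", ">seq2\n", "GGCC\n"]
conn = build_db(demo)
```

For a real file you would pass an open file object instead of `demo` and a filename instead of `":memory:"`, so the database survives between runs.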
--
http://mail.python.org/mailman/listinfo/python-list
Hello,
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order. The FASTA format is a standard format
for storing molecular biological sequences. Each record contains a
header line for describing the sequence that begins with a '>'
(right-angle bracket)
If you could help me figure out how to code a solution
that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
keep it in Python only, even though I know interaction with a
relational database would provide the fastest method--the group I'm
trying to write this for does not
You don't say how this will be used, but here goes:
1) Read the records and put into dictionary with key
of sequence (from header) and data being the sequence
data. Use shelve to store the dictionary for subsequent
runs (if load time is excessive).
2) Take a look at Gadfly
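Option 1 could be sketched roughly as follows, in present-day Python 3. The helper name `read_records`, the sample data, and the temporary shelf location are invented for the example; a real script would keep the shelf file somewhere permanent so later runs can skip the parse.

```python
import os
import shelve
import tempfile


def read_records(lines):
    """Build {name: sequence} from an iterable of FASTA lines."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Assumption: the key is the first word of the header line.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif line and name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


# Tiny made-up input standing in for the real file.
demo = [">seq1 first record\n", "ACGT\n", "TTGA\n", ">seq2\n", "GGCC\n"]

with tempfile.TemporaryDirectory() as tmp:
    shelf_path = os.path.join(tmp, "seqs")

    # First run: parse once and store the dictionary on disk.
    with shelve.open(shelf_path) as db:
        for key, seq in read_records(demo).items():
            db[key] = seq

    # Later runs: reopen the shelf and look records up by name.
    with shelve.open(shelf_path, flag="r") as db:
        demo_seq = db["seq2"]
        demo_keys = sorted(db.keys())
```

The plain dictionary holds every sequence in memory while it is being built, which may or may not be acceptable for a 100+ MB file; the shelf itself reads entries from disk on demand.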
RE: What strategy for random accession of records in massive FASTA file?
Batista, Facundo [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
If you want to keep the memory usage low, you can parse the file once and
store in a list the byte position where the record starts and ends. Then
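A rough sketch of the byte-position idea, written for present-day Python 3. It uses a dictionary keyed by record name rather than a list, which is a substitution on my part, and the function names and sample file are made up. The file is opened in binary mode so that the offsets are true byte positions.

```python
import os
import tempfile


def build_index(path):
    """Map record name -> (start, end) byte offsets of the whole record."""
    index = {}
    name, start, pos = None, 0, 0
    with open(path, "rb") as f:
        for line in f:
            if line.startswith(b">"):
                if name is not None:
                    index[name] = (start, pos)
                # Assumption: the name is the first word of the header.
                fields = line[1:].split()
                name = fields[0].decode("ascii") if fields else ""
                start = pos
            pos += len(line)
    if name is not None:
        index[name] = (start, pos)
    return index


def fetch(path, index, name):
    """Seek straight to one record and return its sequence."""
    start, end = index[name]
    with open(path, "rb") as f:
        f.seek(start)
        record = f.read(end - start).decode("ascii")
    _header, _, body = record.partition("\n")
    return "".join(body.split())  # drop line breaks inside the sequence


# Demo on a tiny throwaway file (hypothetical data).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b">seq1 first\nACGT\nTTGA\n>seq2\nGGCC\n")

demo_index = build_index(demo_path)
demo_seq1 = fetch(demo_path, demo_index, "seq1")
demo_seq2 = fetch(demo_path, demo_index, "seq2")
os.remove(demo_path)
```

Only the names and two integers per record stay in memory, so the index for tens of thousands of records is small compared with the file itself.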
In article [EMAIL PROTECTED], Chris Lasher wrote:
Hello,
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order. The FASTA format is a standard format
for storing molecular biological sequences. Each record contains a
header line for describing the
Chris Lasher wrote:
Hello,
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order. The FASTA format is a standard format
for storing molecular biological sequences. Each record contains a
header line for describing the sequence that begins with a '>'
On 12 Jan 2005 14:46:07 -0800, Chris Lasher [EMAIL PROTECTED] wrote:
Hello,
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order. The FASTA format is a standard format
for storing molecular biological sequences. Each record contains a
header line for