Re: What strategy for random accession of records in massive FASTA file?

2005-01-15 Thread Bengt Richter
On Sat, 15 Jan 2005 15:24:56 -0500, Steve Holden <[EMAIL PROTECTED]> wrote: >Bulba! wrote: > >> On 14 Jan 2005 12:30:57 -0800, Paul Rubin >> wrote: >> >> >>>Mmap lets you treat a disk file as an array, so you can randomly >>>access the bytes in the file without having

Re: What strategy for random accession of records in massive FASTA file?

2005-01-15 Thread Chris Lasher
Roy, thank you for your reply. I have BioPython installed on my box at work and have been browsing through the code in there, some of which I can follow, though most of it will take me more time and experience to understand. I have considered BioPython and databases, and have chosen to forgo those

Re: What strategy for random accession of records in massive FASTA file?

2005-01-15 Thread Steve Holden
Bulba! wrote: On 14 Jan 2005 12:30:57 -0800, Paul Rubin wrote: Mmap lets you treat a disk file as an array, so you can randomly access the bytes in the file without having to do seek operations Cool! Just say a[234]='x' and you've changed byte 234 of the file to the le

Re: What strategy for random accession of records in massive FASTA file?

2005-01-15 Thread Michael Hoffman
Chris Lasher wrote: I have a rather large (100+ MB) FASTA file from which I need to access records in a random order. I just came across this thread today and I don't understand why you are trying to reinvent the wheel instead of using Biopython which already has a solution to this problem, among o

Re: What strategy for random accession of records in massive FASTA file?

2005-01-15 Thread Bulba!
On 14 Jan 2005 12:30:57 -0800, Paul Rubin wrote: >Mmap lets you treat a disk file as an array, so you can randomly >access the bytes in the file without having to do seek operations Cool! >Just say a[234]='x' and you've changed byte 234 of the file to the >letter x.
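
A minimal sketch of the mmap idea quoted above, assuming Python 3 and a small scratch file standing in for the real FASTA file (the file contents and variable names are illustrative, not from the thread). Note that in Python 3 a single mapped byte is read and written as an integer, so the quoted a[234]='x' becomes an assignment of ord("x").

```python
import mmap
import os
import tempfile

# Hypothetical scratch file standing in for a large FASTA file.
fd, path = tempfile.mkstemp(suffix=".fasta")
with os.fdopen(fd, "wb") as handle:
    handle.write(b">seq1 demo record\n" + b"ACGT" * 100 + b"\n")

with open(path, "r+b") as handle:
    # Length 0 maps the whole file; the OS pages bytes in on demand.
    mapped = mmap.mmap(handle.fileno(), 0)
    original = mapped[234]     # read byte 234 with no explicit seek()
    mapped[234] = ord("x")     # Python 3 expects an int here, not 'x'
    mapped.flush()             # push the change back to the file
    mapped.close()

# Re-read the file normally to confirm the change reached the disk copy.
with open(path, "rb") as handle:
    data = handle.read()
os.remove(path)
```

The mapping behaves like a mutable byte array, so slicing (for example mapped[18:22]) also works without seek calls.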

Re: What strategy for random accession of records in massive FASTA file?

2005-01-14 Thread Roy Smith
In article <[EMAIL PROTECTED]>, "Chris Lasher" <[EMAIL PROTECTED]> wrote: > Hello, > I have a rather large (100+ MB) FASTA file from which I need to > access records in a random order. The FASTA format is a standard format > for storing molecular biological sequences. Each record contains a > hea

Re: What strategy for random accession of records in massive FASTA file?

2005-01-14 Thread Steve Holden
Jeff Shannon wrote: Chris Lasher wrote: And besides, for long-term archiving purposes, I'd expect that zip et al on a character-stream would provide significantly better compression than a 4:1 packed format, and that zipping the packed format wouldn't be all that much more efficient than zipping th

Re: What strategy for random accession of records in massive FASTA file?

2005-01-14 Thread Steve Holden
Bengt Richter wrote: On 12 Jan 2005 14:46:07 -0800, "Chris Lasher" <[EMAIL PROTECTED]> wrote: [...] Others have probably solved your basic problem, or pointed the way. I'm just curious. Given that the information content is 2 bits per character that is taking up 8 bits of storage, there must be a g

Re: What strategy for random accession of records in massive FASTA file?

2005-01-14 Thread Paul Rubin
"Chris Lasher" <[EMAIL PROTECTED]> writes: > Forgive my ignorance, but what does using mmap do for the script? My > guess is that it improves performance, but I'm not sure how. I read the > module documentation and the module appears to be a way to read out > information from memory (RAM maybe?).

Re: What strategy for random accession of records in massive FASTA file?

2005-01-14 Thread Chris Lasher
Forgive my ignorance, but what does using mmap do for the script? My guess is that it improves performance, but I'm not sure how. I read the module documentation and the module appears to be a way to read out information from memory (RAM maybe?). -- http://mail.python.org/mailman/listinfo/python-

Re: What strategy for random accession of records in massive FASTA file?

2005-01-14 Thread Michael Maibaum
On Thu, Jan 13, 2005 at 04:41:45PM -0800, Robert Kern wrote: Jeff Shannon wrote: (Plus, if this format might be used for RNA sequences as well as DNA sequences, you've got at least a fifth base to represent, which means you need at least three bits per base, which means only two bases per byte (

Re: What strategy for random accession of records in massive FASTA file?

2005-01-14 Thread Neil Benn
Jeff Shannon wrote: Chris Lasher wrote: And besides, for long-term archiving purposes, I'd expect that zip et al on a character-stream would provide significantly better compression than a 4:1 packed format, and that zipping the packed format wouldn't be all that much more efficient than zipping th

Re: What strategy for random accession of records in massive FASTA file?

2005-01-13 Thread John Lenton
On Thu, Jan 13, 2005 at 12:19:49AM +0100, Fredrik Lundh wrote: > Chris Lasher wrote: > > > Since the file I'm working with contains tens of thousands of these > > records, I believe I need to find a way to hash this file such that I > > can retrieve the respective sequence more quickly than I coul

Re: What strategy for random accession of records in massive FASTA file?

2005-01-13 Thread Robert Kern
Jeff Shannon wrote: (Plus, if this format might be used for RNA sequences as well as DNA sequences, you've got at least a fifth base to represent, which means you need at least three bits per base, which means only two bases per byte (or else base-encodings split across byte-boundaries) That

Re: What strategy for random accession of records in massive FASTA file?

2005-01-13 Thread Jeff Shannon
Chris Lasher wrote: And besides, for long-term archiving purposes, I'd expect that zip et al on a character-stream would provide significantly better compression than a 4:1 packed format, and that zipping the packed format wouldn't be all that much more efficient than zipping the character stream.

Re: What strategy for random accession of records in massive FASTA file?

2005-01-13 Thread Chris Lasher
>And besides, for long-term archiving purposes, I'd expect that zip et >al on a character-stream would provide significantly better >compression than a 4:1 packed format, and that zipping the packed >format wouldn't be all that much more efficient than zipping the >character stream. This 105MB FAS

Re: What strategy for random accession of records in massive FASTA file?

2005-01-13 Thread Jeff Shannon
Chris Lasher wrote: Given that the information content is 2 bits per character that is taking up 8 bits of storage, there must be a good reason for storing and/or transmitting them this way? I.e., it is easy to think up a count-prefixed compressed format packing 4:1 in subsequent data bytes (except
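
To make the "count-prefixed, 4:1 packed" idea concrete, here is a rough sketch. The layout (a 4-byte big-endian base count followed by 2-bit codes, four bases per byte) and the names pack, unpack, CODES and BASES are assumptions made for illustration; the thread does not specify a format. It handles only A, C, G and T, so any other symbol raises KeyError, which is one practical reason such a format is awkward for real data.

```python
# Assumed 2-bit code for each base; anything else raises KeyError.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(sequence):
    """Return a 4-byte base count followed by the bases packed four per byte."""
    out = bytearray(len(sequence).to_bytes(4, "big"))
    for start in range(0, len(sequence), 4):
        group = sequence[start:start + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final group so unpacking is uniform.
        byte <<= 2 * (4 - len(group))
        out.append(byte)
    return bytes(out)

def unpack(blob):
    """Invert pack(): read the count, then two bits per base."""
    count = int.from_bytes(blob[:4], "big")
    bases = []
    for byte in blob[4:]:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    # The count prefix tells us how much of the last byte is padding.
    return "".join(bases[:count])

packed = pack("ACGTAC")
restored = unpack(packed)
```

The count prefix is what makes the final, partly filled byte unambiguous: without it, padding bits would decode as extra A bases.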

Re: What strategy for random accession of records in massive FASTA file?

2005-01-13 Thread Chris Lasher
>Others have probably solved your basic problem, or pointed >the way. I'm just curious. >Given that the information content is 2 bits per character >that is taking up 8 bits of storage, there must be a good reason >for storing and/or transmitting them this way? I.e., it is easy >to think up a coun

Re: What strategy for random accession of records in massive FASTA file?

2005-01-13 Thread Chris Lasher
Thanks for your reply, Larry. I thought about this, but I'm worried the dictionary will consume a lot of resources. I think my 3GHz/1GB RAM box could handle the load fine, but I'm not sure about others' systems. Chris

Re: What strategy for random accession of records in massive FASTA file?

2005-01-13 Thread Chris Lasher
>Before you get too carried away, how often do you want to do this and >how grunty is the box you will be running on? Oops, I should have specified this. The script will only need to be run once every three or four months, when the sequences are updated. I'll be running it on boxes that are 3GHz/1

Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread Bengt Richter
On 12 Jan 2005 14:46:07 -0800, "Chris Lasher" <[EMAIL PROTECTED]> wrote: >Hello, >I have a rather large (100+ MB) FASTA file from which I need to >access records in a random order. The FASTA format is a standard format >for storing molecular biological sequences. Each record contains a >header lin

Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread John Machin
Chris Lasher wrote: > Hello, > I have a rather large (100+ MB) FASTA file from which I need to > access records in a random order. The FASTA format is a standard format > for storing molecular biological sequences. Each record contains a > header line for describing the sequence that begins with a

Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread David E. Konerding DSD staff
In article <[EMAIL PROTECTED]>, Chris Lasher wrote: > Hello, > I have a rather large (100+ MB) FASTA file from which I need to > access records in a random order. The FASTA format is a standard format > for storing molecular biological sequences. Each record contains a > header line for describing

Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread Terry Reedy
"Batista, Facundo" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] [If you want to keep the memory usage low, you can parse the file once and store in a list the byte position where the record
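
The byte-position approach quoted above might look roughly like this. It is a sketch under assumptions: the record ID is taken to be the first whitespace-delimited token after '>', the offsets go in a dict keyed by ID instead of the plain list the quote mentions, and the function names build_offset_index and fetch_record are hypothetical. An in-memory io.BytesIO stands in for a file opened in binary mode.

```python
import io

def build_offset_index(handle):
    """Scan once and map each record ID to the byte offset of its header."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:
                # Assumption: the ID is the first token of the header line.
                index[fields[0].decode("ascii")] = offset
    return index

def fetch_record(handle, index, record_id):
    """Seek straight to one record and return (header, sequence)."""
    handle.seek(index[record_id])
    header = handle.readline().rstrip(b"\r\n").decode("ascii")
    chunks = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # reached the next record
        chunks.append(line.strip())
    return header, b"".join(chunks).decode("ascii")

demo = io.BytesIO(b">seq1 first\nACGT\nTTGA\n>seq2 second\nGGCC\n")
index = build_offset_index(demo)
record = fetch_record(demo, index, "seq2")
```

Only the IDs and integer offsets stay in memory, so the index for tens of thousands of records remains small even when the sequences themselves are large.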

Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread Larry Bates
You don't say how this will be used, but here goes: 1) Read the records and put into dictionary with key of sequence (from header) and data being the sequence data. Use shelve to store the dictionary for subsequent runs (if load time is excessive). 2) Take a look at Gadfly (gadfly.sourceforge.net)
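
Option 1 above could be sketched as follows, assuming Python 3. The parse_fasta helper, the demo records, and the shelf file name are all hypothetical; the record ID is assumed to be the first token of the header line.

```python
import os
import shelve
import shutil
import tempfile

def parse_fasta(lines):
    """Yield (record_id, sequence) pairs from an iterable of text lines."""
    record_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # Assumption: the ID is the first token after '>'.
            record_id, chunks = (line[1:].split() or [""])[0], []
        elif line and record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)

demo_lines = [">seq1 first demo record", "ACGT", "TTGA",
              ">seq2 second demo record", "GGCC"]

workdir = tempfile.mkdtemp()
shelf_path = os.path.join(workdir, "fasta_shelf")  # hypothetical file name
try:
    # First run: parse once and persist the dictionary-like shelf.
    with shelve.open(shelf_path) as shelf:
        for record_id, sequence in parse_fasta(demo_lines):
            shelf[record_id] = sequence
    # Later runs: reopen the shelf and look records up by key.
    with shelve.open(shelf_path) as shelf:
        looked_up = shelf["seq1"]
        stored_ids = sorted(shelf.keys())
finally:
    shutil.rmtree(workdir)
```

Because the shelf lives on disk, later runs skip the parse and the sequences are never all held in memory at once, which speaks to the resource concern raised elsewhere in the thread.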

Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread John Lenton
> If you could help me figure out how to code a solution > that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to > keep it in Python only, even though I know interaction with a > relational database would provide the fastest method--the group I'm > trying to write this for does not

What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread Chris Lasher
Hello, I have a rather large (100+ MB) FASTA file from which I need to access records in a random order. The FASTA format is a standard format for storing molecular biological sequences. Each record contains a header line for describing the sequence that begins with a '>' (right-angle bracket) foll

Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread James Stroud
Don't fight it, lite it! You should parse the fasta and put it into a database: http://www.sqlite.org/index.html Then index by name and it will be superfast. James
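
For illustration, the SQLite suggestion could be sketched with the sqlite3 module that ships with current Python versions. The table name, column names, and sample rows are assumptions; an in-memory database is used here so the example leaves no file behind, whereas a real script would pass a file path to connect().

```python
import sqlite3

# ":memory:" keeps the demo self-contained; use a file path to persist.
conn = sqlite3.connect(":memory:")

# PRIMARY KEY on name gives an index, so lookups by name avoid a full scan.
conn.execute(
    "CREATE TABLE records (name TEXT PRIMARY KEY, header TEXT, sequence TEXT)"
)

# Hypothetical rows; in practice these would come from parsing the FASTA file.
rows = [
    ("seq1", ">seq1 first demo record", "ACGTTTGA"),
    ("seq2", ">seq2 second demo record", "GGCC"),
]
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
conn.commit()

row = conn.execute(
    "SELECT sequence FROM records WHERE name = ?", ("seq2",)
).fetchone()
missing = conn.execute(
    "SELECT sequence FROM records WHERE name = ?", ("nope",)
).fetchone()
conn.close()
```

The database is a single ordinary file with no server process, so this stays within the "Python only" constraint mentioned in the original question.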

Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread Fredrik Lundh
Chris Lasher wrote: > Since the file I'm working with contains tens of thousands of these > records, I believe I need to find a way to hash this file such that I > can retrieve the respective sequence more quickly than I could by > parsing through the file request-by-request. However, I'm very new