Hello Randall,

It's been a while. I could dig it up, but it would be embarrassing (although the program is still being used). My scripting is like Johnny Cash's guitar playing was: primitive.

Here's the gist of it, although I'm sure (actually certain) that there's nothing here that people on the list don't already know. But let me know if I can clarify anything.

Regards,

Gregory Lypny

Associate Professor of Finance
John Molson School of Business
Concordia University
Montreal, Canada

- The raw sequence data is available in text files, which are 500 MB plus. Later projects involved files of more than a GB. These are what I call flat-file databases in that each record can have a varying number of sub-records (none to thousands) pertaining to other variables. Records follow each other in a long stream and are delimited by characters such as ">>". Variables within records are labelled at the beginning of each line in capitals followed by a colon. After that, any line may contain a further breakdown, which can be delimited any number of ways. Basically a mess, and the files aren't very useful in themselves for aggregating or doing batch searches where there would be many hits. I think they were originally set up (by the NIH? I forget.) to return results for single web-based queries, kind of like finding a card in HyperCard. I don't think it was foreseen that some researchers might want to submit a batch of queries (called probe sets) to find whether they had been sequenced.
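
To make that concrete, a record looks roughly like this. The labels and values below are invented for illustration; only the capitals-and-colon layout and the ">>" separator are as described above.

  ID: NM_000000
  DEFINITION: hypothetical example sequence
  FEATURES: exon 1..120; gene="made-up"
  SEQUENCE: atggcgtacctgaagt...
  >>
  ID: ...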

- Having figured out the record delimiter, it was a question of reading in the file a bit at a time to index the information about each variable, that is, break it down and store it in separate files for searching and extracting later. I experimented with reading in as many complete records as possible (as opposed to lines or characters) so as never to mistakenly cut off or lose part of a record, subject to the constraint that the amount read into what I knew would be the biggest MetaCard variables (what I called indexes) did not exceed a certain size in MB. MetaCard would slow down dramatically after a certain point and, of course, go as slow as molasses if virtual memory was called upon. I don't have the stats handy, but trial and error in setting the reading criteria really pays off.
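
Off the top of my head, the read loop was something along these lines. This is only a sketch, not the original script: the handler names, the buffer cap, and the assumption that ">>" ends every record are stand-ins.

  constant kMaxBuffer = 20000000  -- characters; the real cap came from trial and error

  on indexFile pPath
    local tBuffer, tStatus
    open file pPath for read
    repeat
      read from file pPath until ">>"   -- one complete record, delimiter included
      put the result into tStatus       -- "eof" once the file is exhausted
      put it after tBuffer
      if tStatus is "eof" or length(tBuffer) > kMaxBuffer then
        indexBuffer tBuffer             -- break the records down (see the next point)
        put empty into tBuffer
      end if
      if tStatus is "eof" then exit repeat
    end repeat
    close file pPath
  end indexFile

Reading record by record keeps the buffer to whole records, and the cap keeps the variable from growing to the point where MetaCard bogs down.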

- The MetaCard variables that were used as indexes to store the various record variables were always simple tab-delimited lists, or arrays that would later be converted to lists. These were always created using repeat-for-each-line loops because this type of loop is very fast. Another thing that increased the speed of indexing was to dump the contents of the indexes to text files intermittently to free up memory. Again, trial and error was necessary (for me, anyway) to determine the optimal number of times to write to disk: writing takes time, so you're balancing write time against variable size. On an ancient Mac, the one-time indexing process could take about 40 minutes.
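
The indexing pass was along these lines. Again a sketch: the flush interval, the single output file, and the label/value split are simplifications; the real script stored each variable in its own index file.

  constant kFlushEvery = 50000   -- lines between writes; tuned by trial and error

  on indexBuffer pBuffer
    local tIndex, tCount, tLabel, tValue
    put 0 into tCount
    set the itemDelimiter to ":"
    repeat for each line tLine in pBuffer       -- repeat for each is the fast form
      if ":" is not in tLine then next repeat   -- skip delimiters and stray lines
      put item 1 of tLine into tLabel           -- the capitalized variable name
      put item 2 to -1 of tLine into tValue     -- everything after the first colon
      put tLabel & tab & tValue & cr after tIndex
      add 1 to tCount
      if tCount mod kFlushEvery = 0 then
        appendToFile tIndex, "index.txt"        -- dump to disk to keep the variable small
        put empty into tIndex
      end if
    end repeat
    appendToFile tIndex, "index.txt"            -- flush whatever is left
  end indexBuffer

  on appendToFile pData, pPath
    open file pPath for append
    write pData to file pPath
    close file pPath
  end appendToFile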

- The next step was to build a simple interface so that searches could be run and the hits extracted. The user imports a list of search terms (such as probe sets); the appropriate index files are read in and their lines matched against the submitted queries. Output files that can be slapped into a spreadsheet are created, and stats are reported. A batch query of about 500 probe sets across the 100,000-plus DNA sequences used to take about three to seven minutes on an old blue iMac (I forget what those were called).
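
The matching step boiled down to something like this. Sketch again: lineOffset only reports the first hit per query, the single index file is a simplification, and the stats reporting is left out.

  function findProbes pQueryList, pIndexPath
    local tIndex, tHits, tLineNum
    put URL ("file:" & pIndexPath) into tIndex      -- read the whole index file in
    repeat for each line tQuery in pQueryList
      put lineOffset(tQuery, tIndex) into tLineNum
      if tLineNum > 0 then
        put tQuery & tab & (line tLineNum of tIndex) & cr after tHits
      else
        put tQuery & tab & "not found" & cr after tHits
      end if
    end repeat
    return tHits    -- tab-delimited, ready to drop into a spreadsheet
  end findProbes

A batch run then amounts to something like put findProbes(URL "file:probes.txt", "locus_index.txt") into URL "file:hits.txt" (the file names here are made up).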





On Tue, Jan 15, 2008, at 10:47 AM, [EMAIL PROTECTED] wrote:

Gregory, do you have a more detailed study of the architecture of your DNA data solution that you would be willing to share... How physically you're storing and manipulating and reporting your data?

randall
