Hello Randall,

It's been a while. I could dig it up, but it would be embarrassing (although the program is still being used). My scripting is like Johnny Cash's guitar playing was: primitive.

Here's the gist of it, although I'm sure (actually certain) that there's nothing here that people on the list don't already know. But let me know if I can clarify anything.

Regards,

Gregory Lypny

Associate Professor of Finance
John Molson School of Business
Concordia University
Montreal, Canada

- The raw sequence data is available in text files, which are 500 MB plus. Later projects involved files of more than a GB. These are what I call flat-file databases in that each record can have a varying number of sub-records (none to thousands) pertaining to other variables. Records follow each other in a long stream and are delimited by characters such as ">>". Variables within records are labelled at the beginning of each line in capitals followed by a colon. After that, any line may contain a further breakdown, which can be delimited any number of ways. Basically a mess, and the files aren't very useful in themselves for aggregating or doing batch searches where there would be many hits. I think they were originally set up (by the NIH? I forget.) to return results for single web-based queries, kind of like finding a card in HyperCard. I don't think it was foreseen that some researchers might want to submit a batch of queries (called probe sets) to find whether they had been sequenced.
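
To make that concrete, a record looks roughly like this. The labels and values below are invented for illustration; only the capitals-and-colon layout and the ">>" separator are as described above.

  ID: NM_000000
  DEFINITION: hypothetical example sequence
  FEATURES: exon 1..120; gene="made-up"
  SEQUENCE: atggcgtacctgaagt...
  >>
  ID: ...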

- Having figured out the record delimiter, it was a question of reading in the file a bit at a time to index the information about each variable, that is, break it down and store it in separate files for searching and extracting later. I experimented with reading in as many complete records as possible (as opposed to lines or characters) so as never to mistakenly cut off or lose part of a record, subject to the constraint that the amount read into what I knew would be the biggest MetaCard variables (what I called indexes) did not exceed a certain size in MB. MetaCard would slow down dramatically after a certain point and, of course, go as slow as molasses if virtual memory was called upon. I don't have the stats handy, but trial and error in setting the reading criteria really pays off.
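
Off the top of my head, the read loop was something along these lines. This is only a sketch, not the original script: the handler names, the buffer cap, and the assumption that ">>" ends every record are stand-ins.

  constant kMaxBuffer = 20000000  -- characters; the real cap came from trial and error

  on indexFile pPath
    local tBuffer, tStatus
    open file pPath for read
    repeat
      read from file pPath until ">>"   -- one complete record, delimiter included
      put the result into tStatus       -- "eof" once the file is exhausted
      put it after tBuffer
      if tStatus is "eof" or length(tBuffer) > kMaxBuffer then
        indexBuffer tBuffer             -- break the records down (see the next point)
        put empty into tBuffer
      end if
      if tStatus is "eof" then exit repeat
    end repeat
    close file pPath
  end indexFile

Reading record by record keeps the buffer to whole records, and the cap keeps the variable from growing to the point where MetaCard bogs down.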

- The MetaCard variables that were used as indexes to store the various record variables were always simple tab-delimited lists, or arrays that would later be converted to lists. These were always created using repeat-for-each-line loops because this type of loop is very fast. Another thing that increased the speed of indexing was to dump the contents of the indexes to text files intermittently to free up memory. Again, trial and error was necessary (for me, anyway) to determine the optimal number of times to write to disk: writing takes time, so you're balancing write time against variable size. On an ancient Mac, the one-time indexing process could take about 40 minutes.
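
The indexing pass was along these lines. Again a sketch: the flush interval, the single output file, and the label/value split are simplifications; the real script stored each variable in its own index file.

  constant kFlushEvery = 50000   -- lines between writes; tuned by trial and error

  on indexBuffer pBuffer
    local tIndex, tCount, tLabel, tValue
    put 0 into tCount
    set the itemDelimiter to ":"
    repeat for each line tLine in pBuffer       -- repeat for each is the fast form
      if ":" is not in tLine then next repeat   -- skip delimiters and stray lines
      put item 1 of tLine into tLabel           -- the capitalized variable name
      put item 2 to -1 of tLine into tValue     -- everything after the first colon
      put tLabel & tab & tValue & cr after tIndex
      add 1 to tCount
      if tCount mod kFlushEvery = 0 then
        appendToFile tIndex, "index.txt"        -- dump to disk to keep the variable small
        put empty into tIndex
      end if
    end repeat
    appendToFile tIndex, "index.txt"            -- flush whatever is left
  end indexBuffer

  on appendToFile pData, pPath
    open file pPath for append
    write pData to file pPath
    close file pPath
  end appendToFile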

- The next step was to build a simple interface so that searches could be run and the hits extracted. The user imports a list of search terms (such as probe sets); the appropriate index files are read in and their lines matched against the submitted queries. Output files that can be slapped into a spreadsheet are created, and stats are reported. A batch query of about 500 probe sets across the 100,000-plus DNA sequences used to take about three to seven minutes on an old blue iMac (I forget what those were called).
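
The matching step boiled down to something like this. Sketch again: lineOffset only reports the first hit per query, the single index file is a simplification, and the stats reporting is left out.

  function findProbes pQueryList, pIndexPath
    local tIndex, tHits, tLineNum
    put URL ("file:" & pIndexPath) into tIndex      -- read the whole index file in
    repeat for each line tQuery in pQueryList
      put lineOffset(tQuery, tIndex) into tLineNum
      if tLineNum > 0 then
        put tQuery & tab & (line tLineNum of tIndex) & cr after tHits
      else
        put tQuery & tab & "not found" & cr after tHits
      end if
    end repeat
    return tHits    -- tab-delimited, ready to drop into a spreadsheet
  end findProbes

A batch run then amounts to something like put findProbes(URL "file:probes.txt", "locus_index.txt") into URL "file:hits.txt" (the file names here are made up).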





On Tue, Jan 15, 2008, at 10:47 AM, [EMAIL PROTECTED] wrote:

Gregory, do you have a more detailed study of the architecture of your DNA data solution that you would be willing to share... How physically you're storing and manipulating and reporting your data?

randall
