Re: [Tutor] Words alignment tool
On Sun, 4 Dec 2005, Srinivas Iyyer wrote:

> Contr1 SPR-10 SPR-101 SPR-125 SPR-137 SPR-139 SPR-143
> contr2 SPR-1 SPR-15 SPR-126 SPR-128 SPR-141 SPR-148
> contr3 SPR-106 SPR-130 SPR-135 SPR-138 SPR-139 SPR-145
> contr4 SPR-124 SPR-125 SPR-130 SPR-139 SPR-144 SPR-148

Hi Srinivas,

I'd strongly recommend changing the data representation from a line-oriented to a more structured view. Each line in your data above appears to describe a conceptual set of tuples:

    (control_number, spr_number)

For example, we can think of the line:

    Contr1 SPR-10 SPR-101 SPR-125 SPR-137 SPR-139 SPR-143

as an encoding for the set of tuples written below. (The notation I use below is mathematical and not meant to be interpreted as Python.)

    { (Contr1, SPR-10), (Contr1, SPR-101), (Contr1, SPR-125),
      (Contr1, SPR-137), (Contr1, SPR-139), (Contr1, SPR-143) }

I'm not sure if I'm seeing everything, but from what I can tell so far, your data cries out to be held in a relational database. I agree with Kent: you do not need to "align" anything.

If each element has to be unique within its sequence, then your "alignment" problem transforms into a simpler table lookup problem. That is, if all your data looks like:

    1: A B D E
    2: A C F
    3: A B C D

where no line can have repeated characters, then that data can be transformed into a simple tabular representation, conceptually as:

        A   B   C   D   E   F
    1 | x | x |   | x | x |   |
    2 | x |   | x |   |   | x |
    3 | x | x | x | x |   |   |

So unless there's something here that you're not telling us, there's no need for any complicated alignment algorithms: we just start off with an empty table, and then for each tuple, check off the corresponding entry in the table. Then when we need to look for common elements, we just scan across a row or column of the table.

BLAST is cool, but, like regular expressions, it's not the answer to every string problem.

If you want to implement code to do the above, it's not difficult, but you really should use an SQL database to do this.
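The table lookup described above can be sketched in a few lines of Python. This is only a minimal illustration on the toy A-F data, not a definitive implementation: the function names (parse_line, build_table, render, common_to_all) are hypothetical, and the rendering assumes single-character row and column labels so that the columns line up.

```python
# Illustrative sketch: build a presence table from (row, element) tuples.
# All function names here are hypothetical helpers, not from any library.

def parse_line(line):
    """Split '1: A B D E' into ('1', ['A', 'B', 'D', 'E'])."""
    name, _, rest = line.partition(":")
    return name.strip(), rest.split()

def build_table(lines):
    """Return (row_names, columns, marked); marked is a set of (row, element)."""
    row_names = []
    marked = set()  # the "empty table": nothing is checked off yet
    for line in lines:
        name, elements = parse_line(line)
        row_names.append(name)
        for element in elements:
            marked.add((name, element))  # check off one cell per tuple
    columns = sorted({element for _, element in marked})
    return row_names, columns, marked

def render(row_names, columns, marked):
    """Format the table with an 'x' in every checked cell."""
    out = ["    " + "   ".join(columns)]
    for name in row_names:
        cells = ["x" if (name, col) in marked else " " for col in columns]
        out.append(name + " | " + " | ".join(cells) + " |")
    return "\n".join(out)

def common_to_all(row_names, columns, marked):
    """Scan down each column and keep elements present in every row."""
    return [col for col in columns
            if all((name, col) in marked for name in row_names)]

lines = ["1: A B D E", "2: A C F", "3: A B C D"]
row_names, columns, marked = build_table(lines)
print(render(row_names, columns, marked))
print("common to all rows:", common_to_all(row_names, columns, marked))
```

Keeping the checked cells in a set of tuples means a lookup such as `("2", "F") in marked` is a single membership test, and scanning a row or a column is just a loop over the other axis.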
As a bioinformatician, it would be in your best interest to know SQL, because otherwise, you'll end up trying to reinvent tools that have already been written for you. A good book on introductory relational database usage is "The Practical SQL Handbook: Using Structured Query Language" by Judith Bowman, Sandra Emerson, and Marcy Darnovsky.

Good luck to you.

___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
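For readers who want to try the relational approach recommended above without installing a database server, here is a hedged sketch using Python's standard sqlite3 module with an in-memory database. The table name (membership) and its column names are assumptions made for illustration; the message itself does not prescribe a schema.

```python
import sqlite3

raw = """Contr1 SPR-10 SPR-101 SPR-125 SPR-137 SPR-139 SPR-143
contr2 SPR-1 SPR-15 SPR-126 SPR-128 SPR-141 SPR-148
contr3 SPR-106 SPR-130 SPR-135 SPR-138 SPR-139 SPR-145
contr4 SPR-124 SPR-125 SPR-130 SPR-139 SPR-144 SPR-148"""

# Hypothetical schema: one row per (control, spr) tuple.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE membership ("
    " control TEXT NOT NULL,"
    " spr TEXT NOT NULL,"
    " PRIMARY KEY (control, spr))"
)

# Turn each input line into its set of (control, spr) tuples.
for line in raw.splitlines():
    control, *sprs = line.split()
    conn.executemany(
        "INSERT INTO membership (control, spr) VALUES (?, ?)",
        [(control, spr) for spr in sprs],
    )

# Which SPR elements occur in more than one control, and in how many?
shared = conn.execute(
    """
    SELECT spr, COUNT(*) AS n
    FROM membership
    GROUP BY spr
    HAVING COUNT(*) > 1
    ORDER BY n DESC, spr
    """
).fetchall()

for spr, n in shared:
    print(spr, n)

# Which controls contain one particular element?
controls_with_139 = [
    row[0]
    for row in conn.execute(
        "SELECT control FROM membership WHERE spr = ? ORDER BY control",
        ("SPR-139",),
    )
]
print(controls_with_139)
```

Once the tuples are in a table, "find the common elements" becomes a GROUP BY query rather than an alignment problem, which is the point of the recommendation.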
Re: [Tutor] Words alignment tool
Srinivas Iyyer wrote:

>Dear Expert programmers,
>
>I apologise if this mail is out of context here.
>
>I have a list of elements like these:
>
>Contr1 SPR-10 SPR-101 SPR-125 SPR-137 SPR-139 SPR-143
>contr2 SPR-1 SPR-15 SPR-126 SPR-128 SPR-141 SPR-148
>contr3 SPR-106 SPR-130 SPR-135 SPR-138 SPR-139 SPR-145
>contr4 SPR-124 SPR-125 SPR-130 SPR-139 SPR-144 SPR-148
>
>
>There are several common elements prefixed with SPR-.
>Although these elements are sorted in ascending order
>row wise, the common elements are difficult to spot.
>One has to look for common elements by eyeballing.
>It would be wonderful if these elements are aligned
>properly by inserting gaps.
>
>

I think this is much easier than the bioinformatics problem because your sequence elements are unique and sorted, and you don't have very much data.

One approach is to create pairs that look like ('SPR-10', 'Contr1') for all the data. These pairs can be put into one big list and sorted, then grouped by the first element to get what you want. Python 2.4 added the itertools.groupby() function, which makes it easy to do the grouping.
For example:

import itertools, operator

data = '''Contr1 SPR-10 SPR-101 SPR-125 SPR-137 SPR-139 SPR-143
contr2 SPR-1 SPR-15 SPR-126 SPR-128 SPR-141 SPR-148
contr3 SPR-106 SPR-130 SPR-135 SPR-138 SPR-139 SPR-145
contr4 SPR-124 SPR-125 SPR-130 SPR-139 SPR-144 SPR-148'''.splitlines()

pairs = [] # This will be a list of all the pairs like ('SPR-10', 'Contr1')

for line in data:
    items = line.split()
    name, items = items[0], items[1:]
    # now name is the first item on the line, items is a list of all the rest
    # add the pairs for this line to the main list
    pairs.extend((item, name) for item in items)

pairs.sort() # Sort the list to bring the first items together

# groupby() will return a sequence of key, group pairs where the key is the
# first element of the group
for k, g in itertools.groupby(pairs, operator.itemgetter(0)):
    print(k, [name for item, name in g])

The output of this program is

SPR-1 ['contr2']
SPR-10 ['Contr1']
SPR-101 ['Contr1']
SPR-106 ['contr3']
SPR-124 ['contr4']
SPR-125 ['Contr1', 'contr4']
SPR-126 ['contr2']
SPR-128 ['contr2']
SPR-130 ['contr3', 'contr4']
SPR-135 ['contr3']
SPR-137 ['Contr1']
SPR-138 ['contr3']
SPR-139 ['Contr1', 'contr3', 'contr4']
SPR-141 ['contr2']
SPR-143 ['Contr1']
SPR-144 ['contr4']
SPR-145 ['contr3']
SPR-148 ['contr2', 'contr4']
SPR-15 ['contr2']

Converting this to a horizontal display is still a little tricky but I'll leave that for you.

I should probably explain more about groupby() and itemgetter() but not tonight...

Kent
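One possible take on the horizontal display mentioned above, with comments on what groupby() and itemgetter() are doing. This is a sketch under assumptions rather than the approach the message had in mind: it assumes every label has the form SPR-<number>, it orders columns numerically instead of as strings, and the helper name spr_number is made up for this example.

```python
import itertools
import operator

data = """Contr1 SPR-10 SPR-101 SPR-125 SPR-137 SPR-139 SPR-143
contr2 SPR-1 SPR-15 SPR-126 SPR-128 SPR-141 SPR-148
contr3 SPR-106 SPR-130 SPR-135 SPR-138 SPR-139 SPR-145
contr4 SPR-124 SPR-125 SPR-130 SPR-139 SPR-144 SPR-148""".splitlines()

def spr_number(item):
    """Numeric part of a label like 'SPR-125' (hypothetical helper)."""
    return int(item.split("-")[1])

# Build (item, name) pairs such as ('SPR-10', 'Contr1').
pairs = []
for line in data:
    name, *items = line.split()
    pairs.extend((item, name) for item in items)

# groupby() only merges *adjacent* entries whose keys are equal, so the
# list must be sorted so that equal labels sit next to each other.
pairs.sort(key=lambda pair: (spr_number(pair[0]), pair[0], pair[1]))

# itemgetter(0) builds a function that returns pair[0], i.e. the SPR label,
# which groupby() uses as the grouping key.
members = {}
for item, group in itertools.groupby(pairs, key=operator.itemgetter(0)):
    members[item] = {name for _, name in group}

columns = list(members)  # dicts keep insertion order: numeric SPR order
names = [line.split()[0] for line in data]
width = max(len(col) for col in columns)
name_width = max(len(name) for name in names)

# One output row per control; a blank cell is the "gap".
rows = []
for name in names:
    cells = [(col if name in members[col] else "").ljust(width)
             for col in columns]
    rows.append(name.ljust(name_width) + "  " + " ".join(cells).rstrip())

print("\n".join(rows))
```

Every column has the same width, so a label shared by several controls appears at the same horizontal position in each of their rows. With many distinct labels the rows get wide, so viewing the output in a fixed-width font without line wrapping helps.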
[Tutor] Words alignment tool
Dear Expert programmers,

I apologise if this mail is out of context here.

I have a list of elements like these:

Contr1 SPR-10 SPR-101 SPR-125 SPR-137 SPR-139 SPR-143
contr2 SPR-1 SPR-15 SPR-126 SPR-128 SPR-141 SPR-148
contr3 SPR-106 SPR-130 SPR-135 SPR-138 SPR-139 SPR-145
contr4 SPR-124 SPR-125 SPR-130 SPR-139 SPR-144 SPR-148

There are several common elements prefixed with SPR-. Although these elements are sorted in ascending order row-wise, the common elements are difficult to spot. One has to look for common elements by eyeballing. It would be wonderful if these elements were aligned properly by inserting gaps.

In the bioinformatics world, this is 100% identical to protein or DNA alignment. Example: if there are 3 sequences, DNA1, DNA2 and DNA3, with their sequences:

DNA1: ATTTAA
DNA2: ATAT
DNA3: TAATAATAA

DNA1 ATTTAA
DNA2 A TA T
DNA3 TA AtAAT AA

These 3 sequences are aligned by introducing gaps. However, in DNA and protein sequence alignments, more complex algorithms and treatments are applied to produce a better-scoring alignment. Unfortunately, I cannot apply these algorithms/programs to my data, because these programs are made for DNA and protein sequences.

I googled for some word matchers. There are programs available; however, they align the words without introducing gaps. So ultimately I cannot see the common items clearly lined up (I guess I may be wrong here; they might do better than I think).

My question to the community is: are there any such programs that would generate multiple alignments on user-defined data? I am sure that the idea of multiple alignment might have been extended to other areas of science, economics or LINGUISTICS.

Could anyone help me align my data? I have a total of 50 unique words (SPR-1, SPR-2, SPR-3 and so on, though not exactly in that order or with those digits). For some Control elements I have 120 such words in a row (consider this a sequence with 120 words). So if I have to do this in Excel I will spend the rest of my happy life doing that :-)

However, to show that I tried, I have pasted my attempt below (it derailed completely).

So, dear respected members, do you have any suggestions of any such programs that I can use in this world of CS?

Thank you.
S

Contr1 SPR-10 SPR-15 SPR-101 SPR-106 SPR-138 SPR-139 SPR-140 SPR-144 SPR-148
contr2 SPR-1 SPR-10 SPR-101 SPR-130 SPR-138 SPR-139 SPR-142 SPR-144 SPR-148
contr3 SPR-15 SPR-16 SPR-17 SPR-106 SPR-130 SPR-135 SPR-139 SPR-144 SPR-181