Re: [Tutor] Words alignment tool

2005-12-04 Thread Danny Yoo


On Sun, 4 Dec 2005, Srinivas Iyyer wrote:

> Contr1  SPR-10  SPR-101 SPR-125 SPR-137 SPR-139 SPR-143
> contr2  SPR-1   SPR-15  SPR-126 SPR-128 SPR-141 SPR-148
> contr3  SPR-106 SPR-130 SPR-135 SPR-138 SPR-139 SPR-145
> contr4  SPR-124 SPR-125 SPR-130 SPR-139 SPR-144 SPR-148

Hi Srinivas,

I'd strongly recommend changing the data representation from a
line-oriented to a more structured view.  Each line in your data above
appears to describe a conceptual set of tuples:

(control_number, spr_number)

For example, we can think of the line:

Contr1  SPR-10  SPR-101 SPR-125 SPR-137 SPR-139 SPR-143

as an encoding for the set of tuples written below (The notation I use
below is mathematical and not meant to be interpreted as Python.):

{ (Contr1, SPR-10),
  (Contr1, SPR-101),
  (Contr1, SPR-125),
  (Contr1, SPR-137),
  (Contr1, SPR-139),
  (Contr1, SPR-143) }
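
As a rough sketch of that decoding step in Python (the helper name
line_to_pairs is made up for illustration, and the code assumes the fields
are separated by whitespace):

```python
def line_to_pairs(line):
    """Turn one whitespace-separated line into a set of (control, spr) tuples."""
    fields = line.split()
    control, sprs = fields[0], fields[1:]
    return {(control, spr) for spr in sprs}


pairs = line_to_pairs("Contr1  SPR-10  SPR-101 SPR-125 SPR-137 SPR-139 SPR-143")
for pair in sorted(pairs):
    print(pair)
```

Using a set here mirrors the mathematical notation: order does not matter
and a repeated element would collapse into one tuple.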

I'm not sure if I'm seeing everything, but from what I can tell so far,
your data cries out to be held in a relational database.  I agree with
Kent: you do not need to "align" anything.  If each element can appear at
most once within a sequence, then your "alignment" problem transforms into
a simpler table lookup problem.


That is, if all your data looks like:

1: A B D E
2: A C F
3: A B C D

where no line can have repeated characters, then that data can be
transformed into a simple tabular representation, conceptually as:


    A   B   C   D   E   F
1 | x | x |   | x | x |   |
2 | x |   | x |   |   | x |
3 | x | x | x | x |   |   |


So unless there's something here that you're not telling us, there's no
need for any complicated alignment algorithms: we just start off with an
empty table, and then for each tuple, mark the corresponding entry in
the table.
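
Here is a minimal sketch of that table-filling idea in plain Python, using
the small A-F example from above.  The names rows, columns, table and
render are just illustrative choices, and a dictionary of dictionaries is
only one of several reasonable ways to hold the table:

```python
rows = {
    "1": ["A", "B", "D", "E"],
    "2": ["A", "C", "F"],
    "3": ["A", "B", "C", "D"],
}

# Every distinct element becomes one column of the table.
columns = sorted({item for items in rows.values() for item in items})

# Start with an empty table, then mark one cell per (row, item) tuple.
table = {name: {col: False for col in columns} for name in rows}
for name, items in rows.items():
    for item in items:
        table[name][item] = True


def render(table, columns):
    """Draw the table as text, with 'x' for each marked cell."""
    lines = ["    " + "   ".join(columns)]
    for name in table:
        cells = " | ".join("x" if table[name][col] else " " for col in columns)
        lines.append(f"{name} | {cells} |")
    return "\n".join(lines)


print(render(table, columns))
```

The rendering assumes single-character row and column labels; longer
labels such as SPR-139 would need padding to a common width.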

Then when we need to look for common elements, we just scan across a row
or column of the table.  BLAST is cool, but, like regular expressions,
it's not the answer to every string problem.
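
To make the "scan across a row or column" step concrete, here is a small
sketch that stores each row of the A-F example as a Python set.  The
function names are hypothetical, and sets are my assumption about a
convenient representation, not the only one:

```python
table = {
    "1": {"A", "B", "D", "E"},
    "2": {"A", "C", "F"},
    "3": {"A", "B", "C", "D"},
}


def rows_with(table, element):
    """Scan down one column: which rows have this element marked?"""
    return sorted(name for name, items in table.items() if element in items)


def shared_between(table, first, second):
    """Compare two rows: which elements are marked in both?"""
    return table[first] & table[second]


def shared_by_all(table):
    """Elements marked in every row of the table."""
    return set.intersection(*table.values())


print(rows_with(table, "B"))
print(sorted(shared_between(table, "1", "3")))
print(sorted(shared_by_all(table)))
```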


If you want to implement code to do the above, it's not difficult, but you
really should use an SQL database to do this.  As a bioinformatician, it
would be in your best interest to know SQL, because otherwise, you'll end
up trying to reinvent tools that have already been written for you.
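
For a taste of what that looks like, here is a hedged sketch using the
sqlite3 module that ships with current Python versions.  The table name
control_spr and its two column names are invented for this example, and an
in-memory database stands in for a real one:

```python
import sqlite3

data = """Contr1  SPR-10  SPR-101 SPR-125 SPR-137 SPR-139 SPR-143
contr2  SPR-1   SPR-15  SPR-126 SPR-128 SPR-141 SPR-148
contr3  SPR-106 SPR-130 SPR-135 SPR-138 SPR-139 SPR-145
contr4  SPR-124 SPR-125 SPR-130 SPR-139 SPR-144 SPR-148"""

# Decode each line into (control, spr) tuples.
pairs = []
for line in data.splitlines():
    fields = line.split()
    pairs.extend((fields[0], spr) for spr in fields[1:])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE control_spr (control TEXT, spr TEXT)")
conn.executemany("INSERT INTO control_spr VALUES (?, ?)", pairs)

# SPR numbers that appear under more than one control, with their counts.
shared = conn.execute(
    """
    SELECT spr, COUNT(*)
    FROM control_spr
    GROUP BY spr
    HAVING COUNT(*) > 1
    ORDER BY spr
    """
).fetchall()

# Controls that contain one particular SPR number.
controls = [
    row[0]
    for row in conn.execute(
        "SELECT control FROM control_spr WHERE spr = ? ORDER BY control",
        ("SPR-139",),
    )
]

for spr, count in shared:
    print(spr, count)
print(controls)
```

Once the tuples are in a table like this, "which elements are common?"
becomes a GROUP BY query rather than an alignment problem.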

A good book on introductory relational database usage is "The Practical
SQL Handbook: Using Structured Query Language" by Judith Bowman, Sandra
Emerson, and Marcy Darnovsky.


Good luck to you.

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Words alignment tool

2005-12-04 Thread Kent Johnson
Srinivas Iyyer wrote:

>Dear Expert programmers, 
>
>I apologise if this mail is out of context here. 
>
>I have a list of elements like these:
>
>Contr1 SPR-10  SPR-101 SPR-125 SPR-137 SPR-139 SPR-143
>contr2 SPR-1   SPR-15  SPR-126 SPR-128 SPR-141 SPR-148 
>contr3 SPR-106 SPR-130 SPR-135 SPR-138 SPR-139 SPR-145
>contr4 SPR-124 SPR-125 SPR-130 SPR-139 SPR-144 SPR-148
>
>
>There are several common elements prefixed with SPR-. 
>Although these elements are sorted in ascending order
>row wise, the common elements are difficult to spot. 
>One has to look for common elements by eyeballing.  
>It would be wonderful if these elements are aligned
>properly by inserting gaps.
>  
>
I think this is much easier than the bioinformatics problem because your 
sequence elements are unique and sorted, and you don't have very much data.

One approach is to create pairs that look like ('SPR-10', 'Contr1') for 
all the data. These pairs can be put into one big list and sorted, then 
grouped by the first element to get what you want. Python 2.4 has the 
groupby() function which makes it easy to do the grouping. For example:

data = '''Contr1  SPR-10  SPR-101 SPR-125 SPR-137 SPR-139 SPR-143
contr2  SPR-1   SPR-15  SPR-126 SPR-128 SPR-141 SPR-148
contr3  SPR-106 SPR-130 SPR-135 SPR-138 SPR-139 SPR-145
contr4  SPR-124 SPR-125 SPR-130 SPR-139 SPR-144 SPR-148'''.splitlines()

import itertools, operator

pairs = [] # This will be a list of all the pairs like ('SPR-10', 'Contr1')

for line in data:
    items = line.split()
    name, items = items[0], items[1:]
    # now name is the first item on the line, items is a list of all the rest
    # add the pairs for this line to the main list
    pairs.extend((item, name) for item in items)

pairs.sort()   # Sort the list to bring the first items together

# groupby() will return a sequence of key, group pairs where the key is the
# first element of each pair in the group
for k, g in itertools.groupby(pairs, operator.itemgetter(0)):
    # %-formatting prints the same text under Python 2 and Python 3
    print("%s %s" % (k, [name for item, name in g]))


The output of this program is:
SPR-1 ['contr2']
SPR-10 ['Contr1']
SPR-101 ['Contr1']
SPR-106 ['contr3']
SPR-124 ['contr4']
SPR-125 ['Contr1', 'contr4']
SPR-126 ['contr2']
SPR-128 ['contr2']
SPR-130 ['contr3', 'contr4']
SPR-135 ['contr3']
SPR-137 ['Contr1']
SPR-138 ['contr3']
SPR-139 ['Contr1', 'contr3', 'contr4']
SPR-141 ['contr2']
SPR-143 ['Contr1']
SPR-144 ['contr4']
SPR-145 ['contr3']
SPR-148 ['contr2', 'contr4']
SPR-15 ['contr2']

Converting this to a horizontal display is still a little tricky but 
I'll leave that for you.
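
For what it's worth, one possible sketch of a horizontal display follows.
It is written for Python 3, the helper names spr_number and aligned_line
are made up for the example, and it sorts the columns by the number after
"SPR-" rather than as strings, which is an assumption about the ordering
you would want:

```python
data = """Contr1  SPR-10  SPR-101 SPR-125 SPR-137 SPR-139 SPR-143
contr2  SPR-1   SPR-15  SPR-126 SPR-128 SPR-141 SPR-148
contr3  SPR-106 SPR-130 SPR-135 SPR-138 SPR-139 SPR-145
contr4  SPR-124 SPR-125 SPR-130 SPR-139 SPR-144 SPR-148"""

# Map each control name to the set of SPR items on its line.
rows = {}
for line in data.splitlines():
    fields = line.split()
    rows[fields[0]] = set(fields[1:])


def spr_number(label):
    """Numeric part of a label like 'SPR-139', used as the sort key."""
    return int(label.split("-")[1])


# One column per distinct item, in numeric order, all padded to one width.
columns = sorted(set().union(*rows.values()), key=spr_number)
width = max(len(col) for col in columns)


def aligned_line(name, items):
    """Print an item in its own column, or blanks where the row lacks it."""
    cells = [(col if col in items else "").ljust(width) for col in columns]
    return (name.ljust(8) + " ".join(cells)).rstrip()


for name, items in rows.items():
    print(aligned_line(name, items))
```

With 120 words per row the lines get very wide, so printing the table
transposed (one SPR item per line, one column per control) may be easier
to read.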

I should probably explain more about groupby() and itemgetter() but not 
tonight...
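
In the meantime, here is a tiny illustration of the two functions on
made-up sample pairs (Python 3 syntax).  The point to notice is that
groupby() only groups adjacent items, which is why the list is sorted
first:

```python
from itertools import groupby
from operator import itemgetter

pairs = [
    ("SPR-139", "contr3"),
    ("SPR-125", "Contr1"),
    ("SPR-139", "Contr1"),
    ("SPR-125", "contr4"),
]

# itemgetter(0) builds a function that returns element 0 of its argument.
first = itemgetter(0)
print(first(("SPR-125", "Contr1")))

# Without sorting, equal keys are not adjacent, so each key shows up twice.
unsorted_keys = [key for key, group in groupby(pairs, key=first)]
print(unsorted_keys)

# After sorting, each key appears once with all of its names grouped.
pairs.sort()
grouped = {
    key: [name for item, name in group]
    for key, group in groupby(pairs, key=first)
}
print(grouped)
```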

Kent



[Tutor] Words alignment tool

2005-12-04 Thread Srinivas Iyyer
Dear Expert programmers, 

I apologise if this mail is out of context here. 

I have a list of elements like these:

Contr1  SPR-10  SPR-101 SPR-125 SPR-137 SPR-139 SPR-143
contr2  SPR-1   SPR-15  SPR-126 SPR-128 SPR-141 SPR-148 
contr3  SPR-106 SPR-130 SPR-135 SPR-138 SPR-139 SPR-145
contr4  SPR-124 SPR-125 SPR-130 SPR-139 SPR-144 SPR-148


There are several common elements prefixed with SPR-.
Although these elements are sorted in ascending order
row-wise, the common elements are difficult to spot.
One has to look for common elements by eyeballing.
It would be wonderful if these elements could be aligned
properly by inserting gaps.

In the bioinformatics world, this is 100% identical to
protein or DNA alignment.

Example:
Suppose there are 3 DNA sequences, DNA1, DNA2 and DNA3:

DNA1: ATTTAA
DNA2: ATAT
DNA3: TAATAATAA


DNA1   ATTTAA
DNA2   A  TA T 
DNA3  TA AtAAT AA


These 3 sequences are aligned by introducing gaps.
However, DNA and protein sequence alignment uses more
complex algorithms and processing in order to produce
a better-scoring alignment.


Unfortunately, I cannot apply these
algorithms/programs to my data, because these programs
are made for DNA and protein sequences.

I googled for some word matchers. There are programs
available; however, they align words without introducing
gaps.  So ultimately I cannot see the common items
clearly lined up (I guess I may be wrong here; they
might work better than I think).

My question to the community is: are there any
programs that would generate a multiple alignment on
user-defined data? I am sure that the idea of multiple
alignment must have been extended to other areas of
science, economics or LINGUISTICS.

Could anyone help me align my data?  I have
a total of 50 unique words (SPR-1, SPR-2, SPR-3
and so on, though not exactly in that order or with
those digits).  For some Control elements I have 120
such words in a row (think of this as a sequence of
120 words).
So if I have to do this in Excel I will spend the rest
of my happy life doing that :-)

However, to show that I tried, I did it by hand and
pasted the result below (it derailed completely).

So, dear respected members, do you have any suggestions
of programs that I can use for this in the world of
CS?

Thank you. 

S



Contr1  SPR-10  SPR-15  SPR-101 SPR-106 SPR-138 SPR-139 SPR-140 SPR-144 SPR-148

contr2  SPR-1   SPR-10  SPR-101 SPR-130 SPR-138 SPR-139 SPR-142 SPR-144 SPR-148

contr3  SPR-15  SPR-16  SPR-17  SPR-106 SPR-130 SPR-135 SPR-139 SPR-144 SPR-181




