Re: [Tutor] Faster procedure to filter two lists . Please help

2005-01-16 Thread Alan Gauld
> It's educational to look at the code anyway .  Here it is,
from
> Python's listobject.c:

Its been on my list of things to do for a long time.
The biggest problem with C code is usually finding the
bit you want, unless there is a good roadmap (aka
design document)

> static int
> list_length(PyListObject *a)
> {
> return a->ob_size;
> }

> implementation.  OTOH, people should be afraid to _write_ C code:
it
> will bite you every time <0.9 wink>.

Yep, having run the maintenance team for a 3.5 milion line
project in C/C++ its truly amazing the number of ways you
can screw up in C! Its one of the reasons I now use Python
and leave the C/Java stuff to others...

Alan G.

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Faster procedure to filter two lists . Please help

2005-01-15 Thread Tim Peters
[Alan Gauld]
> OK, The timbot's word is good enough for me, I won't bother
> looking at the code, I'll revert to my previous assumption! :-)

It's educational to look at the code anyway .  Here it is, from
Python's listobject.c:

static int
list_length(PyListObject *a)
{
return a->ob_size;
}

People shouldn't be afraid to look at C code -- it won't bite them,
and it's often the easiest way to get a definitive answer about the
implementation.  OTOH, people should be afraid to _write_ C code:  it
will bite you every time <0.9 wink>.
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Faster procedure to filter two lists . Please help

2005-01-15 Thread Alan Gauld
> GR> len is actually a field in the underlying C object so len() is a
> GR> constant (O(1)) and as-fast-as-it-can-be operation.
>
TP> ...n integers), but (ignoring the range() complication) there's no
TP> difference in O() behavior between the two.

OK, The timbot's word is good enough for me, I won't bother
looking at the code, I'll revert to my previous assumption! :-)

Alan G.

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Faster procedure to filter two lists . Please help

2005-01-15 Thread Alan Gauld
> It this correct? Python lists are not linked-lists (as in Scheme,
for
> example). They are more like arrays (or vectors in C++/Java) with a
> little more sofistication built into them to allow, for example, to
> amortize over time a sequence of append operations. But in a
nutshell,
> len is actually a field in the underlying C object so len() is a
> constant (O(1)) and as-fast-as-it-can-be operation.
>
> Someone correct me, please, If I'm wrong.

I thought that too, but after a recent thread when I said
the same I got an off list mail assuring me that len() was
calculated at run time by iterating over the collection!

I guess I should go and look at the code!
However in either case missing out the two function calls
to len() would help improve the speed...

Alan G.

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Faster procedure to filter two lists . Please help

2005-01-15 Thread Tim Peters
[Gonçalo Rodrigues]
> It this correct? Python lists are not linked-lists (as in Scheme, for
> example). They are more like arrays (or vectors in C++/Java) with a
> little more sofistication built into them to allow, for example, to
> amortize over time a sequence of append operations. But in a nutshell,
> len is actually a field in the underlying C object so len() is a
> constant (O(1)) and as-fast-as-it-can-be operation.
>
> Someone correct me, please, If I'm wrong.

That's all correct.  It remains more idiomatic to do

for thing in a_list:
...

than

for i in range(len(a_list)):
thing = a_list[i]
...

and the former is in fact a bit faster (or possibly a lot, if a_list
is very large, because range(n) actually constructs a list containing
n integers), but (ignoring the range() complication) there's no
difference in O() behavior between the two.
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Faster procedure to filter two lists . Please help

2005-01-15 Thread Gonçalo Rodrigues
Alan Gauld wrote:
By missing out a loop(*) and some splits it should speed up 
significantly for the cost of some small added complexity in building
 the dictionaries in the first case.

(*)In fact 3 loops because you aren't doing len() which effectively 
loops over the collection too.

It this correct? Python lists are not linked-lists (as in Scheme, for
example). They are more like arrays (or vectors in C++/Java) with a
little more sofistication built into them to allow, for example, to
amortize over time a sequence of append operations. But in a nutshell,
len is actually a field in the underlying C object so len() is a
constant (O(1)) and as-fast-as-it-can-be operation.
Someone correct me, please, If I'm wrong.
Best regards,
G. Rodrigues
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Faster procedure to filter two lists . Please help

2005-01-15 Thread Alan Gauld
>  Thank you for your suggestion. I tried creating a
> dictionary of 'what' list and searched keys with
> has_key method and it is pretty fast. 

Use try/except and it will be faster still. The except only 
gets triggered if theres a missing entry - which in your 
case should be hardly ever! So you save a function call 
per item...

Alan G.
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Faster procedure to filter two lists . Please help

2005-01-15 Thread Alan Gauld
> >>>for i in range(len(what)):
> ele = split(what[i],'\t')
> cor1 = ele[0]

It will be faster if you stop using len() in your for loops!

for item in what:

> for k in range(len(my_report)):
> cols = split(my_report[k],'\t')
> cor = cols[0]

And again:
for item in my_report:

> if cor1 == cor:
> print cor+'\t'+ele[1]+'\t'+cols[1]+'\t'+cols[2]

And it will be faster if you stop adding the strings 
together. Each addition creates a new string. Use the 
formatting operation instead:

print "%s\t%s\t%s\t%s" % (cor,ele[1],cols[1],cols[2])

The other thing to consider is using a dictionary.
You are effectively using the cor bit as a key, so if 
instead of creating two lists ytou create two dictionaries
you will avoid all the splitting stuff and have instant access:

for key in what.keys():
  try:
print "%s\t%s\t%s\t%s" % (key, what[key][0], 
  report[key][0],report[key][1])
  except KeyError: pass# or print an error message

By missing out a loop(*) and some splits it should speed 
up significantly for the cost of some small added 
complexity in building the dictionaries in the first case.

(*)In fact 3 loops because you aren't doing len() which 
   effectively loops over the collection too.

HTH,

Alan G
Author of the Learn to Program web tutor
http://www.freenetpages.co.uk/hp/alan.gauld
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Faster procedure to filter two lists . Please help

2005-01-14 Thread kumar s
Hi Danny:
 Thank you for your suggestion. I tried creating a
dictionary of 'what' list and searched keys with
has_key method and it is pretty fast. 

Thanks again. following is the piece of code.

K



>>> cors = []
>>> intr = []
>>> for i in range(len(what)):
ele = split(what[i],'\t')
cors.append(ele[0])
intr.append(ele[1])


>>> what_dict = dict(zip(cors,intr))

>>> for i in range(len(my_report)):
cols = split(my_report[i],'\t')
cor = cols[0]
if what_dict.has_key(cor):
intr = what_dict[cor]

my_final_report.append(cols[0]+'\t'+intr+'\t'+cols[1]+'\t'+cols[2])





--- Danny Yoo <[EMAIL PROTECTED]> wrote:

> 
> 
> On Fri, 14 Jan 2005, kumar s wrote:
> 
> > >>>for i in range(len(what)):
> > ele = split(what[i],'\t')
> > cor1 = ele[0]
> > for k in range(len(my_report)):
> > cols = split(my_report[k],'\t')
> > cor = cols[0]
> > if cor1 == cor:
> > print cor+'\t'+ele[1]+'\t'+cols[1]+'\t'+cols[2]
> 
> 
> 
> Hi Kumar,
> 
> 
> Ok, this calls for the use of an "associative map"
> or "dictionary".
> 
> 
> The main time sink is the loop here:
> 
> > for k in range(len(my_report)):
> > cols = split(my_report[k],'\t')
> > cor = cols[0]
> > if cor1 == cor:
> > print cor+'\t'+ele[1]+'\t'+cols[1]+'\t'+cols[2]
> 
> Conceptually, my_report can be considered a list of
> key/value pairs.  For
> each element in 'my_report', the "key" is the first
> column (cols[0]), and
> the "value" is the rest of the columns (cols[1:]).
> 
> 
> The loop above can, in a pessimistic world, require
> a search across the
> whole of 'my_report'.  This can take time that is
> proportional to the
> length of 'my_report'.  You mentioned earlier that
> each list might be of
> length 249502, so we're looking into a process whose
> overall cost is
> gigantic.
> 
> 
> [Notes on calculating runtime cost: when the
> structure of the code looks
> like:
> 
> for element1 in list1:
> for element2 in list2:
> some_operation_that_costs_K_time()
> 
> then the overall cost of running this loop will be
> 
> K * len(list1) * len(list2)
> ]
> 
> 
> We can do much better than this if we use a
> "dictionary" data structure. A
> "dictionary" can reduce the time it takes to do a
> lookup search down from
> a linear-time operation to an atomic-time one.  Do
> you know about
> dictionaries yet?  You can take a look at:
> 
> http://www.ibiblio.org/obp/thinkCSpy/chap10.htm
> 
> which will give an overview of a dictionary.  It
> doesn't explain why
> dictionary lookup is fast, but we can talk about
> that later if you want.
> 
> 
> Please feel free to ask any questions about
> dictionaries and their use.
> Learning how to use a dictionary data structure is a
> skill that pays back
> extraordinarily well.
> 
> 
> Good luck!
> 
> 




__ 
Do you Yahoo!? 
Meet the all-new My Yahoo! - Try it today! 
http://my.yahoo.com 
 

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Faster procedure to filter two lists . Please help

2005-01-14 Thread Max Noel
On Jan 14, 2005, at 23:28, kumar s wrote:
for i in range(len(what)):
ele = split(what[i],'\t')
cor1 = ele[0]
for k in range(len(my_report)):
cols = split(my_report[k],'\t')
cor = cols[0]
if cor1 == cor:
print cor+'\t'+ele[1]+'\t'+cols[1]+'\t'+cols[2]
164:623 6649TCATGGCTGACAACCCATCTTGGGA   
484:11  6687ATTATCATCACATGCAGCTTCACGC   
490:339 6759GAATCCGCCAGAACACAGACA   
247:57  6880AGTCCTCGTGGAACTACAACTTCAT   
113:623 6901TCATGGGTGTTCGGCATGAAA   
	Okay, so the idea is, the first column of each row is a key, and you 
want to display only the rows whose key is the first column (key?) of a 
row in my_report, right?

	As Danny said, you should use dictionaries for this, with a structure 
in the lines of:

what = {'164:623': '6649TCATGGCTGACAACCCATCTTGGGA',
'484:11': '6687 ATTATCATCACATGCAGCTTCACGC',
'490:339': '6759GAATCCGCCAGAACACAGACA',
} (etc.)
	Lacking that, as Danny said, nested loops are a huge time sink. Also, 
you should avoid using C-style for loops -- Python-style for loops 
(equivalent to Perl's foreach) are much more elegant (and probably 
faster) in that case. Here's how I would do it with your data 
structures (warning, untested code, test before use):

# First, create a list where each element is one of the keys in 
my_report
# Also, strings have a split() method, which by default splits on any 
whitespace
# (tabs included)
headers = [element.split()[0] for element in my_report]

for element in what:
   # Okay, the nested loop is still (more or less) there, but it occurs 
within a
   # 'in' operator, and is therefore executed in C -- much faster.
   if element.split()[0] in headers:
  print element

	Also, it's shorter -- 4 lines, comments aside. Nevertheless, as Danny 
suggested, an approach using dictionaries would blow this away, 
speed-wise.

Hope that helps,
-- Max
maxnoel_fr at yahoo dot fr -- ICQ #85274019
"Look at you hacker... A pathetic creature of meat and bone, panting 
and sweating as you run through my corridors... How can you challenge a 
perfect, immortal machine?"

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Faster procedure to filter two lists . Please help

2005-01-14 Thread Danny Yoo


On Fri, 14 Jan 2005, kumar s wrote:

> >>>for i in range(len(what)):
>   ele = split(what[i],'\t')
>   cor1 = ele[0]
>   for k in range(len(my_report)):
>   cols = split(my_report[k],'\t')
>   cor = cols[0]
>   if cor1 == cor:
>   print cor+'\t'+ele[1]+'\t'+cols[1]+'\t'+cols[2]



Hi Kumar,


Ok, this calls for the use of an "associative map" or "dictionary".


The main time sink is the loop here:

>   for k in range(len(my_report)):
>   cols = split(my_report[k],'\t')
>   cor = cols[0]
>   if cor1 == cor:
>   print cor+'\t'+ele[1]+'\t'+cols[1]+'\t'+cols[2]

Conceptually, my_report can be considered a list of key/value pairs.  For
each element in 'my_report', the "key" is the first column (cols[0]), and
the "value" is the rest of the columns (cols[1:]).


The loop above can, in a pessimistic world, require a search across the
whole of 'my_report'.  This can take time that is proportional to the
length of 'my_report'.  You mentioned earlier that each list might be of
length 249502, so we're looking into a process whose overall cost is
gigantic.


[Notes on calculating runtime cost: when the structure of the code looks
like:

for element1 in list1:
for element2 in list2:
some_operation_that_costs_K_time()

then the overall cost of running this loop will be

K * len(list1) * len(list2)
]


We can do much better than this if we use a "dictionary" data structure. A
"dictionary" can reduce the time it takes to do a lookup search down from
a linear-time operation to an atomic-time one.  Do you know about
dictionaries yet?  You can take a look at:

http://www.ibiblio.org/obp/thinkCSpy/chap10.htm

which will give an overview of a dictionary.  It doesn't explain why
dictionary lookup is fast, but we can talk about that later if you want.


Please feel free to ask any questions about dictionaries and their use.
Learning how to use a dictionary data structure is a skill that pays back
extraordinarily well.


Good luck!

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


[Tutor] Faster procedure to filter two lists . Please help

2005-01-14 Thread kumar s
Hi group:
I have two lists a. 'my_report' and b. 'what'.

In list 'what', I want to take 6649 (element1:
164:623\t6649) and write to a new list ( although I
printed the result, my 
intension is to list.append(result). 

I took column 1 value of element 1 in what, which is
164:623 and checked in column 1 value in list
my_report, if it matches
I asked it to write the all columns of my_report along
with column 2 value in what list. 

(with my explanation, I feel I made it complex).

Here is what I did:




>>> what[0:4]
['164:623\t6649', '484:11\t6687', '490:339\t6759',
'247:57\t6880', '113:623\t6901']



>>>my_report[0:4]

['164:623\tTCATGGCTGACAACCCATCTTGGGA\t20_s_at',
'484:11\tATTATCATCACATGCAGCTTCACGC\t20_s_at',
'490:339\tGAATCCGCCAGAACACAGACA\t20_s_at',
'247:57\tAGTCCTCGTGGAACTACAACTTCAT\t20_s_at',
'113:623\tTCATGGGTGTTCGGCATGAAA\t20_s_at']







>>>for i in range(len(what)):
ele = split(what[i],'\t')
cor1 = ele[0]
for k in range(len(my_report)):
cols = split(my_report[k],'\t')
cor = cols[0]
if cor1 == cor:
print cor+'\t'+ele[1]+'\t'+cols[1]+'\t'+cols[2]


164:623 6649TCATGGCTGACAACCCATCTTGGGA   
484:11  6687ATTATCATCACATGCAGCTTCACGC   
490:339 6759GAATCCGCCAGAACACAGACA   
247:57  6880AGTCCTCGTGGAACTACAACTTCAT   
113:623 6901TCATGGGTGTTCGGCATGAAA   




PROBLEM:

This process is very very slow. I have 249502 elements
in each list. The process has been running for over 30
min.  Could 
any one suggest a better fast procedure, to save time.


Thank you in advance. 

K



__ 
Do you Yahoo!? 
Yahoo! Mail - Easier than ever with enhanced search. Learn more.
http://info.mail.yahoo.com/mail_250
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor