Re: [Tutor] Faster procedure to filter two lists . Please help
> It's educational to look at the code anyway . Here it is, from > Python's listobject.c: Its been on my list of things to do for a long time. The biggest problem with C code is usually finding the bit you want, unless there is a good roadmap (aka design document) > static int > list_length(PyListObject *a) > { > return a->ob_size; > } > implementation. OTOH, people should be afraid to _write_ C code: it > will bite you every time <0.9 wink>. Yep, having run the maintenance team for a 3.5 milion line project in C/C++ its truly amazing the number of ways you can screw up in C! Its one of the reasons I now use Python and leave the C/Java stuff to others... Alan G. ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Faster procedure to filter two lists . Please help
[Alan Gauld] > OK, The timbot's word is good enough for me, I won't bother > looking at the code, I'll revert to my previous assumption! :-) It's educational to look at the code anyway . Here it is, from Python's listobject.c: static int list_length(PyListObject *a) { return a->ob_size; } People shouldn't be afraid to look at C code -- it won't bite them, and it's often the easiest way to get a definitive answer about the implementation. OTOH, people should be afraid to _write_ C code: it will bite you every time <0.9 wink>. ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Faster procedure to filter two lists . Please help
> GR> len is actually a field in the underlying C object so len() is a > GR> constant (O(1)) and as-fast-as-it-can-be operation. > TP> ...n integers), but (ignoring the range() complication) there's no TP> difference in O() behavior between the two. OK, The timbot's word is good enough for me, I won't bother looking at the code, I'll revert to my previous assumption! :-) Alan G. ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Faster procedure to filter two lists . Please help
> It this correct? Python lists are not linked-lists (as in Scheme, for > example). They are more like arrays (or vectors in C++/Java) with a > little more sofistication built into them to allow, for example, to > amortize over time a sequence of append operations. But in a nutshell, > len is actually a field in the underlying C object so len() is a > constant (O(1)) and as-fast-as-it-can-be operation. > > Someone correct me, please, If I'm wrong. I thought that too, but after a recent thread when I said the same I got an off list mail assuring me that len() was calculated at run time by iterating over the collection! I guess I should go and look at the code! However in either case missing out the two function calls to len() would help improve the speed... Alan G. ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Faster procedure to filter two lists . Please help
[Gonçalo Rodrigues] > It this correct? Python lists are not linked-lists (as in Scheme, for > example). They are more like arrays (or vectors in C++/Java) with a > little more sofistication built into them to allow, for example, to > amortize over time a sequence of append operations. But in a nutshell, > len is actually a field in the underlying C object so len() is a > constant (O(1)) and as-fast-as-it-can-be operation. > > Someone correct me, please, If I'm wrong. That's all correct. It remains more idiomatic to do for thing in a_list: ... than for i in range(len(a_list)): thing = a_list[i] ... and the former is in fact a bit faster (or possibly a lot, if a_list is very large, because range(n) actually constructs a list containing n integers), but (ignoring the range() complication) there's no difference in O() behavior between the two. ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Faster procedure to filter two lists . Please help
Alan Gauld wrote: By missing out a loop(*) and some splits it should speed up significantly for the cost of some small added complexity in building the dictionaries in the first case. (*)In fact 3 loops because you aren't doing len() which effectively loops over the collection too. It this correct? Python lists are not linked-lists (as in Scheme, for example). They are more like arrays (or vectors in C++/Java) with a little more sofistication built into them to allow, for example, to amortize over time a sequence of append operations. But in a nutshell, len is actually a field in the underlying C object so len() is a constant (O(1)) and as-fast-as-it-can-be operation. Someone correct me, please, If I'm wrong. Best regards, G. Rodrigues ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Faster procedure to filter two lists . Please help
> Thank you for your suggestion. I tried creating a > dictionary of 'what' list and searched keys with > has_key method and it is pretty fast. Use try/except and it will be faster still. The except only gets triggered if theres a missing entry - which in your case should be hardly ever! So you save a function call per item... Alan G. ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Faster procedure to filter two lists . Please help
> >>>for i in range(len(what)): > ele = split(what[i],'\t') > cor1 = ele[0] It will be faster if you stop using len() in your for loops! for item in what: > for k in range(len(my_report)): > cols = split(my_report[k],'\t') > cor = cols[0] And again: for item in my_report: > if cor1 == cor: > print cor+'\t'+ele[1]+'\t'+cols[1]+'\t'+cols[2] And it will be faster if you stop adding the strings together. Each addition creates a new string. Use the formatting operation instead: print "%s\t%s\t%s\t%s" % (cor,ele[1],cols[1],cols[2]) The other thing to consider is using a dictionary. You are effectively using the cor bit as a key, so if instead of creating two lists ytou create two dictionaries you will avoid all the splitting stuff and have instant access: for key in what.keys(): try: print "%s\t%s\t%s\t%s" % (key, what[key][0], report[key][0],report[key][1]) except KeyError: pass# or print an error message By missing out a loop(*) and some splits it should speed up significantly for the cost of some small added complexity in building the dictionaries in the first case. (*)In fact 3 loops because you aren't doing len() which effectively loops over the collection too. HTH, Alan G Author of the Learn to Program web tutor http://www.freenetpages.co.uk/hp/alan.gauld ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Faster procedure to filter two lists . Please help
Hi Danny: Thank you for your suggestion. I tried creating a dictionary of 'what' list and searched keys with has_key method and it is pretty fast. Thanks again. following is the piece of code. K >>> cors = [] >>> intr = [] >>> for i in range(len(what)): ele = split(what[i],'\t') cors.append(ele[0]) intr.append(ele[1]) >>> what_dict = dict(zip(cors,intr)) >>> for i in range(len(my_report)): cols = split(my_report[i],'\t') cor = cols[0] if what_dict.has_key(cor): intr = what_dict[cor] my_final_report.append(cols[0]+'\t'+intr+'\t'+cols[1]+'\t'+cols[2]) --- Danny Yoo <[EMAIL PROTECTED]> wrote: > > > On Fri, 14 Jan 2005, kumar s wrote: > > > >>>for i in range(len(what)): > > ele = split(what[i],'\t') > > cor1 = ele[0] > > for k in range(len(my_report)): > > cols = split(my_report[k],'\t') > > cor = cols[0] > > if cor1 == cor: > > print cor+'\t'+ele[1]+'\t'+cols[1]+'\t'+cols[2] > > > > Hi Kumar, > > > Ok, this calls for the use of an "associative map" > or "dictionary". > > > The main time sink is the loop here: > > > for k in range(len(my_report)): > > cols = split(my_report[k],'\t') > > cor = cols[0] > > if cor1 == cor: > > print cor+'\t'+ele[1]+'\t'+cols[1]+'\t'+cols[2] > > Conceptually, my_report can be considered a list of > key/value pairs. For > each element in 'my_report', the "key" is the first > column (cols[0]), and > the "value" is the rest of the columns (cols[1:]). > > > The loop above can, in a pessimistic world, require > a search across the > whole of 'my_report'. This can take time that is > proportional to the > length of 'my_report'. You mentioned earlier that > each list might be of > length 249502, so we're looking into a process whose > overall cost is > gigantic. > > > [Notes on calculating runtime cost: when the > structure of the code looks > like: > > for element1 in list1: > for element2 in list2: > some_operation_that_costs_K_time() > > then the overall cost of running this loop will be > > K * len(list1) * len(list2) > ] > > > We can do much better than this if we use a > "dictionary" data structure. A > "dictionary" can reduce the time it takes to do a > lookup search down from > a linear-time operation to an atomic-time one. Do > you know about > dictionaries yet? You can take a look at: > > http://www.ibiblio.org/obp/thinkCSpy/chap10.htm > > which will give an overview of a dictionary. It > doesn't explain why > dictionary lookup is fast, but we can talk about > that later if you want. > > > Please feel free to ask any questions about > dictionaries and their use. > Learning how to use a dictionary data structure is a > skill that pays back > extraordinarily well. > > > Good luck! > > __ Do you Yahoo!? Meet the all-new My Yahoo! - Try it today! http://my.yahoo.com ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Faster procedure to filter two lists . Please help
On Jan 14, 2005, at 23:28, kumar s wrote: for i in range(len(what)): ele = split(what[i],'\t') cor1 = ele[0] for k in range(len(my_report)): cols = split(my_report[k],'\t') cor = cols[0] if cor1 == cor: print cor+'\t'+ele[1]+'\t'+cols[1]+'\t'+cols[2] 164:623 6649TCATGGCTGACAACCCATCTTGGGA 484:11 6687ATTATCATCACATGCAGCTTCACGC 490:339 6759GAATCCGCCAGAACACAGACA 247:57 6880AGTCCTCGTGGAACTACAACTTCAT 113:623 6901TCATGGGTGTTCGGCATGAAA Okay, so the idea is, the first column of each row is a key, and you want to display only the rows whose key is the first column (key?) of a row in my_report, right? As Danny said, you should use dictionaries for this, with a structure in the lines of: what = {'164:623': '6649TCATGGCTGACAACCCATCTTGGGA', '484:11': '6687 ATTATCATCACATGCAGCTTCACGC', '490:339': '6759GAATCCGCCAGAACACAGACA', } (etc.) Lacking that, as Danny said, nested loops are a huge time sink. Also, you should avoid using C-style for loops -- Python-style for loops (equivalent to Perl's foreach) are much more elegant (and probably faster) in that case. Here's how I would do it with your data structures (warning, untested code, test before use): # First, create a list where each element is one of the keys in my_report # Also, strings have a split() method, which by default splits on any whitespace # (tabs included) headers = [element.split()[0] for element in my_report] for element in what: # Okay, the nested loop is still (more or less) there, but it occurs within a # 'in' operator, and is therefore executed in C -- much faster. if element.split()[0] in headers: print element Also, it's shorter -- 4 lines, comments aside. Nevertheless, as Danny suggested, an approach using dictionaries would blow this away, speed-wise. Hope that helps, -- Max maxnoel_fr at yahoo dot fr -- ICQ #85274019 "Look at you hacker... A pathetic creature of meat and bone, panting and sweating as you run through my corridors... How can you challenge a perfect, immortal machine?" ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Faster procedure to filter two lists . Please help
On Fri, 14 Jan 2005, kumar s wrote: > >>>for i in range(len(what)): > ele = split(what[i],'\t') > cor1 = ele[0] > for k in range(len(my_report)): > cols = split(my_report[k],'\t') > cor = cols[0] > if cor1 == cor: > print cor+'\t'+ele[1]+'\t'+cols[1]+'\t'+cols[2] Hi Kumar, Ok, this calls for the use of an "associative map" or "dictionary". The main time sink is the loop here: > for k in range(len(my_report)): > cols = split(my_report[k],'\t') > cor = cols[0] > if cor1 == cor: > print cor+'\t'+ele[1]+'\t'+cols[1]+'\t'+cols[2] Conceptually, my_report can be considered a list of key/value pairs. For each element in 'my_report', the "key" is the first column (cols[0]), and the "value" is the rest of the columns (cols[1:]). The loop above can, in a pessimistic world, require a search across the whole of 'my_report'. This can take time that is proportional to the length of 'my_report'. You mentioned earlier that each list might be of length 249502, so we're looking into a process whose overall cost is gigantic. [Notes on calculating runtime cost: when the structure of the code looks like: for element1 in list1: for element2 in list2: some_operation_that_costs_K_time() then the overall cost of running this loop will be K * len(list1) * len(list2) ] We can do much better than this if we use a "dictionary" data structure. A "dictionary" can reduce the time it takes to do a lookup search down from a linear-time operation to an atomic-time one. Do you know about dictionaries yet? You can take a look at: http://www.ibiblio.org/obp/thinkCSpy/chap10.htm which will give an overview of a dictionary. It doesn't explain why dictionary lookup is fast, but we can talk about that later if you want. Please feel free to ask any questions about dictionaries and their use. Learning how to use a dictionary data structure is a skill that pays back extraordinarily well. Good luck! ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
[Tutor] Faster procedure to filter two lists . Please help
Hi group: I have two lists a. 'my_report' and b. 'what'. In list 'what', I want to take 6649 (element1: 164:623\t6649) and write to a new list ( although I printed the result, my intension is to list.append(result). I took column 1 value of element 1 in what, which is 164:623 and checked in column 1 value in list my_report, if it matches I asked it to write the all columns of my_report along with column 2 value in what list. (with my explanation, I feel I made it complex). Here is what I did: >>> what[0:4] ['164:623\t6649', '484:11\t6687', '490:339\t6759', '247:57\t6880', '113:623\t6901'] >>>my_report[0:4] ['164:623\tTCATGGCTGACAACCCATCTTGGGA\t20_s_at', '484:11\tATTATCATCACATGCAGCTTCACGC\t20_s_at', '490:339\tGAATCCGCCAGAACACAGACA\t20_s_at', '247:57\tAGTCCTCGTGGAACTACAACTTCAT\t20_s_at', '113:623\tTCATGGGTGTTCGGCATGAAA\t20_s_at'] >>>for i in range(len(what)): ele = split(what[i],'\t') cor1 = ele[0] for k in range(len(my_report)): cols = split(my_report[k],'\t') cor = cols[0] if cor1 == cor: print cor+'\t'+ele[1]+'\t'+cols[1]+'\t'+cols[2] 164:623 6649TCATGGCTGACAACCCATCTTGGGA 484:11 6687ATTATCATCACATGCAGCTTCACGC 490:339 6759GAATCCGCCAGAACACAGACA 247:57 6880AGTCCTCGTGGAACTACAACTTCAT 113:623 6901TCATGGGTGTTCGGCATGAAA PROBLEM: This process is very very slow. I have 249502 elements in each list. The process has been running for over 30 min. Could any one suggest a better fast procedure, to save time. Thank you in advance. K __ Do you Yahoo!? Yahoo! Mail - Easier than ever with enhanced search. Learn more. http://info.mail.yahoo.com/mail_250 ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor