kumar s wrote:

On top of this, this process is VERY SLOW on a high-end
server too.

That's because, for each line in spot_cor, you're examining every item in spot_int, and whenever there's a match, you examine every element of spot_int *again*! With n lines in spot_cor and m lines in spot_int, that's O(n*m) work at a minimum, and closer to O(n*m^2) if matches are common -- either way it's at least quadratic time, so as the length of your lists grows, your performance will degrade *very* rapidly.


Here's a simplified version of your code. I've changed the for loops so that they iterate over the lists directly, instead of iterating over a range and then using that number to index into the list -- the result is the same, but this way does less extra work and is easier to read.

    out = open('sa_int_2.txt','w')
    for cor_ele in spot_cor:
        # Scan all of spot_int for every single line of spot_cor...
        for int_ele in spot_int:
            cols = int_ele.split('\t')
            y = cols[0] + '\t' + cols[1]
            if cor_ele == y:
                # ...and on a match, scan all of spot_int yet again.
                for int_ele2 in spot_int:
                    if y in int_ele2:
                        out.write(int_ele2 + '\n')
    out.close()


Remember, every time you use 'in' on a list, it means traversing that entire list. Nesting several statements that use 'in' multiplies that work at each level, so the total grows polynomially.
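For instance (made-up names, just to show the difference in membership tests):

    big_list = [str(i) for i in range(100000)]
    '99999' in big_list     # scans the whole list before answering -- O(n)

    big_set = set(big_list)
    '99999' in big_set      # hash lookup -- effectively constant time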


Now, here's how I'd rewrite your code.

    out = open('sa_int_2.txt','w')

    # Preprocess: index spot_int by its first two columns.
    element_dict = {}
    for line in spot_int:
        cols = line.split('\t')
        key = '\t'.join(cols[:2])
        element_dict[key] = line

    # Now each spot_cor entry is a single dictionary lookup.
    for cor in spot_cor:
        try:
            value = element_dict[cor]
            out.write(value + '\n')
        except KeyError:
            pass
    out.close()

Note that I iterate through each list only once, so my timing should be O(n + m). First, I preprocess spot_int by putting it into a dictionary. I create keys for the dictionary that are identical to what will be in spot_cor. Then I iterate through spot_cor, and match each element with what's in the dictionary I've made. If I find something that matches, I write it out; if I don't, then I simply move on to the next item in spot_cor.
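To make that concrete, here's what happens with a couple of made-up lines (assuming spot_cor really does hold strings like '5\t0', i.e. exactly the first two tab-separated columns):

    spot_int = ['5\t0\t123', '7\t2\t456']   # invented example data
    spot_cor = ['5\t0', '9\t9']

    element_dict = {}
    for line in spot_int:
        cols = line.split('\t')
        element_dict['\t'.join(cols[:2])] = line

    assert element_dict['5\t0'] == '5\t0\t123'   # found -> written out
    assert '9\t9' not in element_dict            # missing -> KeyError, skipped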

This does make several assumptions. First, it assumes that every line in spot_int will be unique in the first two columns -- that is, I'm assuming that nowhere in the file would you find lines like the following:

     5     0     123
     5     0     456

If these cases *do* happen, then all but the last such line would be lost, and only the last one would be left in the dictionary (and thus found and written out). In your version, all lines whose first two columns match will be written out. (This may be the source of your "extra" results.)
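You can see that overwriting with a couple of lines at the interpreter:

    element_dict = {}
    element_dict['5\t0'] = '5\t0\t123'
    element_dict['5\t0'] = '5\t0\t456'           # same key: replaces the earlier value
    assert element_dict['5\t0'] == '5\t0\t456'   # the first line is gone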

If you *do* need to handle lines like this, then it wouldn't be hard to modify my code for it. Each value in the dictionary would become a list, and matching lines are appended to that list. The line 'element_dict[key] = line' becomes 'element_dict.setdefault(key, []).append(line)', and writing to the file becomes a for-loop instead of a single write.
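Roughly, that version would look like this (an untested sketch, but it should be close):

    out = open('sa_int_2.txt','w')
    element_dict = {}
    for line in spot_int:
        cols = line.split('\t')
        key = '\t'.join(cols[:2])
        # Collect *all* lines sharing the same first two columns.
        element_dict.setdefault(key, []).append(line)
    for cor in spot_cor:
        try:
            for value in element_dict[cor]:
                out.write(value + '\n')
        except KeyError:
            pass
    out.close()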

The other assumption that I make is that these files are all of reasonable size to fit into memory. Your code makes this same assumption, too, of course. This assumption should be valid for up to a few tens of thousands (maybe even hundreds of thousands, depending on your hardware) of items. If you have more than this, then your best bet would be to start using a real database. (I've heard good things about mysql and sqlite support in Python...)
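For what it's worth, here's a very rough sketch of what the SQLite route might look like, using the sqlite3 module that newer Pythons include (older ones need the pysqlite package instead). The filename, table, and column names are all made up:

    import sqlite3

    conn = sqlite3.connect('spots.db')          # made-up filename
    cur = conn.cursor()
    cur.execute('CREATE TABLE spot_int (c1 TEXT, c2 TEXT, line TEXT)')
    # (An index on (c1, c2) would make the lookups below fast.)

    # Load spot_int once; the data lives on disk, not in memory.
    for line in spot_int:
        cols = line.split('\t')
        cur.execute('INSERT INTO spot_int VALUES (?, ?, ?)',
                    (cols[0], cols[1], line))
    conn.commit()

    out = open('sa_int_2.txt', 'w')
    for cor in spot_cor:
        c1, c2 = cor.split('\t')[:2]
        for (matched,) in cur.execute(
                'SELECT line FROM spot_int WHERE c1 = ? AND c2 = ?',
                (c1, c2)):
            out.write(matched + '\n')
    out.close()
    conn.close()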

Jeff Shannon
Technician/Programmer
Credit International



