OK, I've attached a complete program that works, if you want to just get it done, but I've also described what went wrong in your first attempt below.
# the i value was just for debugging, so I dropped it
primaryfile = open('/tmp/extract','r')
# read the primary file into a list for speed and so you aren't reading
# more than once
primary_lines = primaryfile.readlines()
# you didn't specify a mode for this, so it defaulted to read-only.
# Be explicit for clarity
secondaryfile = open('/tmp/unload', 'r')
# Open a separate file for output, otherwise you would have been writing
# and reading the same file over and over again, which usually causes errors
outputfile = open('/tmp/result-file', 'w')
# read the second file into a list, then you can scan through it over and
# over without hammering disk and re-reading a file you might have modified.
secondary_lines = secondaryfile.readlines()
# print is a statement, not a function.
print 'opened files'
# loop through the list, not the file
for line in primary_lines:
    pcompare = line
    # print is a statement, use the formatting operator to print variable values
    print 'primary line = %s' % (pcompare)
    # loop through the list, not the file
    for row in secondary_lines:
        scompare = row
        if pcompare == scompare:
            # print as a statement, not a function
            print 'secondary line = %s' % (scompare)
            # you were writing random # characters in a file (most likely
            # after the line read), this writes a comment to a new file,
            # which is usually clearer.
            # invert the test, and add the line to a set here then write
            # out the set at the end to get an output of lines without
            # duplication.
            outputfile.write('#%s' % (scompare))
print 'Done'

Kevin Faulkner wrote:
> Sorry about the time issue.
> On Friday 27 August 2010 23:50:00 you wrote:
>> I hope these are small files, the algorithm you wrote is not going to run
>> well as file size gets large (over 10,000 entries) Have you checked the
>> space/tab situation? Python uses indentation changes to indicate the end
>> of a block, so inconsistent use of tabs and spaces freaks it out. Here are
>> a couple questions:
> This is not a school project, so you won't be doing my homework or anything :)
> The space/tab issue is okay, but the script does not even get to the print(i),
> I even tried for line in secondaryfile: and the for loop still wouldn't be
> executed.
>> Are these always numbers?
> Yes, they are IP's from an Apache error log.
>> Do the files have to remain in their original order, or can you reorder
>> them during processing? How often does this have to run?
> they are not in order because one list is 852 entries and another list is
> 3300 entries. This script only needs to run once.
>> Do you have to "comment" the duplicate, or can you remove it?
> The plan is to remove it, but I wanted to see if my removal method would
> work, so I was trying to put a comment next to it.
>> Are there any other requirements not obvious from the description below?
> No real requirements, if anyone would like the original files I can give them
> to you, a lot of them are bots.
> Thank you :)
> -Kevin
>> Kevin Faulkner wrote:
>>> I was trying to pull duplicates out of 2 different files. Needless to say
>>> there are duplicates I would place a # next to the duplicate. Example
>>> files:
>>> file 1:   file 2:
>>> 433.3     947.3
>>> 543.1     749.0
>>> 741.1     859.2
>>> 238.5     433.3
>>> 839.2     229.1
>>> 583.6     990.1
>>> 863.4     741.1
>>> 859.2     101.8
>>>
>>> import string
>>> i=1
>>> primaryfile = open('/tmp/extract','r')
>>> secondaryfile = open('/tmp/unload')
>>>
>>> for line in primaryfile:
>>>     pcompare = line
>>>     print(pcompare)
>>>
>>>     for row in secondaryfile:
>>>         i = i + 1
>>>         print(i)
>>>         scompare = row
>>>
>>>         if pcompare == scompare:
>>>             print(scompare)
>>>             secondaryfile.write('#')
>>>
>>> With this code it should go through the files and find a duplicate and
>>> place a '#' next to it. But for some reason it doesn't even get to
>>> the second for statement. I don't know what else to do. Please offer
>>> some assistance. :)
>>> ---------------------------------------------------
>>> PLUG-discuss mailing list - PLUG-discuss@lists.plug.phoenix.az.us
>>> To subscribe, unsubscribe, or to change your mail settings:
>>> http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss
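As a small illustration of why the rewrite above reads each file into a list first: a file object can only be iterated once, so a nested loop over it finds nothing on every pass after the first. This is only a sketch, written for Python 3 (an assumption; the code in this thread is Python 2), and it uses io.StringIO with made-up contents as a stand-in for /tmp/unload so it runs without any files on disk.

```python
import io

# io.StringIO stands in for an open file (an assumption made so the sketch
# needs nothing in /tmp); real file objects iterate the same way.
secondary = io.StringIO("947.3\n433.3\n741.1\n")

# The first pass consumes the file and leaves the position at end of file.
first_pass = [row for row in secondary]

# A second pass over the same object therefore yields nothing at all.
second_pass = [row for row in secondary]

# Rewinding and reading into a list once gives something that can be
# scanned as many times as the outer loop needs.
secondary.seek(0)
secondary_lines = secondary.readlines()
rescans = [len(secondary_lines) for _ in range(2)]

print(len(first_pass), len(second_pass), rescans)
```

Calling seek(0) before every inner loop would also work, but reading into a list once avoids going back to disk on each pass.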
def sort_and_compare_files(extract, unload, clean_extract, clean_unload, clean_combined):
    # Start every handle as None so the finally block can test each one
    # even when an earlier open() has already failed.
    input_extract = input_unload = None
    output_extract = output_unload = output_combined = None
    try:
        input_extract = open(extract, 'r')
        input_unload = open(unload, 'r')
        output_extract = open(clean_extract, 'w')
        output_unload = open(clean_unload, 'w')
        output_combined = open(clean_combined, 'w')

        # Each line becomes one set member, so repeats within a file collapse
        extract_set = set(input_extract)
        unload_set = set(input_unload)

        # Lines found in only one file, and lines found in exactly one of the two
        extract_unique = extract_set.difference(unload_set)
        unload_unique = unload_set.difference(extract_set)
        combined_unique = extract_set.symmetric_difference(unload_set)

        # Sets are unordered, so the output order is arbitrary
        output_extract.writelines(extract_unique)
        output_unload.writelines(unload_unique)
        output_combined.writelines(combined_unique)
    except IOError:
        print 'IO Error accessing files'
    finally:
        if input_extract is not None:
            input_extract.close()
        if input_unload is not None:
            input_unload.close()
        if output_extract is not None:
            output_extract.close()
        if output_unload is not None:
            output_unload.close()
        if output_combined is not None:
            output_combined.close()

#This code is for debugging and unit testing
if __name__ == '__main__':
    sort_and_compare_files('/tmp/extract', '/tmp/unload', '/tmp/clean-extract', '/tmp/clean-unload', '/tmp/combined')
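To make the set operations in the attached program concrete, here is a rough sketch that runs them on the sample values from the example files quoted in this thread. It is written for Python 3 (an assumption; the attachment itself is Python 2), works on in-memory lists instead of files, and the variable names other than extract_set and unload_set are made up for illustration.

```python
# Sample values copied from the example files quoted in this thread.
extract = ["433.3", "543.1", "741.1", "238.5", "839.2", "583.6", "863.4", "859.2"]
unload = ["947.3", "749.0", "859.2", "433.3", "229.1", "990.1", "741.1", "101.8"]

extract_set = set(extract)
unload_set = set(unload)

# Entries present in both files (the duplicates being hunted).
duplicates = extract_set & unload_set

# Same operations the attachment calls difference() and
# symmetric_difference(), written with operators.
extract_unique = extract_set - unload_set
combined_unique = extract_set ^ unload_set

# Sets do not keep order. If the original order matters, filter the list
# against the set instead of writing the set out directly.
extract_in_order = [entry for entry in extract if entry not in unload_set]

print(sorted(duplicates))
print(extract_in_order)
print(len(combined_unique))
```

Membership tests against a set take roughly constant time, so this stays fast for the 852 and 3300 entry lists mentioned above, where the nested loop compares every pair.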