Re: [Tutor] How to list/process files with identical character strings
On Wed, Jun 25, 2014 at 09:47:07PM -0700, Alex Kleider wrote:

> Thanks for elucidating this. I didn't know that "several thousand"
> would still be considered a small number.

On a server, desktop, laptop or notepad, several thousand is not many. My computer can generate a dict with a million items in less than a second and a half:

    py> with Stopwatch():
    ...     d = {n: (3*n+2)**4 for n in range(1000000)}
    ...
    time taken: 1.331450 seconds

and then process it in under half a second:

    py> with Stopwatch():
    ...     x = sum(d[n] for n in range(1000000))
    ...
    time taken: 0.429471 seconds
    py> x
    16200013499990999994000001300000

For an embedded device, with perhaps 16 megabytes of RAM, thousands of items is a lot. But for a machine with gigabytes of RAM, it's tiny.

--
Steven
___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
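The Stopwatch helper used above is not part of the standard library and its definition is not shown in the thread. A minimal sketch of a comparable timer, assuming only that it prints the elapsed wall-clock time on exit (the name `stopwatch` and its output format are illustrative, not Steven's actual code):

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch():
    """Hypothetical stand-in for the Stopwatch helper: print elapsed time."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print("time taken: %f seconds" % elapsed)


# Build a dict with a million items, then read every value back.
with stopwatch():
    d = {n: (3 * n + 2) ** 4 for n in range(1000000)}

with stopwatch():
    x = sum(d[n] for n in range(1000000))

print(x)
```

The timings will differ from machine to machine; the point is only that a million dict entries is a sub-second job on ordinary hardware.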
Re: [Tutor] How to list/process files with identical character strings
On 2014-06-25 00:35, Wolfgang Maier wrote:
> On 25.06.2014 00:55, Alex Kleider wrote:
>> I was surprised that the use of dictionaries was suggested, especially
>> since we were told there were many many files.
>
> The OP was talking about several thousands of files, which is, of course,
> too many for manual processing, but is far from an impressive number of
> elements for a Python dictionary on any modern computer. Dictionaries are
> fast and efficient and their memory consumption is a factor you will have
> to think about only in extreme cases (and this is definitely not one of
> them). What is more, your sequential approach of always comparing a pair
> of elements hides the fact that you will still have the filenames in
> memory as a list (at least this is what os.listdir would return) and the
> difference between that and the proposed dictionary is not that huge.
>
> What's more important in my opinion is that while the two approaches may
> look equally potent for the given example, the dictionary provides more
> flexibility, i.e., the code is easier to adjust to new problems. Think of
> the aforementioned situation that you could also have three parts of a
> file instead of two. While your suggestion would have to be rewritten
> almost from scratch, very few changes would be required to the
> dictionary-based code.
>
> Best,
> Wolfgang

Thanks for elucidating this. I didn't know that "several thousand" would still be considered a small number. If this is the case, then certainly your points are well taken.

Gratefully,
alex
Re: [Tutor] How to list/process files with identical character strings
On 25.06.2014 00:55, Alex Kleider wrote:
> I was surprised that the use of dictionaries was suggested, especially
> since we were told there were many many files.

The OP was talking about several thousands of files, which is, of course, too many for manual processing, but is far from an impressive number of elements for a Python dictionary on any modern computer. Dictionaries are fast and efficient and their memory consumption is a factor you will have to think about only in extreme cases (and this is definitely not one of them). What is more, your sequential approach of always comparing a pair of elements hides the fact that you will still have the filenames in memory as a list (at least this is what os.listdir would return) and the difference between that and the proposed dictionary is not that huge.

What's more important in my opinion is that while the two approaches may look equally potent for the given example, the dictionary provides more flexibility, i.e., the code is easier to adjust to new problems. Think of the aforementioned situation that you could also have three parts of a file instead of two. While your suggestion would have to be rewritten almost from scratch, very few changes would be required to the dictionary-based code.

Best,
Wolfgang
Re: [Tutor] How to list/process files with identical character strings
Alex Kleider wrote:

> On 2014-06-24 14:01, mark murphy wrote:
>> Hi Danny, Marc, Peter and Alex,
>>
>> Thanks for the responses! Very much appreciated.
>>
>> I will take these pointers and see what I can pull together.
>>
>> Thanks again to all of you for taking the time to help!
>
> Assuming your files are ordered and therefore ones that need to be
> paired will be next to each other,
> and that you can get an ordered listing of their names,
> here's a suggestion as to the sort of thing that might work:
>
> f2process = None
> for fname in listing:
>     if not f2process:
>         f2process = fname
>     elif to_be_paired(f2process, fname):
>         process(marry(f2process, fname))
>         already_processed = fname
>         f2process = None
>     else:
>         process(f2process)
>         already_processed = fname
>         f2process = fname
>
> if fname != already_processed:
>     # I'm not sure if 'fname' survives the for/in statement.
>     # If it doesn't, another approach to not losing the last file will
>     # be required.
>     # I hope those more expert will comment.
>     process(fname)
>
>
> def to_be_paired(f1, f2):
>     """Returns a boolean: true if the files need to be amalgamated."""
>     pass  # your code goes here.
>
> def marry(f1, f2):
>     """Returns a file object which is a combination of the two files
>     named by f1 and f2."""
>     pass  # your code here.
>
> def process(fname_or_object):
>     """Accepts either a file name or a file object, Does what you want
>     done."""
>     pass  # your code here.
>
> Comments?
> I was surprised that the use of dictionaries was suggested, especially
> since we were told there were many many files.

(1) 10**6 would be "many files" as in "I don't want to touch them manually", but no problem for the dict approach. "a directory of several thousand daily satellite images" should certainly be manageable.

(2a) os.listdir() returns a list, so you consume memory proportional to the number of files anyway.

(2b) Even if you replace listdir() with a function that generates one filename at a time you cannot safely assume that the names are sorted -- you have to put them in a list to sort them.

(3a) Dictionaries are *the* data structure in Python. You should rather be surprised when dict is not proposed for a problem. I might go as far as to say that most of the Python language is syntactic sugar for dicts ;)

This leads to

(3b) dict-based solutions are usually both efficient and

(3c) concise

To back 3c here's how I would have written the code if it weren't for educational purposes:

    import collections
    import os

    directory = "some/directory"
    files = os.listdir(directory)
    days = collections.defaultdict(list)
    for filename in files:
        days[filename[:8]].append(os.path.join(directory, filename))
    for fileset in days.values():
        if len(fileset) > 1:
            print("merging", fileset)

But I admit that sort/groupby is also fine:

    import itertools
    import os

    directory = "some/directory"
    files = os.listdir(directory)
    files.sort()
    for _prefix, fileset in itertools.groupby(files, key=lambda name: name[:8]):
        fileset = list(fileset)
        if len(fileset) > 1:
            print("merging", [os.path.join(directory, name) for name in fileset])
Re: [Tutor] How to list/process files with identical character strings
On 2014-06-24 14:01, mark murphy wrote:
> Hi Danny, Marc, Peter and Alex,
>
> Thanks for the responses! Very much appreciated.
>
> I will take these pointers and see what I can pull together.
>
> Thanks again to all of you for taking the time to help!

Assuming your files are ordered and therefore ones that need to be paired will be next to each other, and that you can get an ordered listing of their names, here's a suggestion as to the sort of thing that might work:

    f2process = None
    for fname in listing:
        if not f2process:
            f2process = fname
        elif to_be_paired(f2process, fname):
            process(marry(f2process, fname))
            already_processed = fname
            f2process = None
        else:
            process(f2process)
            already_processed = fname
            f2process = fname

    if fname != already_processed:
        # I'm not sure if 'fname' survives the for/in statement.
        # If it doesn't, another approach to not losing the last file will
        # be required.
        # I hope those more expert will comment.
        process(fname)


    def to_be_paired(f1, f2):
        """Returns a boolean: true if the files need to be amalgamated."""
        pass  # your code goes here.

    def marry(f1, f2):
        """Returns a file object which is a combination of the two files
        named by f1 and f2."""
        pass  # your code here.

    def process(fname_or_object):
        """Accepts either a file name or a file object, Does what you want
        done."""
        pass  # your code here.

Comments?

I was surprised that the use of dictionaries was suggested, especially since we were told there were many many files.
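On the question raised in the code comments: in Python the loop variable does remain bound after a for statement, provided the loop body ran at least once. The bookkeeping with already_processed can still drop the final file, though: when the last name does not pair with the one before it, already_processed is set to that last name without it ever being processed. One way to sidestep this is to track only the pending name. The following is a sketch of that idea, not Alex's code; `pair_up` and `same_day` are hypothetical names, and the sample file names assume the 8-character date prefix from the original question.

```python
def pair_up(listing, to_be_paired):
    """Yield (a, b) for paired names and (a,) for unpaired ones.

    Assumes `listing` is already sorted so that partners are adjacent.
    """
    pending = None
    for fname in listing:
        if pending is None:
            pending = fname
        elif to_be_paired(pending, fname):
            yield (pending, fname)
            pending = None
        else:
            yield (pending,)
            pending = fname
    # Whatever is still pending after the loop has no partner.
    if pending is not None:
        yield (pending,)


def same_day(f1, f2):
    """Hypothetical pairing rule: identical first 8 characters."""
    return f1[:8] == f2[:8]


listing = sorted(["A2014176093000", "A2014175101730", "A2014175101500"])
groups = list(pair_up(listing, same_day))
print(groups)
```

Each yielded tuple can then be handed to whatever merge or single-file processing step applies.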
Re: [Tutor] How to list/process files with identical character strings
On 24/06/2014 22:01, mark murphy wrote:
> Hi Danny, Marc, Peter and Alex,
>
> Thanks for the responses! Very much appreciated.
>
> I will take these pointers and see what I can pull together.
>
> Thanks again to all of you for taking the time to help!
>
> Cheers,
> Mark
>
> On Tue, Jun 24, 2014 at 4:39 PM, Danny Yoo wrote:
>> The sorting approach sounds reasonable. We might even couple it with
>> itertools.groupby() to get the consecutive grouping done for us.
>>
>> https://docs.python.org/2/library/itertools.html#itertools.groupby
>>
>> For example, the following demonstrates that there's a lot that the
>> library will do for us that should apply directly to Mark's problem:
>>
>> #
>> import itertools
>> import random
>>
>> def firstTwoLetters(s): return s[:2]
>>
>> grouped = itertools.groupby(
>>     sorted(open('/usr/share/dict/words')),
>>     key=firstTwoLetters)
>>
>> for k, g in grouped:
>>     print k, list(g)[:5]
>> #

In order to really overwhelm you, see more_itertools.pairwise here http://pythonhosted.org//more-itertools/api.html as I've found it useful on several occasions.

--
My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language.

Mark Lawrence
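For readers without more_itertools installed, a pairwise helper can be sketched with the standard library alone, following the well-known itertools recipe. This is an illustration of the idea rather than the more_itertools source; the sample file names and the 8-character prefix are assumptions taken from the original question. (Python 3.10 and later also ship itertools.pairwise.)

```python
import itertools


def pairwise(iterable):
    """Return overlapping pairs: s -> (s0, s1), (s1, s2), (s2, s3), ..."""
    a, b = itertools.tee(iterable)
    next(b, None)  # advance the second iterator by one item
    return zip(a, b)


names = sorted([
    "A2014174090000",
    "A2014175101500",
    "A2014175101730",
    "A2014176093000",
])

# Adjacent names that share the 8-character date prefix are split images.
split_days = [(first, second)
              for first, second in pairwise(names)
              if first[:8] == second[:8]]
print(split_days)
```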
Re: [Tutor] How to list/process files with identical character strings
Hi Danny, Marc, Peter and Alex,

Thanks for the responses! Very much appreciated.

I will take these pointers and see what I can pull together.

Thanks again to all of you for taking the time to help!

Cheers,
Mark

On Tue, Jun 24, 2014 at 4:39 PM, Danny Yoo wrote:

> The sorting approach sounds reasonable. We might even couple it with
> itertools.groupby() to get the consecutive grouping done for us.
>
> https://docs.python.org/2/library/itertools.html#itertools.groupby
>
> For example, the following demonstrates that there's a lot that the
> library will do for us that should apply directly to Mark's problem:
>
> #
> import itertools
> import random
>
> def firstTwoLetters(s): return s[:2]
>
> grouped = itertools.groupby(
>     sorted(open('/usr/share/dict/words')),
>     key=firstTwoLetters)
>
> for k, g in grouped:
>     print k, list(g)[:5]
> #

--
Mark S. Murphy
Alumnus
Department of Geography
msmur...@alumni.unc.edu
951-252-4325
Re: [Tutor] How to list/process files with identical character strings
The sorting approach sounds reasonable. We might even couple it with itertools.groupby() to get the consecutive grouping done for us.

https://docs.python.org/2/library/itertools.html#itertools.groupby

For example, the following demonstrates that there's a lot that the library will do for us that should apply directly to Mark's problem:

    #
    import itertools
    import random

    def firstTwoLetters(s): return s[:2]

    grouped = itertools.groupby(
        sorted(open('/usr/share/dict/words')),
        key=firstTwoLetters)

    for k, g in grouped:
        print k, list(g)[:5]
    #
Re: [Tutor] How to list/process files with identical character strings
On Tue, Jun 24, 2014 at 1:02 PM, Peter Otten <__pete...@web.de> wrote:

> Sorting is probably the approach that is easiest to understand, but an
> alternative would be to put the files into a dict that maps the 8-char
> prefix to a list of files with that prefix:

I was debating the virtues of the two approaches, but figured I'd err on the side of simplicity...
Re: [Tutor] How to list/process files with identical character strings
Hi Mark,

Part of the problem statement sounds a little unusual to me, so I need to push on it to confirm. How do we know that there are only two files at a time that we need to manage?

The naming convention described in the problem:

---
The naming convention of the files is as follows: TYYYYDDDHHMMSS, where:
T = one character satellite code
YYYY = 4 digit year
DDD = Julian date
HH = 2-digit hour
MM = 2-digit minute
SS = 2-digit second
---

allows for multiple collisions on the key TYYYYDDD. But without additional information, having more than two collisions seems a likely possibility to me! Is there some other convention in play that prevents >2 collisions from occurring?

The real world can be a bit dirty, so what happens if there are more? Is that an error?

Good luck to you!
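One way to answer the ">2 collisions" question empirically is to count the files per day before merging anything. The sketch below assumes the key is the first 8 characters (T plus a 4-digit year plus a 3-digit day of year); `classify_days` and the sample names are made up for illustration.

```python
import collections


def classify_days(filenames, prefix_len=8):
    """Split day groups into expected pairs and unexpected larger groups."""
    days = collections.defaultdict(list)
    for name in filenames:
        days[name[:prefix_len]].append(name)
    pairs = {p: sorted(f) for p, f in days.items() if len(f) == 2}
    unexpected = {p: sorted(f) for p, f in days.items() if len(f) > 2}
    return pairs, unexpected


names = [
    "A2014174090000",                                       # single image
    "A2014175101500", "A2014175101730",                     # split in two
    "A2014176093000", "A2014176093200", "A2014176093400",   # three parts
]
pairs, unexpected = classify_days(names)
print("pairs:", pairs)
print("needs a closer look:", unexpected)
```

Whether a group of three or more is an error or just another merge case is a decision only the data owner can make; the sketch merely surfaces such groups.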
Re: [Tutor] How to list/process files with identical character strings
Peter Otten wrote:

> for fileset in days.values():
>     if len(fileset) > 1:
>         # process only the list with one or more files

That should have been

          # process only the lists with two or more files

>         print("merging", fileset)
Re: [Tutor] How to list/process files with identical character strings
mark murphy wrote:

> Hello Python Tutor Community,
>
> This is my first post and I am just getting started with Python, so I
> apologize in advance for any lack of etiquette.
>
> I have a directory of several thousand daily satellite images that I need
> to process. Approximately 300 of these images are split in half, so in
> just these instances there will be two files for one day. I need to merge
> each pair of split images into one image.
>
> The naming convention of the files is as follows: TYYYYDDDHHMMSS, where:
> T = one character satellite code
> YYYY = 4 digit year
> DDD = Julian date
> HH = 2-digit hour
> MM = 2-digit minute
> SS = 2-digit second
>
> What I hope to be able to do is scan the directory, and for each instance
> where there are two files where the first 8 characters (TYYYYDDD) are
> identical, run a process on those two files and place the output (named
> TYYYYDDD) in a new directory.
>
> The actual processing part should be easy enough for me to figure out.
> The part about finding the split files (each pair of files with the same
> first 8 characters) and setting those up to be processed is way beyond me.
> I've done several searches for examples and have not been able to find
> what I am looking for.

Sorting is probably the approach that is easiest to understand, but an alternative would be to put the files into a dict that maps the 8-char prefix to a list of files with that prefix:

    import os

    directory = "/some/directory"
    files = os.listdir(directory)
    days = {}
    for filename in files:
        prefix = filename[:8]
        filepath = os.path.join(directory, filename)
        if prefix in days:
            # add file to the existing list
            days[prefix].append(filepath)
        else:
            # add a new list with one file
            days[prefix] = [filepath]

    for fileset in days.values():
        if len(fileset) > 1:
            # process only the list with one or more files
            print("merging", fileset)

(The

    if prefix in days:
        days[prefix].append(filepath)
    else:
        days[prefix] = [filepath]

part can be simplified with the dict.setdefault() method or a collections.defaultdict)
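The dict.setdefault() simplification mentioned at the end could look roughly like this. It is a sketch rather than Peter's code: the function name `group_by_day` and the sample names are illustrative, and it works on a plain list of names so it can be tried without a real directory.

```python
def group_by_day(filenames, prefix_len=8):
    """Map each prefix to the list of file names that start with it."""
    days = {}
    for filename in filenames:
        # setdefault returns the existing list for this prefix, or stores
        # and returns a new empty list the first time the prefix is seen.
        days.setdefault(filename[:prefix_len], []).append(filename)
    return days


days = group_by_day(["A2014175101500", "A2014176093000", "A2014175101730"])
for prefix, fileset in sorted(days.items()):
    if len(fileset) > 1:
        print("merging", prefix, fileset)
```

With a real directory, the list would come from os.listdir() and each name would be joined with the directory path before processing.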
Re: [Tutor] How to list/process files with identical character strings
On Tue, Jun 24, 2014 at 8:34 AM, mark murphy wrote:

> What I hope to be able to do is scan the directory, and for each instance
> where there are two files where the first 8 characters (TYYYYDDD) are
> identical, run a process on those two files and place the output (named
> TYYYYDDD) in a new directory.

I don't know the details of your file system, but I would guess that those files would have some sort of signifier to indicate "this file is the first part of a multi-part image"; "this file is the second part", etc. - maybe the first half has the extension ".001" and the second half has the extension ".002"? If so, I would search for files with the "first part" signifier, and for each one I found I would try to join it with a file with the same base name but the "second part" signifier.

If, on the other hand, there's no signifier - just the same date but with a slightly different timestamp, you can:
1) grab the list of filenames
2) sort it
3) iterate through the list and compare each filename with the previous filename; if the first 8 characters match, you do your processing magic; if not, you move on.
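The three steps for the no-signifier case can be sketched as follows. This is an illustration under assumptions: the function name `find_split_pairs` is made up, the prefix length of 8 comes from the original question, and the demo creates empty placeholder files in a temporary directory instead of touching real images.

```python
import os
import tempfile


def find_split_pairs(directory, prefix_len=8):
    """Return (previous, current) name pairs that share the same prefix."""
    names = sorted(os.listdir(directory))  # steps 1 and 2: list, then sort
    pairs = []
    previous = None
    for name in names:                     # step 3: compare with previous
        if previous is not None and name[:prefix_len] == previous[:prefix_len]:
            pairs.append((previous, name))
        previous = name
    return pairs


# Demo with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["A2014174090000", "A2014175101500",
                 "A2014175101730", "A2014176093000"]:
        open(os.path.join(tmp, name), "w").close()
    pairs = find_split_pairs(tmp)

print(pairs)
```

Note that a day with three files would produce two overlapping pairs here, so this approach fits best when at most two files per day are expected.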
Re: [Tutor] How to list/process files with identical character strings
On 2014-06-24 08:34, mark murphy wrote:
> Hello Python Tutor Community,
>
> The actual processing part should be easy enough for me to figure out.
> The part about finding the split files (each pair of files with the same
> first 8 characters) and setting those up to be processed is way beyond me.
> I've done several searches for examples and have not been able to find
> what I am looking for.

Since your file system probably already keeps them ordered, each pair will be next to each other. It would seem a simple matter to compare each file name to the one after it and if they match, process the two together.
[Tutor] How to list/process files with identical character strings
Hello Python Tutor Community,

This is my first post and I am just getting started with Python, so I apologize in advance for any lack of etiquette.

I have a directory of several thousand daily satellite images that I need to process. Approximately 300 of these images are split in half, so in just these instances there will be two files for one day. I need to merge each pair of split images into one image.

The naming convention of the files is as follows: TYYYYDDDHHMMSS, where:
T = one character satellite code
YYYY = 4 digit year
DDD = Julian date
HH = 2-digit hour
MM = 2-digit minute
SS = 2-digit second

What I hope to be able to do is scan the directory, and for each instance where there are two files where the first 8 characters (TYYYYDDD) are identical, run a process on those two files and place the output (named TYYYYDDD) in a new directory.

The actual processing part should be easy enough for me to figure out. The part about finding the split files (each pair of files with the same first 8 characters) and setting those up to be processed is way beyond me. I've done several searches for examples and have not been able to find what I am looking for.

Can anyone help? Thanks so much!

Mark

--
Mark S. Murphy
Alumnus
Department of Geography
msmur...@alumni.unc.edu
951-252-4325
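For reference, a name in this convention can be taken apart with the standard datetime module. The sketch assumes the layout is one satellite character, a 4-digit year, a 3-digit day of year, then HHMMSS, with nothing else in the first 14 characters; `parse_name` and the sample name are hypothetical.

```python
from datetime import datetime


def parse_name(name):
    """Split a file name into its satellite code and acquisition time."""
    satellite = name[0]
    # %Y = 4-digit year, %j = day of year (001-366), then hour/minute/second
    timestamp = datetime.strptime(name[1:14], "%Y%j%H%M%S")
    return satellite, timestamp


satellite, when = parse_name("A2014175101500")
print(satellite, when.isoformat())
```

Grouping on the first 8 characters of the name is equivalent to grouping on the satellite code plus `when.date()`.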