Re: [Tutor] How to list/process files with identical character strings

2014-06-26 Thread Steven D'Aprano
On Wed, Jun 25, 2014 at 09:47:07PM -0700, Alex Kleider wrote:

> Thanks for elucidating this.  I didn't know that "several thousand" 
> would still be considered a small number.

On a server, desktop, laptop or notepad, several thousand is not many. 
My computer can generate a dict with a million items in less than a 
second and a half:

py> with Stopwatch():
...     d = {n: (3*n+2)**4 for n in range(1000000)}
...
time taken: 1.331450 seconds


and then process it in under half a second:

py> with Stopwatch():
...     x = sum(d[n] for n in range(1000000))
...
time taken: 0.429471 seconds
py> x
16200013499990999994000001300000

For an embedded device, with perhaps 16 megabytes of RAM, thousands of 
items is a lot. But for a machine with gigabytes of RAM, it's tiny.
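
Note that Stopwatch is not part of the standard library; it is a small
timing helper. A minimal sketch of such a context manager, assuming
time.perf_counter as the clock (the class below is an illustration and not
necessarily the helper used for the timings above):

```python
import time


class Stopwatch:
    """Hypothetical helper: report the wall-clock time spent in a with-block."""

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self.start
        print("time taken: %f seconds" % self.elapsed)
        return False  # do not swallow exceptions raised inside the block


with Stopwatch() as sw:
    d = {n: (3 * n + 2) ** 4 for n in range(1000)}
```

The measured time will of course vary from machine to machine.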


-- 
Steven
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] How to list/process files with identical character strings

2014-06-25 Thread Alex Kleider

On 2014-06-25 00:35, Wolfgang Maier wrote:
> On 25.06.2014 00:55, Alex Kleider wrote:
>> I was surprised that the use of dictionaries was suggested, especially
>> since we were told there were many many files.
>
> The OP was talking about several thousands of files, which is, of
> course, too many for manual processing, but is far from an impressive
> number of elements for a Python dictionary on any modern computer.
> Dictionaries are fast and efficient and their memory consumption is a
> factor you will have to think about only in extreme cases (and this is
> definitely not one of them). What is more, your sequential approach of
> always comparing a pair of elements hides the fact that you will still
> have the filenames in memory as a list (at least this is what
> os.listdir would return) and the difference between that and the
> proposed dictionary is not that huge.
>
> What's more important in my opinion is that while the two approaches
> may look equally potent for the given example, the dictionary provides
> more flexibility, i.e., the code is easier to adjust to new problems.
> Think of the aforementioned situation that you could also have three
> parts of a file instead of two. While your suggestion would have to be
> rewritten almost from scratch, very few changes would be required
> to the dictionary-based code.
>
> Best,
> Wolfgang


Thanks for elucidating this.  I didn't know that "several thousand" 
would still be considered a small number.  If this is the case, then 
certainly your points are well taken.

Gratefully,
alex


Re: [Tutor] How to list/process files with identical character strings

2014-06-25 Thread Wolfgang Maier

On 25.06.2014 00:55, Alex Kleider wrote:

> I was surprised that the use of dictionaries was suggested, especially
> since we were told there were many many files.



The OP was talking about several thousands of files, which is, of 
course, too many for manual processing, but is far from an impressive 
number of elements for a Python dictionary on any modern computer.
Dictionaries are fast and efficient and their memory consumption is a 
factor you will have to think about only in extreme cases (and this is 
definitely not one of them). What is more, your sequential approach of 
always comparing a pair of elements hides the fact that you will still 
have the filenames in memory as a list (at least this is what os.listdir 
would return) and the difference between that and the proposed 
dictionary is not that huge.


What's more important in my opinion is that while the two approaches may 
look equally potent for the given example, the dictionary provides more 
flexibility, i.e., the code is easier to adjust to new problems. Think 
of the aforementioned situation that you could also have three parts of 
a file instead of two. While your suggestion would have to be rewritten 
almost from scratch, very few changes would be required to the 
dictionary-based code.
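
To illustrate the flexibility point, here is a minimal sketch of grouping
by prefix in which the same code handles a day split into two, three or
more files. The function name and the sample file names are made up for
the example:

```python
import collections


def group_by_prefix(filenames, prefix_len=8):
    """Map each prefix to the sorted list of names that share it."""
    groups = collections.defaultdict(list)
    for name in filenames:
        groups[name[:prefix_len]].append(name)
    return {prefix: sorted(names) for prefix, names in groups.items()}


# Made-up names: one day split in three, one split in two, one whole.
names = [
    "A2014175083000", "A2014175091500", "A2014175100000",
    "A2014176083000", "A2014176091500",
    "A2014177083000",
]
# Keep only the days that have more than one file, whatever the count.
split_days = {p: g for p, g in group_by_prefix(names).items() if len(g) > 1}
```

Nothing in the grouping step depends on the number of parts; only the
later merge step would need to accept a list of any length.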


Best,
Wolfgang



Re: [Tutor] How to list/process files with identical character strings

2014-06-25 Thread Peter Otten
Alex Kleider wrote:

> On 2014-06-24 14:01, mark murphy wrote:
>> Hi Danny, Marc, Peter and Alex,
>> 
>> Thanks for the responses!  Very much appreciated.
>> 
>> I will take these pointers and see what I can pull together.
>> 
>> Thanks again to all of you for taking the time to help!
> 
> 
> Assuming your files are ordered and therefore the ones that need to be
> paired will be next to each other,
> and that you can get an ordered listing of their names,
> here's a suggestion as to the sort of thing that might work:
> 
> f2process = None
> already_processed = None
> for fname in listing:
>     if not f2process:
>         f2process = fname
>     elif to_be_paired(f2process, fname):
>         process(marry(f2process, fname))
>         already_processed = fname
>         f2process = None
>     else:
>         process(f2process)
>         already_processed = f2process
>         f2process = fname
> 
> if fname != already_processed:
>     # I'm not sure if 'fname' survives the for/in statement.
>     # If it doesn't, another approach to not losing the last file will
>     # be required.
>     # I hope those more expert will comment.
>     process(fname)
> 
> 
> def to_be_paired(f1, f2):
>     """Returns a boolean: true if the files need to be amalgamated."""
>     pass  # your code goes here.
> 
> def marry(f1, f2):
>     """Returns a file object which is a combination of the two files
>     named by f1 and f2."""
>     pass  # your code here.
> 
> def process(fname_or_object):
>     """Accepts either a file name or a file object. Does what you want
>     done."""
>     pass  # your code here.
> 
> Comments?
> I was surprised that the use of dictionaries was suggested, especially
> since we were told there were many many files.

(1) 10**6 would be "many files" as in "I don't want to touch them manually",
but no problem for the dict approach. "a directory of several thousand daily 
satellite images" should certainly be manageable.

(2a) os.listdir() returns a list, so you consume memory proportional to the
number of files anyway.

(2b) Even if you replace listdir() with a function that generates one 
filename at a time you cannot safely assume that the names are sorted 
-- you have to put them in a list to sort them.

(3a) Dictionaries are *the* data structure in Python. You should rather be 
surprised when dict is not proposed for a problem. I might go as far as to 
say that most of the Python language is syntactic sugar for dicts ;) This 
leads to

(3b) dict-based solutions are usually both efficient and 

(3c) concise

To back 3c here's how I would have written the code if it weren't for 
educational purposes:

import collections
import os

directory = "some/directory"
files = os.listdir(directory)
days = collections.defaultdict(list)

for filename in files:
    # group the full paths by their 8-character prefix
    days[filename[:8]].append(os.path.join(directory, filename))

for fileset in days.values():
    if len(fileset) > 1:
        print("merging", fileset)

But I admit that sort/groupby is also fine:

import itertools
import os

directory = "some/directory"
files = os.listdir(directory)
files.sort()  # groupby() only groups adjacent items

for _prefix, fileset in itertools.groupby(files, key=lambda name: name[:8]):
    fileset = list(fileset)
    if len(fileset) > 1:
        print("merging", [os.path.join(directory, name) for name in fileset])




Re: [Tutor] How to list/process files with identical character strings

2014-06-24 Thread Alex Kleider

On 2014-06-24 14:01, mark murphy wrote:

> Hi Danny, Marc, Peter and Alex,
>
> Thanks for the responses!  Very much appreciated.
>
> I will take these pointers and see what I can pull together.
>
> Thanks again to all of you for taking the time to help!



Assuming your files are ordered and therefore the ones that need to be 
paired will be next to each other,
and that you can get an ordered listing of their names,
here's a suggestion as to the sort of thing that might work:

f2process = None
already_processed = None
for fname in listing:
    if not f2process:
        f2process = fname
    elif to_be_paired(f2process, fname):
        process(marry(f2process, fname))
        already_processed = fname
        f2process = None
    else:
        process(f2process)
        already_processed = f2process
        f2process = fname

if fname != already_processed:
    # I'm not sure if 'fname' survives the for/in statement.
    # If it doesn't, another approach to not losing the last file will
    # be required.
    # I hope those more expert will comment.
    process(fname)


def to_be_paired(f1, f2):
    """Returns a boolean: true if the files need to be amalgamated."""
    pass  # your code goes here.

def marry(f1, f2):
    """Returns a file object which is a combination of the two files
    named by f1 and f2."""
    pass  # your code here.

def process(fname_or_object):
    """Accepts either a file name or a file object. Does what you want
    done."""
    pass  # your code here.

Comments?
I was surprised that the use of dictionaries was suggested, especially 
since we were told there were many many files.
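
On whether a loop variable survives the for statement: in Python it stays
bound to the last item once the loop finishes, but it is never assigned if
the iterable is empty. A minimal sketch of both cases (the function name
is made up for the example):

```python
def last_seen(items):
    """Return the loop variable's value after the loop, or None if it never ran."""
    for item in items:
        pass
    try:
        return item  # still bound to the last element after the loop
    except UnboundLocalError:
        return None  # empty iterable: 'item' was never assigned
```

So code that relies on the variable after the loop has to allow for an
empty listing, for instance by initialising the name before the loop.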






Re: [Tutor] How to list/process files with identical character strings

2014-06-24 Thread Mark Lawrence

On 24/06/2014 22:01, mark murphy wrote:
> Hi Danny, Marc, Peter and Alex,
>
> Thanks for the responses!  Very much appreciated.
>
> I will take these pointers and see what I can pull together.
>
> Thanks again to all of you for taking the time to help!
>
> Cheers,
> Mark
>
> On Tue, Jun 24, 2014 at 4:39 PM, Danny Yoo <d...@hashcollision.org> wrote:
>
>> The sorting approach sounds reasonable.  We might even couple it with
>> itertools.groupby() to get the consecutive grouping done for us.
>>
>> https://docs.python.org/2/library/itertools.html#itertools.groupby
>>
>> For example, the following demonstrates that there's a lot that the
>> library will do for us that should apply directly to Mark's problem:
>>
>> #
>> import itertools
>>
>> def firstTwoLetters(s): return s[:2]
>>
>> grouped = itertools.groupby(
>>     sorted(open('/usr/share/dict/words')),
>>     key=firstTwoLetters)
>>
>> for k, g in grouped:
>>     print(k, list(g)[:5])
>> #


In order to really overwhelm you, see more_itertools.pairwise here 
http://pythonhosted.org//more-itertools/api.html as I've found it useful 
on several occasions.
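
For readers without more_itertools installed, a pairwise helper can be
sketched with the standard library alone. The version below follows the
well-known itertools.tee recipe; the sample names are made up, and the
8-character comparison is an assumption taken from the original question:

```python
import itertools


def pairwise(iterable):
    """Yield overlapping pairs: s -> (s0, s1), (s1, s2), ..."""
    first, second = itertools.tee(iterable)
    next(second, None)  # advance the second copy by one item
    return zip(first, second)


# Made-up names, already sorted; neighbours sharing 8 characters are a split pair.
names = ["A2014175083000", "A2014175091500", "A2014176083000"]
pairs = [(a, b) for a, b in pairwise(names) if a[:8] == b[:8]]
```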


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence





Re: [Tutor] How to list/process files with identical character strings

2014-06-24 Thread mark murphy
Hi Danny, Marc, Peter and Alex,

Thanks for the responses!  Very much appreciated.

I will take these pointers and see what I can pull together.

Thanks again to all of you for taking the time to help!

Cheers,
Mark


On Tue, Jun 24, 2014 at 4:39 PM, Danny Yoo wrote:

> The sorting approach sounds reasonable.  We might even couple it with
> itertools.groupby() to get the consecutive grouping done for us.
>
> https://docs.python.org/2/library/itertools.html#itertools.groupby
>
>
> For example, the following demonstrates that there's a lot that the
> library will do for us that should apply directly to Mark's problem:
>
> #
> import itertools
>
> def firstTwoLetters(s): return s[:2]
>
> grouped = itertools.groupby(
>     sorted(open('/usr/share/dict/words')),
>     key=firstTwoLetters)
>
> for k, g in grouped:
>     print(k, list(g)[:5])
> #
>



-- 
Mark S. Murphy
Alumnus
Department of Geography
msmur...@alumni.unc.edu
951-252-4325


Re: [Tutor] How to list/process files with identical character strings

2014-06-24 Thread Danny Yoo
The sorting approach sounds reasonable.  We might even couple it with
itertools.groupby() to get the consecutive grouping done for us.

https://docs.python.org/2/library/itertools.html#itertools.groupby


For example, the following demonstrates that there's a lot that the
library will do for us that should apply directly to Mark's problem:

#
import itertools

def firstTwoLetters(s): return s[:2]

grouped = itertools.groupby(
    sorted(open('/usr/share/dict/words')),
    key=firstTwoLetters)

for k, g in grouped:
    print(k, list(g)[:5])
#


Re: [Tutor] How to list/process files with identical character strings

2014-06-24 Thread Marc Tompkins
On Tue, Jun 24, 2014 at 1:02 PM, Peter Otten <__pete...@web.de> wrote:

> Sorting is probably the approach that is easiest to understand, but an
> alternative would be to put the files into a dict that maps the 8-char
> prefix to a list of files with that prefix:
>

I was debating the virtues of the two approaches, but figured I'd err on
the side of simplicity...


Re: [Tutor] How to list/process files with identical character strings

2014-06-24 Thread Danny Yoo
Hi Mark,

Part of the problem statement sounds a little unusual to me, so I need
to push on it to confirm.  How do we know that there are only two
files at a time that we need to manage?

The naming convention described in the problem:

---
The naming convention of the files is as follows: TYYYYDDDHHMMSS, where:
T = one character satellite code
YYYY = 4 digit year
DDD = Julian date
HH = 2-digit hour
MM = 2-digit minute
SS = 2-digit second
---

allows for multiple collisions on the key TYYYYDDD.  But without
additional information, having more than two collisions seems a likely
possibility to me!

Is there some other convention in play that prevents >2 collisions
from occurring?  The real world can be a bit dirty, so what happens if
there are more?  Is that an error?
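
One way to find out before processing anything is to count the files per
8-character prefix and list every prefix that occurs more than twice. A
minimal sketch, with a made-up function name and sample names:

```python
import collections


def collision_report(filenames, prefix_len=8):
    """Return {prefix: count} for every prefix shared by more than two files."""
    counts = collections.Counter(name[:prefix_len] for name in filenames)
    return {prefix: n for prefix, n in counts.items() if n > 2}


# Made-up names: three files for day 175, two for day 176.
names = [
    "A2014175083000", "A2014175091500", "A2014175100000",
    "A2014176083000", "A2014176091500",
]
report = collision_report(names)
```

An empty result would suggest that the "pairs only" assumption holds for
the directory at hand.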


Good luck to you!


Re: [Tutor] How to list/process files with identical character strings

2014-06-24 Thread Peter Otten
Peter Otten wrote:

> for fileset in days.values():
>     if len(fileset) > 1:
>         # process only the list with one or more files

That should have been

          # process only the lists with two or more files

>         print("merging", fileset)




Re: [Tutor] How to list/process files with identical character strings

2014-06-24 Thread Peter Otten
mark murphy wrote:

> Hello Python Tutor Community,
> 
> This is my first post and I am just getting started with Python, so I
> apologize in advance for any lack of etiquette.
> 
> I have a directory of several thousand daily satellite images that I need
> to process.  Approximately 300 of these images are split in half, so in
> just these instances there will be two files for one day.  I need to merge
> each pair of split images into one image.
> 
> The naming convention of the files is as follows: TYYYYDDDHHMMSS, where:
> T = one character satellite code
> YYYY = 4 digit year
> DDD = Julian date
> HH = 2-digit hour
> MM = 2-digit minute
> SS = 2-digit second
> 
> What I hope to be able to do is scan the directory, and for each instance
> where there are two files where the first 8 characters (TYYYYDDD) are
> identical, run a process on those two files and place the output (named
> TYYYYDDD) in a new directory.
> 
> The actual processing part should be easy enough for me to figure out. 
> The part about finding the split files (each pair of files with the same
> first 8 characters) and setting those up to be processed is way beyond me.  I've
> done several searches for examples and have not been able to find what I
> am looking for.

Sorting is probably the approach that is easiest to understand, but an 
alternative would be to put the files into a dict that maps the 8-char 
prefix to a list of files with that prefix:

import os

directory = "/some/directory"
files = os.listdir(directory)
days = {}
for filename in files:
    prefix = filename[:8]
    filepath = os.path.join(directory, filename)
    if prefix in days:
        # add file to the existing list
        days[prefix].append(filepath)
    else:
        # add a new list with one file
        days[prefix] = [filepath]

for fileset in days.values():
    if len(fileset) > 1:
        # process only the list with one or more files
        print("merging", fileset)

(The

    if prefix in days:
        days[prefix].append(filepath)
    else:
        days[prefix] = [filepath]

part can be simplified with the dict.setdefault() method or a 
collections.defaultdict)
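
As a rough sketch of the dict.setdefault() variant: setdefault() returns
the list already stored under the prefix, or stores and returns a new
empty list, so the if/else disappears. The function name and the sample
names are made up for the example:

```python
import os


def group_paths(directory, filenames, prefix_len=8):
    """Map each prefix to the list of full paths that start with it."""
    days = {}
    for filename in filenames:
        filepath = os.path.join(directory, filename)
        # fetch the existing list, or insert and fetch a new empty one
        days.setdefault(filename[:prefix_len], []).append(filepath)
    return days


days = group_paths(
    "some/directory",
    ["A2014175083000", "A2014175091500", "A2014176083000"],
)
```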




Re: [Tutor] How to list/process files with identical character strings

2014-06-24 Thread Marc Tompkins
On Tue, Jun 24, 2014 at 8:34 AM, mark murphy wrote:



> What I hope to be able to do is scan the directory, and for each instance
> where there are two files where the first 8 characters (TYYYYDDD) are
> identical, run a process on those two files and place the output (named
> TYYYYDDD) in a new directory.
>
>
I don't know the details of your file system, but I would guess that those
files would have some sort of signifier to indicate "this file is the first
part of a multi-part image"; "this file is the second part", etc. - maybe
the first half has the extension ".001" and the second half has the
extension ".002"?  If so, I would search for files with the "first part"
signifier, and for each one I found I would try to join it with a file with
the same base name but the "second part" signifier.

If, on the other hand, there's no signifier - just the same date but with a
slightly-different timestamp, you can:
1) grab the list of filenames
2) sort it
3) iterate through the list and compare each filename with the previous
filename; if the first 8 characters match, you do your processing magic; if
not, you move on.
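
The three steps above can be sketched roughly as follows. The function
name and the sample names are made up, and the "processing magic" is
reduced to collecting the matching pairs:

```python
def adjacent_pairs(filenames, prefix_len=8):
    """Return (previous, current) pairs of sorted names that share a prefix."""
    pairs = []
    previous = None
    for name in sorted(filenames):  # steps 1 and 2: take the names, sort them
        # step 3: compare each name with the one before it
        if previous is not None and previous[:prefix_len] == name[:prefix_len]:
            pairs.append((previous, name))
        previous = name
    return pairs


found = adjacent_pairs(["A2014176083000", "A2014175091500", "A2014175083000"])
```

Note that if three files ever shared a prefix, this would report two
overlapping pairs rather than one group of three.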


Re: [Tutor] How to list/process files with identical character strings

2014-06-24 Thread Alex Kleider

On 2014-06-24 08:34, mark murphy wrote:

> Hello Python Tutor Community,
>
> The actual processing part should be easy enough for me to figure out.
> The part about finding the split files (each pair of files with the
> same first 8 characters) and setting those up to be processed is way
> beyond me.  I've done several searches for examples and have not been
> able to find what I am looking for.

Since your file system probably already keeps them ordered, each pair 
will be next to each other.
It would seem a simple matter to compare each file name to the one after 
it and if they match, process the two together.




[Tutor] How to list/process files with identical character strings

2014-06-24 Thread mark murphy
Hello Python Tutor Community,

This is my first post and I am just getting started with Python, so I
apologize in advance for any lack of etiquette.

I have a directory of several thousand daily satellite images that I need
to process.  Approximately 300 of these images are split in half, so in
just these instances there will be two files for one day.  I need to merge
each pair of split images into one image.

The naming convention of the files is as follows: TYYYYDDDHHMMSS, where:
T = one character satellite code
YYYY = 4 digit year
DDD = Julian date
HH = 2-digit hour
MM = 2-digit minute
SS = 2-digit second

What I hope to be able to do is scan the directory, and for each instance
where there are two files where the first 8 characters (TYYYYDDD) are
identical, run a process on those two files and place the output (named
TYYYYDDD) in a new directory.
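
As an illustration of the naming convention, a name can be split into its
parts with datetime.strptime. This is only a sketch: it assumes a name is
exactly one satellite character followed by a 4-digit year, a 3-digit day
of the year and HHMMSS, with no extension, and the function name and the
sample name are made up:

```python
import datetime


def parse_name(name):
    """Split a TYYYYDDDHHMMSS-style name into (satellite code, timestamp)."""
    satellite = name[0]
    # %j is the three-digit day of the year (the "Julian date" above)
    stamp = datetime.datetime.strptime(name[1:14], "%Y%j%H%M%S")
    return satellite, stamp


sat, when = parse_name("A2014175083015")
```

The first 8 characters of the name (satellite, year, day of year) are the
part that two halves of a split image have in common.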

The actual processing part should be easy enough for me to figure out.  The
part about finding the split files (each pair of files with the same first
8 characters) and setting those up to be processed is way beyond me.  I've
done several searches for examples and have not been able to find what I am
looking for.

Can anyone help?

Thanks so much!

Mark


-- 
Mark S. Murphy
Alumnus
Department of Geography
msmur...@alumni.unc.edu
951-252-4325