Re: [Tutor] how to extract data only after a certain condition is met

2010-10-10 Thread Josep M. Fontana
Hi,

First let me apologize for taking so long to acknowledge your answers and to
thank you (Eduardo, Peter, Greg, Emile, Joel and Alan, sorry if I left
anyone) for your help and your time.

One of the reasons I took so long in responding (besides having gotten busy
with some urgent matters related to my work) is that I was a bit embarrassed
at realizing how poorly I had defined my problem.
As Alan said, I should at least have told you which operations were giving
me a headache. So I went back to my Python reference books to try to write
some code and thus be able to define my problems more precisely. Only after
I did that, I said to myself, I would come back to the list with more
specific questions.

The only problem is that doing this made me painfully aware of how little
Python I know. Well, actually my problem is not so much that I don't know
Python as that I have very little experience programming in general. Some
years ago I learned a little Perl and basically I used it to do some text
manipulation using regular expressions but that's all my experience. In
order to learn Python, I read a book called Beginning Python: From Novice
to Professional and I was hoping that just by starting to use the knowledge
I had supposedly acquired by reading that book to solve real problems
related to my project I would learn. But this turned out to be much more
difficult than I had expected. Perhaps if I had worked through the excellent
book/tutorial Alan has written (of which I was not aware when I started), I
would be better prepared to confront this problem.

Anyway (sorry for the long intro), since Emile laid out the problem very
clearly, I will use his outline to point out the problems I'm having:

Emile says:
--
Conceptually, you'll need to:

  -a- get the list of file names to change then for each
  -b- determine the new name
  -c- rename the file

For -a- you'll need glob. For -c- use os.rename.  -b- is a bit more
involved.  To break -b- down:

  -b1- break out the x-xx portion of the file name
  -b2- look up the corresponding year in the other file
  -b3- convert the year to the century-half structure
  -b4- put the pieces together to form the new file name

For -b2- I'd suggest building a dictionary from your second files
contents as a first step to facilitate the subsequent lookups.

-

OK. Let's start with -b- . My first problem is that I don't really know how
to go about building a dictionary from the file with the comma separated
values. I've discovered that if I use a file method called 'readlines' I can
create a list whose elements would be each of the lines contained in the
document with all the codes followed by comma followed by the year. Thus if
I do:

fileNameCentury = open(
    r'/Volumes/DATA/Documents/workspace/GCA/CORPUS_TEXT_LATIN_1/FileNamesYears.txt'
).readlines()

Where 'FileNamesYears.txt' is the document with the following info:

A-01, 1278
A-02, 1501
...
N-09, 1384

I get a list of the form ['A-01,1374\rA-02,1499\rA-05,1449\rA-06,1374\rA-09,
...]

Would this be a good first step to creating a dictionary? It seems to me
that I should be able to iterate over this list in some way and make the
substring before the comma the key and the substring after the comma its
value. The problem is that I don't know how. Reading the book I read has not
prepared me for this. I have the feeling that all the pieces of knowledge I
need to solve the problem were there, but I don't know how to put them
together. Greg mentioned the csv module. I checked the references but I
could not see any way in which I could create a dictionary using that
module, either.

Once I have the dictionary built, what I would have to do is use the os
module (or would it be the glob module?) to get a list of the file names I
want to change and build another loop that would iterate over those file
names and, if the first part of the name (possibly represented by a regular
expression of the form r'[A-Z]-[0-9]+') matches one of the keys in the
dictionary, then a) it would get the value for that key, b) would do the
numerical calculation to determine whether it is the first part of the
century or the second part and c) would insert the string representing this
result right before the extension .txt.

In the abstract it sounds easy, but I don't even know how to start.  Doing
some testing with glob I see that it returns a list of strings representing
the whole paths to all the files whose names I want to manipulate. But in
the reference documents that I have consulted, I see no way to change those
names. How do I go about inserting the information about the century right
before the substring '.txt'?
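
The steps described above can be sketched roughly as follows. This is only a sketch under stated assumptions: the label format ('13-2' for the second half of the 13th century), the '_' separator, the sample file name and the helper names century_half and new_name are all hypothetical, since the thread does not fix the exact century-half structure. os.path.splitext separates the '.txt' extension so the label can be inserted right before it.

```python
import os
import re

def century_half(year):
    """Return a hypothetical label such as '13-2' (13th century, 2nd half)."""
    century = (year - 1) // 100 + 1
    half = 1 if (year - 1) % 100 < 50 else 2
    return "%d-%d" % (century, half)

def new_name(path, years):
    """Insert the century-half label right before the extension.

    `years` maps codes like 'A-01' to a year. Returns None when the
    file name does not start with a known code.
    """
    folder, filename = os.path.split(path)
    root, ext = os.path.splitext(filename)      # e.g. 'A-01-example', '.txt'
    match = re.match(r'[A-Z]-[0-9]+', root)
    if match is None or match.group() not in years:
        return None
    label = century_half(int(years[match.group()]))
    return os.path.join(folder, root + "_" + label + ext)

years = {"A-01": "1278", "A-02": "1501"}
print(new_name("A-01-example.txt", years))      # A-01-example_13-2.txt
```

With real files, each path returned by glob.glob() would be passed to new_name(), and os.rename(path, result) would be called whenever the result is not None.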

As you see, I am very green. My embarrassment at realizing how basic my
problems were made me delay writing another message but I decided that if I
don't do it, I will never learn.

Again, thanks so much for all your help.

Josep M.




 Message: 2
 Date: Sat, 2 Oct 2010 17:56:53 +0200
 From: Josep M. Fontana 

Re: [Tutor] how to extract data only after a certain condition is met

2010-10-10 Thread Emile van Sebille

On 10/10/2010 12:35 PM Josep M. Fontana said...
snip

> OK. Let's start with -b- . My first problem is that I don't really know how
> to go about building a dictionary from the file with the comma separated
> values. I've discovered that if I use a file method called 'readlines' I can
> create a list whose elements would be each of the lines contained in the
> document with all the codes followed by comma followed by the year. Thus if
> I do:
>
> fileNameCentury = open(
>     r'/Volumes/DATA/Documents/workspace/GCA/CORPUS_TEXT_LATIN_1/FileNamesYears.txt'
> ).readlines()
>
> Where 'FileNamesYears.txt' is the document with the following info:
>
> A-01, 1278
> A-02, 1501
> ...
> N-09, 1384
>
> I get a list of the form ['A-01,1374\rA-02,1499\rA-05,1449\rA-06,1374\rA-09,
> ...]
>
> Would this be a good first step to creating a dictionary?


Hmmm... It looks like you got a single string -- is that the output from 
read and not readlines?  I also see you're just getting \r which is the 
Mac line terminator.  Are you on a Mac, or was 'FileNamesYears.txt'
created on a Mac?  Python's readlines tries to be smart about which
line terminator to expect, so if there's a mismatch you could have 
issues related to that.  I would have expected you'd get something more 
like: ['A-01,1374\r','A-02,1499\r','A-05,1449\r','A-06,1374\r','A-09, ...]


In any case, as you're getting a single string, you can split a string
into pieces, for example, print '1\r2\r3\r4\r5'.split('\r').  That way
you can force creation of a list of strings following the format
X-NN, each of which can be further split with xxx.split(',').
Note as well that you can assign the results of split to variable names.
For example, ky, val = 'A-01, 1278'.split(',') sets ky to 'A-01' and val
to ' 1278' (note the leading space, which val.strip() removes).  So, you
should be able to create an empty dict, and for each line in your file
set the dict entry for that line.
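
As a rough illustration of the splitting and unpacking described above, here is a minimal sketch written for Python 3. The names raw and years are made up for the example, and the sample string only imitates the shape of the data shown in the question. It uses str.splitlines(), which treats \r, \n and \r\n alike, instead of splitting on '\r' by hand.

```python
raw = "A-01,1374\rA-02,1499\rA-05,1449\r"   # sample data in the shape shown above

years = {}
for line in raw.splitlines():        # splitlines() understands \r, \n and \r\n
    if not line.strip():
        continue                     # skip blank lines
    key, value = line.split(",")
    years[key.strip()] = int(value)  # int() tolerates surrounding spaces
print(years)
```

Converting the year with int() at this point means the later century calculation can work with numbers directly.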


Why don't you start there and show us what you get.

HTH,

Emile

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] how to extract data only after a certain condition is met

2010-10-10 Thread bob gailer

 Emile beat me to it, but here goes anyway...

On 10/10/2010 3:35 PM, Josep M. Fontana wrote:


> OK. Let's start with -b- . My first problem is that I don't really
> know how to go about building a dictionary from the file with the
> comma separated values. I've discovered that if I use a file method
> called 'readlines' I can create a list whose elements would be each of
> the lines contained in the document with all the codes followed by
> comma followed by the year. Thus if I do:
>
> fileNameCentury = open(
>     r'/Volumes/DATA/Documents/workspace/GCA/CORPUS_TEXT_LATIN_1/FileNamesYears.txt'
> ).readlines()
>
> Where 'FileNamesYears.txt' is the document with the following info:
>
> A-01, 1278
> A-02, 1501
> ...
> N-09, 1384
>
> I get a list of the form
> ['A-01,1374\rA-02,1499\rA-05,1449\rA-06,1374\rA-09, ...]




I'm guessing that you are running on a Linux system and that the file 
came from a Mac. This is based on the fact that \r appears in the string 
instead of acting as a line separator.


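Regardless -
dct = {}
# If readlines() returned a one-element list, as the output above suggests,
# split its only string. This gives you
# ['A-01,1374', 'A-02,1499', 'A-05,1449', 'A-06,1374', 'A-09, ...]
fileNameCentury = fileNameCentury[0].split('\r')

for pair in fileNameCentury:
  if not pair: continue   # a trailing \r leaves an empty last item
  key,value = pair.split(',')
  dct[key] = value.strip()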

> Greg mentioned the csv module. I checked the references but I could
> not see any way in which I could create a dictionary using that module.



True - the csv reader is just another way to get the list of pairs.
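
For comparison, a small sketch of the csv route, written for Python 3. The io.StringIO object is only a stand-in for the real file, and the variable names are illustrative. skipinitialspace=True is a documented csv option that drops the space after each comma, so 'A-01, 1278' yields the two fields 'A-01' and '1278'.

```python
import csv
import io

# Stand-in for the real file; in practice pass an open file object instead.
sample = io.StringIO("A-01, 1278\nA-02, 1501\nN-09, 1384\n", newline="")

years = {}
for row in csv.reader(sample, skipinitialspace=True):
    if len(row) != 2:
        continue                 # ignore blank or malformed rows
    code, year = row
    years[code] = int(year)
print(years)
```

The reader still only hands back one list per line; the dictionary is built by the loop around it.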



> Once I have the dictionary built, what I would have to do is use the
> os module (or would it be the glob module?) to get a list of the file
> names I want to change and build another loop that would iterate over
> those file names and, if the first part of the name (possibly
> represented by a regular expression of the form r'[A-Z]-[0-9]+')
> matches one of the keys in the dictionary, then a) it would get the
> value for that key, b) would do the numerical calculation to determine
> whether it is the first part of the century or the second part and c)
> would insert the string representing this result right before the
> extension .txt.
>
> In the abstract it sounds easy, but I don't even know how to start.
> Doing some testing with glob I see that it returns a list of strings
> representing the whole paths to all the files whose names I want to
> manipulate. But in the reference documents that I have consulted, I
> see no way to change those names. How do I go about inserting the
> information about the century right before the substring '.txt'?



Suppose fn = blah.txt
fn2 = f


> As you see, I am very green. My embarrassment at realizing how basic
> my problems were made me delay

[Tutor] how to extract data only after a certain condition is met

2010-10-06 Thread Eduardo Vieira
The other day I was writing a script to extract data from a file from
the line where a text is found to the end of the file. The same
functionality is this sed script:
'1,/regexp/'d
I couldn't wrap my head around this and came up with a solution
using list slicing. But how can I do that? I was experimenting with a
simple list and I came up with this. I wonder if I shouldn't use a
while statement, but how?

a = ['m', 'a', 'r', 'i', 'g', 'o', 'l', 'd']
b = True

for letter in a:
    if letter != 'i' and b:
        continue
    elif letter == 'i':
        b = False
    else:
        print letter

Ok. This works, but I wonder if I shouldn't use a while statement, but how?

Of course this solution is simpler:
extracted = a[a.index('i')+1:]
But I didn't want to build a list in memory with readlines() in the
case of a file.

Thanks for your guidance,

Eduardo


Re: [Tutor] how to extract data only after a certain condition is met

2010-10-06 Thread Peter Otten
Eduardo Vieira wrote:

> The other day I was writing a script to extract data from a file from
> the line where a text is found to the end of the file. The same
> functionality is this sed script:
> '1,/regexp/'d
> I couldn't wrap my head around this and came up with a solution
> using list slicing. But how can I do that? I was experimenting with a
> simple list and I came up with this. I wonder if I shouldn't use a
> while statement, but how?
>
> a = ['m', 'a', 'r', 'i', 'g', 'o', 'l', 'd']
> b = True
>
> for letter in a:
>     if letter != 'i' and b:
>         continue
>     elif letter == 'i':
>         b = False
>     else:
>         print letter
>
> Ok. This works, but I wonder if I shouldn't use a while statement, but
> how?

I would use two for loops:

>>> a = ['m', 'a', 'r', 'i', 'g', 'o', 'l', 'd']
>>> ai = iter(a) # make a list iterator
>>> for letter in ai:
...     if letter == 'i': break
...
>>> for letter in ai:
...     print letter
...
g
o
l
d

Normally a list iterator is created implicitly by writing

for item in some_list:
   ...

but here you have to make one explicitly because you want to reuse it in the 
second loop.
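
The same two-loop idea carries over to files, because a file object is its own iterator and no readlines() call is needed. Below is a minimal sketch written for Python 3; io.StringIO stands in for an open file, and the 'START' marker is invented for the example.

```python
import io

# io.StringIO stands in for an open file; a real file object iterates the same way.
lines = io.StringIO("header\nmore header\nSTART\nfirst\nsecond\n")

for line in lines:
    if "START" in line:      # 'START' is a made-up marker
        break                # stop right after the marker line

# The second loop continues where the first one stopped.
extracted = [line.rstrip("\n") for line in lines]
print(extracted)             # ['first', 'second']
```

The list is built here only to show the result; to keep memory use low, each line can be processed inside the second loop instead.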

Alternatively, the itertools module has the building blocks for this and 
similar problems:

>>> from itertools import dropwhile, islice
>>> def not_an_i(letter):
...     return letter != 'i'
...
>>> for letter in dropwhile(not_an_i, a):
...     print letter
...
i
g
o
l
d

OK, let's shave off the first item in the dropwhile(...) sequence:

>>> for letter in islice(dropwhile(not_an_i, a), 1, None):
...     print letter
...
g
o
l
d
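
To get closer to the sed script from the question, the predicate can be a regular-expression test on each line. The following is a sketch for Python 3 under stated assumptions: the helper name lines_after and the sample text are hypothetical, and io.StringIO stands in for a real file.

```python
import io
import re
from itertools import dropwhile, islice

def lines_after(lines, pattern):
    """Return an iterator over the lines that follow the first match of `pattern`."""
    regex = re.compile(pattern)

    def no_match(line):
        return regex.search(line) is None

    # dropwhile discards lines until one matches; islice then skips that line.
    return islice(dropwhile(no_match, lines), 1, None)

source = io.StringIO("a\nb\nmarker 42\nc\nd\n")
print([line.strip() for line in lines_after(source, r"marker \d+")])   # ['c', 'd']
```

Because both dropwhile and islice are lazy, only one line is held at a time while the file is read.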

Peter



Re: [Tutor] how to extract data only after a certain condition is met

2010-10-06 Thread Knacktus

Am 06.10.2010 18:25, schrieb Eduardo Vieira:

> The other day I was writing a script to extract data from a file from
> the line where a text is found to the end of the file. The same
> functionality is this sed script:
> '1,/regexp/'d
> I couldn't wrap my head around this and came up with a solution
> using list slicing. But how can I do that? I was experimenting with a
> simple list and I came up with this. I wonder if I shouldn't use a
> while statement, but how?
>
> a = ['m', 'a', 'r', 'i', 'g', 'o', 'l', 'd']
> b = True
>
> for letter in a:
>     if letter != 'i' and b:
>         continue
>     elif letter == 'i':
>         b = False
>     else:
>         print letter
>
> Ok. This works, but I wonder if I shouldn't use a while statement, but how?

Why would you want to use a while-loop? You would need to somehow stop
the iteration (by catching some EOF Exception or the like). I think it's
fine to use a for-loop as you have a predefined fixed number of
iterations. I think your approach is OK. Easy to understand. But what if
there's a second 'i' after the first? In your solution all 'i's are
skipped. Also, I would choose clearer names:


letters = ['m', 'a', 'r', 'i', 'g', 'o', 'l', 'd', 'i', 'n', 'i', 'o']
skip_letter = True

for letter in letters:
    if letter == 'i' and skip_letter:
        skip_letter = False
        continue  # if you don't want the first occurrence of 'i'
    if not skip_letter:
        print letter

Cheers,

Jan


Re: [Tutor] how to extract data only after a certain condition is met

2010-10-06 Thread Alan Gauld

Eduardo Vieira eduardo.su...@gmail.com wrote

> The other day I was writing a script to extract data from a file from
> the line where a text is found to the end of the file.


The standard pattern here is to use a sentinel, in pseudo code:

def checkLine(line, start='', end=''):
    if (start in line) or (end in line): return True
    else: return False

startPattern = 'some string (or regex)'
endPattern = 'a concluding string or regex'
sentinel = False
while True:
    read line from file
    if checkLine(line, startPattern, endPattern):
        sentinel = not sentinel    # a marker line switches processing on or off
    elif sentinel:
        processLine(line)

You can simplify or complexify that in many ways, and you can
add a break check to speed it up if you only expect to process
a few lines.

And checkLine can be as simple or as complex as you like.
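
One possible runnable variant of the sentinel pattern, written for Python 3, is sketched below. The function name extract_between, the plain substring tests and the sample data are assumptions made for the example, not part of the pseudo code above. It uses a for loop, so no explicit end-of-file check is needed.

```python
def extract_between(lines, start, end=None):
    """Collect lines after the first one containing `start`, up to `end` if given."""
    collecting = False           # the sentinel flag
    result = []
    for line in lines:
        if not collecting:
            collecting = start in line    # switch on at the start marker
        elif end is not None and end in line:
            break                         # stop at the end marker
        else:
            result.append(line)
    return result

sample = ["intro", "BEGIN", "one", "two", "END", "outro"]
print(extract_between(sample, "BEGIN", "END"))   # ['one', 'two']
print(extract_between(sample, "BEGIN"))          # ['one', 'two', 'END', 'outro']
```

Any iterable of lines works as input, including an open file object.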

HTH,

--
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/




Re: [Tutor] how to extract data only after a certain condition is met

2010-10-06 Thread Emile van Sebille

On 10/6/2010 9:25 AM Eduardo Vieira said...
snip


> Of course this solution is simpler:
> extracted = a[a.index('i')+1:]
> But I didn't want to build a list in memory with readlines() in the
> case of a file.


This is what I do unless the files are _really big_
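
For a file, the read-everything-and-slice approach might look like the sketch below (Python 3). The 'marker' line and the io.StringIO stand-in for a real file are assumptions made for the example.

```python
import io

source = io.StringIO("x\ny\nmarker\nkeep 1\nkeep 2\n")   # stand-in for open(...)
lines = source.read().splitlines()
# list.index raises ValueError if no line equals 'marker'
extracted = lines[lines.index("marker") + 1:]
print(extracted)    # ['keep 1', 'keep 2']
```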

For-me-really-big-is-over-200Mb-ish-ly y'rs,

Emile



Re: [Tutor] how to extract data only after a certain condition is met

2010-10-06 Thread Joel Goldstick
On Wed, Oct 6, 2010 at 2:50 PM, Emile van Sebille em...@fenx.com wrote:

> On 10/6/2010 9:25 AM Eduardo Vieira said...
> snip
>
>> Of course this solution is simpler:
>> extracted = a[a.index('i')+1:]
>> But I didn't want to build a list in memory with readlines() in the
>> case of a file.
>
> This is what I do unless the files are _really big_
>
> For-me-really-big-is-over-200Mb-ish-ly y'rs,
>
> Emile

Why not loop with readline() and then the slice?  That way only one line at
a time is in memory.

-- 
Joel Goldstick


Re: [Tutor] how to extract data only after a certain condition is met

2010-10-06 Thread Emile van Sebille

On 10/6/2010 11:58 AM Joel Goldstick said...

> On Wed, Oct 6, 2010 at 2:50 PM, Emile van Sebille em...@fenx.com wrote:
>
>> On 10/6/2010 9:25 AM Eduardo Vieira said...
>> snip
>>
>>> Of course this solution is simpler:
>>> extracted = a[a.index('i')+1:]
>>> But I didn't want to build a list in memory with readlines() in the
>>> case of a file.
>>
>> This is what I do unless the files are _really big_
>>
>> For-me-really-big-is-over-200Mb-ish-ly y'rs,
>>
>> Emile
>
> Why not loop with readline() and then the slice?  That way only one line at
> a time is in memory.



Because I'd consider that a premature optimization.  I don't commonly 
worry about managing the memory footprint until there's a reason to. 
I've found that you can work to minimize the footprint, but as it's 
often indeterminate, you can't really control it.  So I don't.


Emile
