Re: [Tutor] Tutor Digest, Vol 38, Issue 10
> "Jay Mutter III" <[EMAIL PROTECTED]> wrote
>
>> Whether I attempt to just strip the string or attempt to
>>
>> if line.endswith('No.\r'):
>>     line = line.rstrip()
>>
>> It doesn't work.
>
> Can you try printing the string repr just before the test.
> Or even the last 6 characters:
>
> print repr(line[-6:])
> if line.endswith('No: \n'):
>     line = line.strip()

Alan, using your suggestion with the code above, here is the printout:

jay-mutter-iiis-computer:~/documents/ToBePrinted jlm1$ python test.py
'andal\r'
' No.\r'
' Dor-\r'
' 14;\r'
'315 ;\r'
' No.\r'
'utton\r'
'H'

This appears to me to show 2 lines ending with No. where the LF should be
removed so that the next line ends up on the same line.

Again, thanks for the help/suggestions.

> See if that helps narrow down the cause...
>
>> This is an imac running python 2.3.5 under OS-X 10.4.9
>
> Shouldn't make any odds.
>
> Weird,
>
> Alan G.
___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
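Every repr in the printout above ends in '\r', so a test such as line.endswith('No.') cannot succeed until the line terminator is taken into account. A minimal sketch of that check follows, written for current Python 3 rather than the 2.3.5 used in the thread; the helper name ends_with_marker is made up for illustration.

```python
def ends_with_marker(line, marker="No."):
    """Return True if `line` ends in `marker` once its terminator is removed."""
    # rstrip("\r\n") drops any trailing mix of carriage returns and line feeds
    # and returns a NEW string; the original `line` is left untouched.
    return line.rstrip("\r\n").endswith(marker)


raw = " No.\r"                      # same shape as one repr in the printout
print(repr(raw[-6:]))               # shows the hidden terminator
print(raw.endswith("No."))          # the '\r' gets in the way
print(raw.endswith("No.\r"))        # matches only for this exact terminator
print(ends_with_marker(raw))        # terminator-agnostic test
```

Stripping before testing means the same check works whether the file uses '\n', '\r' or '\r\n'.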
Re: [Tutor] Tutor Digest, Vol 38, Issue 2
> Message: 3
> Date: Sun, 1 Apr 2007 16:42:56 +0100
> From: "Alan Gauld" <[EMAIL PROTECTED]>
> Subject: Re: [Tutor] Tutor Digest, Vol 38, Issue 1
> To: tutor@python.org
> Message-ID: <[EMAIL PROTECTED]>
> Content-Type: text/plain; format=flowed; charset="iso-8859-1";
>     reply-type=original
>
> "Rikard Bosnjakovic" <[EMAIL PROTECTED]> wrote
>
>> >>> s1 = "some line\n"
>> >>> s2 = "some line"
>> >>> s1.endswith("line"), s2.endswith("line")
>> (False, True)
>>
>> Just skip the if and simply rstrip the string.

see below

> Or add \n to the endswith() test string if you really only
> want to strip the newline in those cases
>
> Alan G.
>
> --
>
> Message: 4
> Date: Sun, 1 Apr 2007 16:46:05 +0100
> From: "Alan Gauld" <[EMAIL PROTECTED]>
> Subject: Re: [Tutor] Tutor Digest, Vol 38, Issue 1
> To: tutor@python.org
> Message-ID: <[EMAIL PROTECTED]>
> Content-Type: text/plain; format=flowed; charset="iso-8859-1";
>     reply-type=original
>
> "Jay Mutter III" <[EMAIL PROTECTED]> wrote
>
>> inp = open('test.txt','r')
>> s = inp.readlines()
>> for line in s:
>>     if line.endswith('No.'):
>>         line = line.rstrip()
>>     print line
>
> BTW,
> You do know that you can shorten that considerably?
> With:
>
> for line in open('test.txt'):
>     if line.endswith('No.\n'):
>         line = line.rstrip()
>     print line

Whether I attempt to just strip the string or attempt to

if line.endswith('No.\r'):
    line = line.rstrip()

it doesn't work.
Note - I tried \n, \r and \n\r, although TextWrangler claims that the file
does have unix line endings.
When I used tr to do a few things, \n or \r worked fine.
I tried sed and it didn't work, but from the command line in sed, using
ctrl-v and ctrl-j to insert the line feed, it worked, although I then could
not figure out how to do the same in a script.
It is as if the python interpreter doesn't recognize the escaped n (or r) as
a line feed.
This is an imac running python 2.3.5 under OS-X 10.4.9

Thanks again

> --
> Alan Gauld
> Author of the Learn to Program web site
> http://www.freenetpages.co.uk/hp/alan.gauld
>
> --
>
> Message: 5
> Date: 01 Apr 2007 12:17:00 -0400
> From: "Greg Perry" <[EMAIL PROTECTED]>
> Subject: [Tutor] Communication between classes
> To:
> Message-ID: <[EMAIL PROTECTED]>
>
> Hi again,
>
> I am still in the process of learning OOP concepts and reasons why
> classes should be used instead of functions etc.
>
> One thing that is not apparent to me is the best way for classes to
> communicate with each other. For example, I have created an Args
> class that sets a variety of internal variables (__filename,
> __outputdir etc) by parsing the argv array from the command line.
> What would be the preferred mechanism for returning or passing
> along those variables to another class? Maybe by a function method
> that returns all of those variables?
>
> --
>
> Message: 6
> Date: Sun, 01 Apr 2007 20:46:21 +0200
> From: Andrei <[EMAIL PROTECTED]>
> Subject: Re: [Tutor] Communication between classes
> To: tutor@python.org
> Message-ID: <[EMAIL PROTECTED]>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Hi Greg,
>
> Greg Perry wrote:
>> I am still in the process of learning OOP concepts and
>> reasons why classes should be used instead of functions etc.
>>
>> One thing that is not apparent to me is the best way for
>> classes to communicate with each other. For example,
>
> Good question. Unfortunately there's no general rule that you can
> apply and end up with an undisputably perfect solution.
>
> Classes should communicate on a need-to-know basis. Take for example a
> RSS feed reader application. You may have a class representing a feed
> and a class representing a post. The feed will know what posts it
> contains, but the post probably won't know what feed it comes from.
> The interface would display a list of feeds (without knowing their
> contents), a list of posts within a feed (this needs to know both feed
> and feed contents) and the contents of a single post (knows only about
> an individual post).
>
>> I have created an Args class that sets a variety of internal
>> variables (__filename, __outputd
Re: [Tutor] Tutor Digest, Vol 38, Issue 1
Alan thanks for the response;

> Message: 8
> Date: Sun, 1 Apr 2007 08:54:02 +0100
> From: "Alan Gauld" <[EMAIL PROTECTED]>
> Subject: Re: [Tutor] Another parsing question
> To: tutor@python.org
> Message-ID: <[EMAIL PROTECTED]>
> Content-Type: text/plain; format=flowed; charset="iso-8859-1";
>     reply-type=original
>
> "Jay Mutter III" <[EMAIL PROTECTED]> wrote
>
>> for line in s:
>>     jay = patno.findall(line)
>>     jay2 = "".join(jay[0])
>>     print jay2
>>
>> and it prints fine up until line 111 which is a line that had
>> previously returned [ ] since a number didn't exist on that line and
>> then exits with
>
>> IndexError: list index out of range
>
> Either try/catch the exception or add an
> if not line: continue # or return a default string
>
>> And as long as i am writing, how can I delete a return at the end of
>> a line if the line ends in a certain pattern?
>>
>> For instance, if line ends with the abbreviation No.
>
> if line.endswith(string): line = line.rstrip()

For some reason this never works for me;
i am using an intel imac with OS X 10.4.9 which has python 2.3.5

inp = open('test.txt','r')
s = inp.readlines()
for line in s:
    if line.endswith('No.'):
        line = line.rstrip()
    print line

and it never ever removes the line feed. (These are unix \r according to
Text wrangler)
I am beginning to think that it is a problem with readlines.

But then i thought well why not

inp = open('test.txt','r')
s = inp.readlines()
for line in s:
    if line.endswith('No.'):
        line += s.next()
    print line,

however that doesn't work either which leads me to believe that it is me
and my interpretation of the above.

Thanks

Jay

>> I want to join the current line with next line.
>> Are lists immutable or can they be changed?
>
> lists can be changed, tuples cannot.
>
> HTH,
>
> --
> Alan Gauld
> Author of the Learn to Program web site
> http://www.freenetpages.co.uk/hp/alan.gauld
>
> --
>
> End of Tutor Digest, Vol 38, Issue 1
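The second attempt above calls s.next() on a list, but a list has no such method; only an iterator obtained from it can be advanced by hand. A hedged sketch of that idea in current Python 3 follows (the function name merge_with_next and the sample lines are illustrative, loosely based on the data quoted elsewhere in the thread).

```python
def merge_with_next(lines, marker="No."):
    """Yield lines, gluing any line that ends in `marker` to its successor."""
    it = iter(lines)                       # an explicit iterator can be advanced
    for line in it:
        text = line.rstrip("\r\n")         # test without the terminator
        while text.endswith(marker):
            following = next(it, None)     # None once the input is exhausted
            if following is None:
                break
            text = text + " " + following.rstrip("\r\n")
        yield text


lines = [
    "Automatic display-sign. No.\n",
    "1,330,411; Apr. 13\n",
    "Seat of motorcars.\n",
]
merged = list(merge_with_next(lines))
for entry in merged:
    print(entry)
```

Because the for loop and next() share the same iterator, a line that has been glued onto its predecessor is not visited a second time.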
Re: [Tutor] Another parsing question
Kent;

Again thanks for the help.
i am not sure if this is what you meant but i put

for line in s:
    jay = patno.findall(line)
    jay2 = "".join(jay[0])
    print jay2

and it prints fine up until line 111 which is a line that had previously
returned [ ] since a number didn't exist on that line and then exits with

Traceback (most recent call last):
  File "patentno2.py", line 12, in ?
    jay2 = "".join(jay[0])
IndexError: list index out of range

And as long as i am writing, how can I delete a return at the end of a
line if the line ends in a certain pattern?

For instance, if line ends with the abbreviation No.
I want to join the current line with next line.
Are lists immutable or can they be changed?

Thanks again

jay

On Mar 31, 2007, at 2:27 PM, Kent Johnson wrote:

> Jay Mutter III wrote:
>> I have the following that I am using to extract "numbers' from a file
>> ...
>> which yields the following
>> [('1', '337', '912')]
>
> ...
>> So what do i have above ? A list of tuples?
>
> Yes, each line is a list containing one tuple containing three
> string values.
>
>> How do I send the output to a file?
>
> When you print, the values are automatically converted to strings
> by calling str() on them. When you use p2.write(), this conversion
> is not automatic, you have to do it yourself via
>     p2.write(str(jay))
>
> You can also tell the print statement to output to a file like this:
>     print >>p2, jay
>
>> Is there a way to get the output as
>> 1337912 instead of [('1', '337', '912')] ?
>
> In [4]: jay=[('1', '337', '912')]
>
> jay[0] is the tuple alone:
> In [6]: jay[0]
> Out[6]: ('1', '337', '912')
>
> Join the elements together using an empty string as the separator:
> In [5]: ''.join(jay[0])
> Out[5]: '1337912'
> In [7]:
>
> Kent
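The IndexError above comes from indexing jay[0] on a line where findall() found nothing and returned an empty list. One way to combine Kent's join with a guard for that case is sketched below in current Python 3; the function name patent_numbers and the sample lines are assumptions for illustration, while the pattern is the one from the thread.

```python
import io
import re

# Pattern from the thread: 1 digit, 3 digits, 3 digits, separated by
# whitespace, commas or periods.
patno = re.compile(r'(\d{1})[\s,.]+(\d{3})[\s,.]+(\d{3})')


def patent_numbers(lines):
    """Return the joined digits of the first match on each line that has one."""
    numbers = []
    for line in lines:
        matches = patno.findall(line)      # [] when the line has no number
        if not matches:
            continue                       # skip instead of indexing matches[0]
        numbers.append("".join(matches[0]))
    return numbers


sample_lines = [
    "Automatic display-sign. No. 1,337,912; Apr. 13\n",
    "Barnett, Otto R. (See Scott, John M., assignor.)\n",
    "Seat of motorcars. No. 1.353,708; Sept. 21\n",
]
numbers = patent_numbers(sample_lines)

# write() accepts only strings, so each number is written with its own newline.
out = io.StringIO()                        # stands in for the output file
for number in numbers:
    out.write(number + "\n")
print(out.getvalue(), end="")
```

The test is on the list returned by findall(), not on the line itself, since a non-empty line can still contain no number.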
Re: [Tutor] Another parsing question
Ok after a minute of thought I did solve my second question by simply
changing my RE to

(r'(\d{1}[\s,.]+\d{3}[\s,.]+\d{3})')

but still haven't gotten the first one.

On Mar 31, 2007, at 1:39 PM, Jay Mutter III wrote:

> I have the following that I am using to extract "numbers' from a file
>
> prompt1 = raw_input('What is the file from which you would like a
> list of patent numbers? ')
> p1 = open(prompt1,'rU')
> s = p1.readlines()
> prompt2 = raw_input('What is the name of the file to which you
> would like to save the list of patent numbers? ')
> p2 = open(prompt2,'aU')
> patno = re.compile(r'(\d{1})[\s,.]+(\d{3})[\s,.]+(\d{3})')
> for line in s:
>     jay = patno.findall(line)
>     print jay
>
> which yields the following
>
> [('1', '337', '912')]
> [('1', '354', '756')]
> [('1', '360', '297')]
> [('1', '328', '232')]
> [('1', '330', '123')]
> [('1', '362', '944')]
> [('1', '350', '461')]
> [('1', '355', '991')]
> [('1', '349', '385')]
> [('1', '350', '521')]
> [('1', '336', '542')]
> [('1', '354', '922')]
> [('1', '338', '268')]
> [('1', '353', '682')]
> [('1', '343', '241')]
> [('1', '359', '852')]
> [('1', '342', '483')]
> [('1', '347', '068')]
> [('1', '331', '450')]
>
> if i try to write to a file instead of print to the screen using
> p2.write(jay)
> i get the message
>
> Traceback (most recent call last):
>   File "patentno.py", line 12, in ?
>     p2.write(jay)
> TypeError: argument 1 must be string or read-only character buffer,
> not list
>
> I f I try writelines i get
>
> Traceback (most recent call last):
>   File "patentno.py", line 12, in ?
>     p2.writelines(jay)
> TypeError: writelines() argument must be a sequence of strings
> jay-mutter-iiis-computer:~/documents/programming/python/patents jlm1$
>
> So what do i have above ? A list of tuples?
>
> How do I send the output to a file?
> Is there a way to get the output as
>
> 1337912 instead of [('1', '337', '912')] ?
>
> And as always thanks in advance for the help.
>
> jay Mutter
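The change of RE above alters what findall() returns: with three groups it yields tuples of the captured pieces, with one enclosing group it yields the whole matched text, separators included. A small comparison in current Python 3 follows (the sample line is made up from numbers quoted in the thread).

```python
import re

line = "No. 1,337,912; Apr. 13"

# Three groups: each match is a tuple of the captured digit runs.
grouped = re.findall(r'(\d{1})[\s,.]+(\d{3})[\s,.]+(\d{3})', line)

# One group around everything: each match is a single string that still
# contains the commas, periods or spaces between the digit runs.
single = re.findall(r'(\d{1}[\s,.]+\d{3}[\s,.]+\d{3})', line)

# If bare digits are wanted from the single-group form, the separators
# still have to be removed afterwards.
digits_only = [re.sub(r'[\s,.]+', '', match) for match in single]

print(grouped)
print(single)
print(digits_only)
```

Either form can give the joined number; the difference is only whether the separators are dropped by the grouping or by a later substitution.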
[Tutor] Another parsing question
I have the following that I am using to extract "numbers" from a file

prompt1 = raw_input('What is the file from which you would like a list of patent numbers? ')
p1 = open(prompt1,'rU')
s = p1.readlines()
prompt2 = raw_input('What is the name of the file to which you would like to save the list of patent numbers? ')
p2 = open(prompt2,'aU')
patno = re.compile(r'(\d{1})[\s,.]+(\d{3})[\s,.]+(\d{3})')
for line in s:
    jay = patno.findall(line)
    print jay

which yields the following

[('1', '337', '912')]
[('1', '354', '756')]
[('1', '360', '297')]
[('1', '328', '232')]
[('1', '330', '123')]
[('1', '362', '944')]
[('1', '350', '461')]
[('1', '355', '991')]
[('1', '349', '385')]
[('1', '350', '521')]
[('1', '336', '542')]
[('1', '354', '922')]
[('1', '338', '268')]
[('1', '353', '682')]
[('1', '343', '241')]
[('1', '359', '852')]
[('1', '342', '483')]
[('1', '347', '068')]
[('1', '331', '450')]

if i try to write to a file instead of print to the screen using
p2.write(jay)
i get the message

Traceback (most recent call last):
  File "patentno.py", line 12, in ?
    p2.write(jay)
TypeError: argument 1 must be string or read-only character buffer, not list

If I try writelines i get

Traceback (most recent call last):
  File "patentno.py", line 12, in ?
    p2.writelines(jay)
TypeError: writelines() argument must be a sequence of strings
jay-mutter-iiis-computer:~/documents/programming/python/patents jlm1$

So what do i have above ? A list of tuples?

How do I send the output to a file?
Is there a way to get the output as

1337912 instead of [('1', '337', '912')] ?

And as always thanks in advance for the help.

jay Mutter
Re: [Tutor] Tutor Digest, Vol 37, Issue 63
> Message: 1
> Date: Sat, 24 Mar 2007 16:41:22 -0700 (PDT)
> From: Jaggo <[EMAIL PROTECTED]>
> Subject: Re: [Tutor] Tutor Digest, Vol 37, Issue 62
> To: tutor@python.org
> Message-ID: <[EMAIL PROTECTED]>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Message: 2
> Date: Sat, 24 Mar 2007 19:25:10 -0400
> From: Jay Mutter III
> Subject: [Tutor] parsing text
> [...]
> 1.) when i do readlines and create a list and then print the list it
> adds a blank line between every line of text
> [...]
> ideas?
>
> Thanks again
>
> jay
>
> Well,
> regarding your first question:
> "print string" automatically breaks a line at the end of string.
> Use "print string," instead [note that trailin' , .]

yes, thank you for that

> [I'm not sure about your n. 2, that's why no answer is included.
>
> --
>
> Message: 2
> Date: Sun, 25 Mar 2007 00:00:29 -
> From: "Alan Gauld" <[EMAIL PROTECTED]>
> Subject: Re: [Tutor] parsing text
> To: tutor@python.org
> Message-ID: <[EMAIL PROTECTED]>
> Content-Type: text/plain; format=flowed; charset="iso-8859-1";
>     reply-type=original
>
> "Jay Mutter III" <[EMAIL PROTECTED]> wrote
>
>> i have the following text:
>>
>> Barnett, John B., assignor of one-half to R. N. Tutt, Kansas City,
>> Mo.Automatic display-sign.No. 1,330 411-Apr. 13 ; v. 273 ;
>> p.
>> 193.
>> Barnett, John II.. Tettenhall, England. Seat of
>> motorcars.No. 1.353,708; Sept. 21 ; v. 278; p. 487. Barnett,
>> Otto
>> R.(See Scott, John M., assignor.)
>>
>> 1.) when i do readlines and create a list and then print the list it
>> adds a blank line between every line of text
>
> I suspect that's because you are reading a newline character
> from the file and print adds a newline of its own. You need to
> use rstrip() to take out the newline from the file.
>
>> 2.)in the second line after p.487 there is the beginning of a new
>> line of data only it isn't on a newline.
>
> I'm not quite sure what you mean here.
> It would be helpful if you can show us the problematic output
> as well as the input. Also to send us the actual code fragments
> that are causing the damage.

Yes, after i received the reply i realized that i was not very clear.
i have a text file of inventors which should have one inventor on each
line in alphabetical order, but of course the lines do not break at the
end of 'p. xxx.' (where p.xxx is the relevant page number).
I read the data in as a string figuring that I could then replace p. xxx
with a carriage return, somehow write the data out to a text file, and the
problem would be solved. Not quite so simple given my limited skill set.
The following is what I put in (interactively) and what I got out.

>>> ss = open('inp.txt')
>>> s = ss.read()
>>> s.replace('p. ','\n')
'Barnett, John B., assignor of one-half to R. N. Tutt, Kansas City, Mo. Automatic display-sign.\xc2\xa0 \xc2\xa0 No. 1,330 411-Apr. 13 ; v. 273 ;\xc2\xa0\n\n193. Barnett,\xc2\xa0 John\xc2\xa0 II..\xc2\xa0 Tettenhall,\xc2\xa0 England. \xc2\xa0 \xc2\xa0 Seat\xc2\xa0 of \nmotorcars.\xc2\xa0 \xc2\xa0 No. 1.353,708; Sept. 21 ; v. 278; \n487. Barnett,\xc2\xa0\nOtto R.\xc2\xa0 \xc2\xa0 (See Scott, John M., assignor.)'
>>>

I thought about treating it as a list of lines, stripping carriage returns
on the basis of some criteria, but i have never gotten rstrip to work

>> i tried string.replace(s,'p.','\n') in an attempt to put a CR in but
>> it just put the characters\n in the string.
>
> Dont use the string module functions. Use the string methods,
> so it becomes:
>
> s.replace('p.', '\n')
>
> However that doesn't explain why you are getting the literal
> characters! Can you send us the actual code you are using?
> And the output showing the error?
>
> HTH,
>
> Alan G.
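In the interactive session above, the '\n' shown in the output is just how repr() displays a real newline, and the result of s.replace() is displayed but never assigned, so s itself is unchanged. The \xc2\xa0 pairs are the UTF-8 bytes of a non-breaking space. A hedged sketch of the intended clean-up in current Python 3 follows; the function name split_records and the rule "break after 'p. <digits>.'" are assumptions drawn from the description above, and the sample is an abbreviated version of the quoted data.

```python
import re


def split_records(text):
    """Return one entry per list item, breaking after each 'p. <digits>.'"""
    text = text.replace("\u00a0", " ")     # non-breaking spaces -> plain spaces
    text = " ".join(text.split())          # collapse newlines and space runs
    # Put a newline after every page reference such as "p. 193."
    text = re.sub(r"(\bp\.\s*\d+\.)\s*", r"\1\n", text)
    return [record for record in text.split("\n") if record]


sample = (
    "Barnett, John B., assignor of one-half to R. N. Tutt, Kansas City, Mo. "
    "Automatic display-sign. No. 1,330 411-Apr. 13 ; v. 273 ;\u00a0p.\n193. "
    "Barnett, John II.. Tettenhall, England. Seat of\nmotorcars. "
    "No. 1.353,708; Sept. 21 ; v. 278; p. 487. Barnett,\nOtto R. "
    "(See Scott, John M., assignor.)"
)
records = split_records(sample)
for record in records:
    print(record)
```

The key point is that the cleaned text is rebound to a name at each step, since string methods and re.sub() return new strings rather than modifying the original.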
[Tutor] parsing text
Kent thanks for this as I was clearly confused with regards to string and
list of strings.
I am, however, still having difficulty with how to solve a problem
involving a related issue.
i have the following text:

Barnett, John B., assignor of one-half to R. N. Tutt, Kansas City,
Mo.Automatic display-sign.No. 1,330 411-Apr. 13 ; v. 273 ; p. 193.
Barnett, John II.. Tettenhall, England. Seat of
motorcars.No. 1.353,708; Sept. 21 ; v. 278; p. 487. Barnett, Otto
R.(See Scott, John M., assignor.)
Barnett. Otto R. (See Sponenburg, Hiram H., assignor)
Barnett, William A., Lincoln. Nebr.Attachment for garment-
turning machines. No. 1,342,937; June 8 ? v 270 ; p. 313."
Barnhart, Clarence D., Brooklyn, assignor to W. S. Rockwell Company,
New York. N. Y.Conveyer for furnaces No. 1.333.371 ; Mar. 9 ; v. 272 ;
p. 278.
Barnhart, Clarence v., Waynesboro, Pa., assignor to J. K. Hoffman and
W. M. Raeclitel. Hagerstowu, Md. Seed-planter.No. 1,357.43S: Nov. 2;
v. 280: p. 45.
Barnhart, John E.(See Haves, J. P.. and Barnhart )
Barnhart,-Mollie E.(See Freeman. Alpheus J., assignor)
Barnhill, E. B., and J. Stone, Indianapolis, Ind.Auto-tire 477513

1.) when i do readlines and create a list and then print the list it adds
a blank line between every line of text
2.)in the second line after p.487 there is the beginning of a new line of
data only it isn't on a newline.

i tried string.replace(s,'p.','\n') in an attempt to put a CR in but it
just put the characters\n in the string.

ideas?

Thanks again

jay

Jay Mutter III wrote:
> Thanks for the response
> Actually the number of lines this returns is the same number of lines
> given when i put it in a text editor (TextWrangler).
> Luke had mentioned the same thing earlier but when I do change read to
> readlines i get the following
>
>
> Traceback (most recent call last):
>   File "extract_companies.py", line 17, in ?
>     count = len(text.splitlines())
> AttributeError: 'list' object has no attribute 'splitlines'

I think maybe you are confused about the difference between "all the text
of a file in a single string" and "all the lines of a file in a list of
strings."

When you open() a file and read() the contents, you get all the text of a
file in a single string. len() will give you the length of the string (the
total file size) and iterating over the string gives you one character at
a time.

Here is an example of a string:

In [1]: s = 'This is text'
In [2]: len(s)
Out[2]: 12
In [3]: for i in s:
   ...:     print i
   ...:
   ...:
T
h
i
s

i
s

t
e
x
t

On the other hand, if you open() the file and then readlines() from the
file, the result is a list of strings, each of which is the contents of
one line of the file, up to and including the newline. len() of the list
is the number of lines in the list, and iterating the list gives each line
in turn.

Here is an example of a list of strings:

In [4]: l = [ 'line1', 'line2' ]
In [5]: len(l)
Out[5]: 2
In [6]: for i in l:
   ...:     print i
   ...:
   ...:
line1
line2

Notice that s and l are *used* exactly the same way with len() and for,
but the results are different.

As a further wrinkle, there are two easy ways to get all the lines in a
file and they give slightly different results.

open(...).readlines() returns a list of lines in the file and each line
includes the final newline if it was in the file. (The last line will not
include a newline if the last line of the file did not.)

open(...).read().splitlines() also gives a list of lines in the file, but
the newlines are not included.

HTH,
Kent
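Kent's three ways of reading a file can be compared side by side. The sketch below uses current Python 3 and io.StringIO as a stand-in for an opened file, so the content string is an assumption rather than data from the thread.

```python
import io

content = "line1\nline2\nline3"            # note: no newline after the last line

whole = io.StringIO(content).read()        # one string holding everything
kept = io.StringIO(content).readlines()    # list of lines, terminators kept
stripped = whole.splitlines()              # list of lines, terminators dropped

print(len(whole), repr(whole))
print(len(kept), kept)
print(len(stripped), stripped)

# Iterating the string walks characters; iterating either list walks lines.
first_chars = [ch for ch in whole][:3]
print(first_chars)
```

This also shows why the AttributeError appeared: splitlines() is a string method, so it applies to the result of read() but not to the list returned by readlines().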
[Tutor] Parsing text file with Python
Script i have to date is below and
Thanks to your help i can see some daylight but I still have a few questions

1.) Are there better ways to write this?
2.) As it writes out the one group to the new file for companies it is as
if it leaves blank lines behind, for if I don't have the
elif len(line) > 1 the inventor's file has blank lines in it.
3.) I reopened the inventor's file to get a count of lines but is there a
better way to do this?

Thanks

in_filename = raw_input('What is the COMPLETE name of the file you would like to process?')
in_file = open(in_filename, 'rU')
text = in_file.readlines()
count = len(text)
print "There are ", count, 'lines to process in this file'
out_filename1 = raw_input('What is the COMPLETE name of the file in which you would like to save Companies?')
companies = open(out_filename1, 'aU')
out_filename2 = raw_input('What is the COMPLETE name of the file in which you would like to save Inventors?')
patentdata = open(out_filename2, 'aU')
for line in text:
    if line.endswith(')\n'):
        companies.write(line)
    elif line.endswith(') \n'):
        companies.write(line)
    elif len(line) > 1:
        patentdata.write(line)
in_file.close()
companies.close()
patentdata.close()
in_filename2 = raw_input('What was the name of the inventor\'s file ?')
in_file2 = open(in_filename2, 'rU')
text2 = in_file2.readlines()
count = len(text2)
print "There are - well until we clean up more - approximately ", count, 'inventor\s in this file'
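On the three questions above, one possible tidy-up is sketched here in current Python 3. It assumes the blank lines are already present in the input rather than created by the script, and the function name split_entries is made up for illustration; the sample lines are taken from data quoted in the thread.

```python
def split_entries(lines):
    """Sort stripped, non-blank lines into (companies, inventors)."""
    companies, inventors = [], []
    for line in lines:
        entry = line.rstrip()              # drops '\n' and any trailing spaces
        if not entry:
            continue                       # blank input line: write nothing
        if entry.endswith(")"):            # covers both ')\n' and ') \n'
            companies.append(entry)
        else:
            inventors.append(entry)
    return companies, inventors


sample = [
    "A.-C. Manufacturing Company. (See Sebastian, A. A., and Capes, assignors.)\n",
    "\n",
    "Aaron, Solomon E., Boston, Mass. Pliers. No. 1,329,155 ; Jan. 27 ; v. 270 ; p. 554.\n",
    "A-N Company, The. (See Alexander and Nasb, assignors.) \n",
]
companies, inventors = split_entries(sample)
print(len(companies), "company lines")
print(len(inventors), "inventor lines")
```

Keeping the two groups in lists means the counts are available from len() without reopening the output file; each list can then be written out with one entry per line.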
Re: [Tutor] Why is it...
Got it - it needs the blank line to signal that code block has ended.
Thanks

On Mar 22, 2007, at 3:05 PM, Jason Massey wrote:

In the interpreter this doesn't work:

>>> f = open(r"c:\python24\image.dat")
>>> line = f.readline()
>>> while line:
...     line = f.readline()
... f.close()
Traceback (
  File "", line 3
    f.close()
    ^
SyntaxError: invalid syntax

But this does:

>>> f = open(r"c:\python24\image.dat")
>>> line = f.readline()
>>> while line:
...     line = f.readline()
...
>>> f.close()
>>>

Note the differing placement of the f.close() statement, it's not part of
the while.

On 3/22/07, Kent Johnson <[EMAIL PROTECTED]> wrote:

Jay Mutter III wrote:
> Why is it that when I run the following interactively
>
> f = open('Patents-1920.txt')
> line = f.readline()
> while line:
>     print line,
>     line = f.readline()
> f.close()
>
> I get an error message
>
> File "", line 4
>     f.close()
>     ^
> SyntaxError: invalid syntax
>
> but if i run it in a script there is no error?

Can you copy/paste the actual console transcript?

BTW a better way to write this is

f = open(...)
for line in f:
    print line,
f.close()

Kent

>
> Thanks
>
> Jay
Re: [Tutor] Another string question
Andre;

Thanks again for the assistance.
I have corrected the splitlines error and it works (well, that part of it
anyway) correctly now.

On Mar 23, 2007, at 5:30 AM, Andre Engels wrote:

2007/3/22, Jay Mutter III <[EMAIL PROTECTED]>:

I wanted the following to check each line and if it ends in a right
parentheses then write the entire line to one file and if not then write
the line to another.
It wrote all of the ) to one file and the rest of the line (ie minus the )
to the other file.

The line:
print "There are ", count, 'lines to process in this file'
should give you a hint - don't you think this number was rather high?

The problem is that if you do "for line in text" with text being a string,
it will not loop over the _lines_ in the string, but over the _characters_
in the string.

The easiest solution would be to replace
text = in_file.read()
by
text = in_file.readlines()

in_filename = raw_input('What is the COMPLETE name of the file you would like to process?')
in_file = open(in_filename, 'rU')
text = in_file.read()
count = len(text.splitlines())
print "There are ", count, 'lines to process in this file'
out_filename1 = raw_input('What is the COMPLETE name of the file in which you would like to save Companies?')
companies = open(out_filename1, 'aU')
out_filename2 = raw_input('What is the COMPLETE name of the file in which you would like to save Inventors?')
patentdata = open(out_filename2, 'aU')
for line in text:
    if line[-1] in ')':
        companies.write(line)
    else:
        patentdata.write(line)
in_file.close()
companies.close()
patentdata.close()

Thanks

jay

--
Andre Engels, [EMAIL PROTECTED]
ICQ: 6260644 -- Skype: a_engels
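Andre's diagnosis can be reproduced in a few lines. The sketch below, in current Python 3, runs the same routing logic once over a string and once over a list of lines; the function name route is hypothetical and the two sample entries are adapted from data quoted in the thread.

```python
def route(items):
    """Split `items` into those ending in ')' and everything else."""
    companies, others = [], []
    for item in items:
        if item.rstrip("\n").endswith(")"):
            companies.append(item)
        else:
            others.append(item)
    return companies, others


text = ("Barnett, Otto R. (See Scott, John M., assignor.)\n"
        "Barnett, William A., Lincoln. Nebr.\n")

by_char = route(text)                      # a string yields single characters
by_line = route(text.splitlines(True))     # keepends=True, like readlines()

print(by_char[0])                          # only the ')' characters
print(by_line[0])                          # the whole company line
```

Run over the string, only the lone ')' lands in the first group, which matches the symptom described: all of the ) in one file and the rest of the text in the other.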
Re: [Tutor] Should I use python for parsing text?
First thanks for all of the help
I am actually starting to see the light.

On Mar 22, 2007, at 7:51 AM, Kent Johnson wrote:

> Jay Mutter III wrote:
>> Kent;
>> Thanks for the reply on tutor-python.
>> My data file which is just a .txt file created under WinXP by an
>> OCR program contains lines like:
>> A.-C. Manufacturing Company. (See Sebastian, A. A.,
>> and Capes, assignors.)
>> A. G. A. Railway Light & Signal Co. (See Meden, Elof
>> H„ assignor.)
>> A-N Company, The. (See Alexander and Nasb, as-
>> signors.;
>> AN Company, The. (See Nash, It. J., and Alexander, as-
>> signors.)
>> I use an intel imac running OS x10.4.9 and when I used python to
>> append one file to another I got a file that opened in OS X's
>> TexEdit program with characters that looked liked Japanese/Chinese
>> characters.
>> When i pasted them into my mail client (OS X's mail) they were
>> then just a sequence of question marks so I am not sure what
>> happened.
>> Any thoughts???
>
> For some reason, after you run the Python program, TexEdit thinks
> the file is not ascii data; it seems to think it is utf-8 or a
> Chinese encoding. Your original email was utf-8 which points in
> that direction but is not conclusive.
>
> If you zip up and send me the original file and the cleandata.txt
> file *exactly as it is produced* by the Python program - not edited
> in any way - I will take a look and see if I can guess what is
> going on.

You are correct that it was utf-8
Multiple people were scanning pages and converting to text, some saved as
ascii and some saved as unicode
The sample used above was utf-8 so after your comment i checked all, put
everything as ascii, combined all pieces into one file and normalized the
line endings to unix style

>> And i tried using the following on the above data:
>> in_filename = raw_input('What is the COMPLETE name of the file you
>> want to open:')
>> in_file = open(in_filename, 'r')
>
> It wouldn't hurt to use universal newlines here since you are
> working cross-platform:
> open(in_filename, 'Ur')

corrected this

>> text = in_file.readlines()
>> num_lines = text.count('\n')
>
> Here 'text' is a list of lines, so text.count('\n') is counting the
> number of blank lines (lines containing only a newline) in your
> file. You should use
> num_lines = len(text)

changed

>> print 'There are', num_lines, 'lines in the file', in_filename
>> output = open("cleandata.txt","a")# file for writing data to
>> after stripping newline character
>
> I agree with Luke, use 'w' for now to make sure the file has only
> the output of this program. Maybe something already in the file is
> making it look like utf-8...
>
>> # read file, copying each line to new file
>> for line in text:
>>     if len(line) > 1 and line[-2] in ';,-':
>>         line = line.rstrip()
>>         output.write(line)
>>     else: output.write(line)
>> print "Data written to cleandata.txt."
>> # close the files
>> in_file.close()
>> output.close()
>> As written above it tells me that there are 0 lines which is
>> surprising because if I run the first part by itself it tells
>> there are 1982 lines ( actually 1983 so i am figuring EOF)
>> It copies/writes the data to the cleandata file but it does not
>> strip out CR and put data on one line ( a sample of what i am
>> trying to get is next)
>> A.-C. Manufacturing Company. (See Sebastian, A. A., and Capes,
>> assignors.)
>> My apologies if i have intruded.
>
> Please reply on-list in the future.
>
> Kent
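The quoted loop tests line[-2] because line[-1] is the newline that readlines() keeps. The same idea can be written by stripping the terminator first, as in this hedged Python 3 sketch; the function name join_wrapped and the spacing rule (a space after ',' or ';', none after '-') are assumptions chosen to match the sample output wanted in the thread.

```python
def join_wrapped(lines, continuations=";,-"):
    """Join any line ending in one of `continuations` with the next line."""
    pieces = []
    for line in lines:
        body = line.rstrip("\r\n")         # look at the text, not the terminator
        if body and body[-1] in continuations:
            # Assumed spacing rule: keep a hyphen tight, otherwise add a space.
            pieces.append(body if body.endswith("-") else body + " ")
        else:
            pieces.append(body + "\n")     # complete entry: end the line here
    return "".join(pieces)


lines = [
    "A.-C. Manufacturing Company. (See Sebastian, A. A.,\n",
    "and Capes, assignors.)\n",
    "AN Company, The. (See Nash, It. J., and Alexander, as-\n",
    "signors.)\n",
]
cleaned = join_wrapped(lines)
print(cleaned, end="")
```

Whether a trailing hyphen should be kept (as in "garment-turning") or dropped (as in "as-signors") cannot be decided from the character alone, so this sketch leaves it in place.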
Re: [Tutor] Another string question
On Mar 23, 2007, at 5:30 AM, Andre Engels wrote:

2007/3/22, Jay Mutter III <[EMAIL PROTECTED]>:

I wanted the following to check each line and if it ends in a right
parentheses then write the entire line to one file and if not then write
the line to another.
It wrote all of the ) to one file and the rest of the line (ie minus the )
to the other file.

The line:
print "There are ", count, 'lines to process in this file'
should give you a hint - don't you think this number was rather high?

The problem is that if you do "for line in text" with text being a string,
it will not loop over the _lines_ in the string, but over the _characters_
in the string.

The easiest solution would be to replace
text = in_file.read()
by
text = in_file.readlines()

Thanks for the response
Actually the number of lines this returns is the same number of lines
given when i put it in a text editor (TextWrangler).
Luke had mentioned the same thing earlier but when I do change read to
readlines i get the following

Traceback (most recent call last):
  File "extract_companies.py", line 17, in ?
    count = len(text.splitlines())
AttributeError: 'list' object has no attribute 'splitlines'

in_filename = raw_input('What is the COMPLETE name of the file you would like to process?')
in_file = open(in_filename, 'rU')
text = in_file.read()
count = len(text.splitlines())
print "There are ", count, 'lines to process in this file'
out_filename1 = raw_input('What is the COMPLETE name of the file in which you would like to save Companies?')
companies = open(out_filename1, 'aU')
out_filename2 = raw_input('What is the COMPLETE name of the file in which you would like to save Inventors?')
patentdata = open(out_filename2, 'aU')
for line in text:
    if line[-1] in ')':
        companies.write(line)
    else:
        patentdata.write(line)
in_file.close()
companies.close()
patentdata.close()

Thanks

jay

--
Andre Engels, [EMAIL PROTECTED]
ICQ: 6260644 -- Skype: a_engels
[Tutor] Another string question
I wanted the following to check each line and if it ends in a right
parentheses then write the entire line to one file and if not then write
the line to another.
It wrote all of the ) to one file and the rest of the line (ie minus the )
to the other file.

in_filename = raw_input('What is the COMPLETE name of the file you would like to process?')
in_file = open(in_filename, 'rU')
text = in_file.read()
count = len(text.splitlines())
print "There are ", count, 'lines to process in this file'
out_filename1 = raw_input('What is the COMPLETE name of the file in which you would like to save Companies?')
companies = open(out_filename1, 'aU')
out_filename2 = raw_input('What is the COMPLETE name of the file in which you would like to save Inventors?')
patentdata = open(out_filename2, 'aU')
for line in text:
    if line[-1] in ')':
        companies.write(line)
    else:
        patentdata.write(line)
in_file.close()
companies.close()
patentdata.close()

Thanks

jay
[Tutor] Why is it...
Why is it that when I run the following interactively

f = open('Patents-1920.txt')
line = f.readline()
while line:
    print line,
    line = f.readline()
f.close()

I get an error message

  File "<stdin>", line 4
    f.close()
    ^
SyntaxError: invalid syntax

but if I run it in a script there is no error?

Thanks
Jay
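One likely cause, hedged because it depends on how the lines were typed: at the interactive prompt a compound statement such as `while` has to be closed with an empty line before the next statement is entered, whereas a script is compiled as a whole. The sketch below imitates the two cases with the built-in compile(), using its 'single' mode as a stand-in for the prompt and 'exec' for a script. The source string is a cut-down, hypothetical version of the loop above, and the helper name is made up.

```python
# A loop followed directly by another statement, with no blank line between.
source = (
    "while False:\n"
    "    pass\n"
    "done = True\n"
)


def compiles(src, mode):
    """Return True if src compiles in the given mode, False on SyntaxError."""
    try:
        compile(src, "<example>", mode)
        return True
    except SyntaxError:
        return False


as_script = compiles(source, "exec")      # whole file compiled at once
at_prompt = compiles(source, "single")    # one interactive statement at a time
```

At the prompt, pressing Return on an empty line after the loop body, before typing f.close(), should avoid the error.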
[Tutor] Should I use python for parsing text
"Jay Mutter III" wrote

> See example next:
> A.-C. Manufacturing Company. (See Sebastian, A. A.,
> and Capes, assignors.)
> ...
> Aaron, Solomon E., Boston, Mass. Pliers. No. 1,329,155 ;
> Jan. 27 ; v. 270 ; p. 554.
>
> For instance, I would like to go to end of line and if last
> character is a comma or semicolon or hyphen then
> remove the CR.

It would look something like:

output = open('example.fixed','w')
for line in file('example.txt'):
    if line[-1] in ',;-':      # check last character
        line = line.strip()    # lose the C/R
        output.write(line)     # write to output
    else:
        output.write(line)     # append the next line complete with C/R
output.close()

Working from the above suggestion (and thank you very much, I did enjoy your online tutorial) I came up with the following:

import os
import sys
import re
import string

# The next 5 lines are so I have an idea of how many lines I started with in the file.
in_filename = raw_input('What is the COMPLETE name of the file you want to open:')
in_file = open(in_filename, 'r')
text = in_file.read()
num_lines = text.count('\n')
print 'There are', num_lines, 'lines in the file', in_filename

output = open("cleandata.txt","a")  # file for writing data to after stripping newline character

# read file, copying each line to new file
for line in text:
    if line[:-1] in '-':
        line = line.rstrip()
        output.write(line)
    else:
        output.write(line)

print "Data written to cleandata.txt."

# close the files
in_file.close()
output.close()

The above ran with no errors and gave me the number of lines in my original file, but when I opened the cleandata.txt file I got:

A.-C.䴀愀渀甀昀愀挀琀甀爀椀渀最 Company.⠀匀攀攀 Sebastian,䄀⸀ A., and䌀愀瀀攀猀Ⰰ assignors.) A.䜀⸀ A.刀 愀椀氀眀愀礀 Light☀ Signal䌀漀⸀ (See䴀攀搀攀渀 Ⰰ Elof Hassignor.)
A-N䌀漀洀瀀愀渀礀Ⰰ The.⠀匀攀攀 Alexander愀渀搀 Nasb,愀猀ⴀ 猀椀最渀漀爀猀⸀㬀 䄀一 Company,吀栀攀⸀ (See一愀猀栀Ⰰ It.䨀⸀Ⰰ and䄀氀攀砀 愀渀搀攀爀Ⰰ as-

So what did I do to cause all of the strange characters? Also, since this goes on throughout the file, it is as if it removed every \n and not just the ones after a hyphen, which I was using as my test case.

Thanks again.
Jay

> Then move line by line through the file and delete everything
> after a numerical sequence

Slightly more tricky because you need to use a regular expression. But if you know regex then only slightly.

> I am wondering if Python would be a good tool

Absolutely, it's one of the areas where Python excels.

> find information on how to accomplish this

You could check my tutorial on the three topics:

Handling text
Handling files
Regular Expressions.

Also the standard Python documentation for the general tutorial (assuming you've done basic programming in some other language before) plus the re module.

> using something like the unix tool awk or something else??

awk or sed could both be used, but Python is more generally useful, so unless you already know awk I'd take the time to learn the basics of Python (a few hours maybe) and use that.

--
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld
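A possible reading of the strange output in this message, offered as a guess rather than a diagnosis. Because read() returns one string, `for line in text` visits single characters; `line[:-1]` is then empty, and an empty string counts as being "in" any string, so rstrip() runs on every character and all whitespace disappears. If the input file was saved as UTF-16 (an assumption; nothing in the thread confirms it), each character occupies two bytes, and dropping only the space byte shifts the pairing for the following word. That would give this kind of alternating readable and unreadable text. The sketch below reproduces the effect on one sample phrase; it is written for current Python, where bytes and text are separate types.

```python
phrase = "A.-C. Manufacturing Company."
raw = phrase.encode("utf-16-le")     # assumed encoding of the input file

# The test used in the loop is true for every single character.
always_true = "" in "-"

# Mimic the loop: rstrip() each byte on its own, so whitespace bytes vanish
# while the zero byte that accompanied each space stays behind.
kept = bytearray()
for i in range(len(raw)):
    kept.extend(raw[i:i + 1].rstrip())

# Decode the damaged bytes the way a text editor expecting UTF-16 would.
garbled = bytes(kept).decode("utf-16-le")
```

In this sketch "A.-C." and "Company." survive, while "Manufacturing" turns into the same run of characters seen in the output above. If the file really is UTF-16, decoding it to text before processing it line by line would be the safer route.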
[Tutor] Should I use python for parsing text
I am using an Intel iMac with OS-X 10.4.8. It has Python 2.3.5.

My issue is that I have a lot of text (about 500 pages at the moment) that I need to parse so that I can eliminate info I don't need, break the remainder into fields and put in a database/spreadsheet. See example next:

A.-C. Manufacturing Company. (See Sebastian, A. A.,
and Capes, assignors.)
A. G. A. Railway Light & Signal Co. (See Meden, Elof H„ assignor.)
A-N Company, The. (See Alexander and Nasb, as-
signors.;
AN Company, The. (See Nash, It. J., and Alexander, as-
signors.)
A/S. Arendal Smelteverk.(See Kaaten, Einar, assignor.)
A/S. Bjorgums Gevaei'kompani. (See Bjorguni, Nils, as-
signor.)
A/S Mekano. (Sec Schepeler, Herman A., assignor.)
A/S Myrens Verkstad.(See Klling, Jens W. A., assignor.)
A/S Stordo Kisgruber. (See Nielsen, C., and Ilelleland, assignors.)
A-Z Company, The.'See llanmer, Laurence G., assignor.)
Aagaard, Carl L., Rockford, 111. Hand scraping tool. No. 1,345,058 ; July 6; v. 276 ; p. 05.
Aalborg, Christian, Wllkinsburg, Pa., assignor to Wcst-
inghouse Electric and Manufacturing Company. Trol-
ley.No. 1,334,943 ; Mar. 30 ; v. 272 ; p. 741.
Aaron, Solomon E., Boston, Mass. Pliers. No. 1,329,155 ;
Jan. 27 ; v. 270 ; p. 554.

For instance, I would like to go to end of line and if last character is a comma or semicolon or hyphen then remove the CR. Then move line by line through the file and delete everything after a numerical sequence.

I am wondering if Python would be a good tool and if so where can I find information on how to accomplish this, or would I be better off using something like the unix tool awk or something else??

Thanks
Jay
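The two steps described here could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested solution for the real 500 pages: the function names are made up, a trailing hyphen is assumed to mark a word split across lines (so the hyphen is dropped when joining), and "a numerical sequence" is assumed to mean the patent number that follows "No.". The code is written to run on current Python.

```python
import re

# Assumed shape of the number: "No." followed by digits and commas.
PATENT_NUMBER = re.compile(r"No\.\s*\d[\d,]*")


def join_wrapped_lines(text):
    """Join a line to the next one when it ends in ',', ';' or '-'."""
    pieces = []
    for line in text.splitlines():
        stripped = line.rstrip()
        if stripped.endswith("-"):
            # Assumed word break: drop the hyphen and join without a space.
            pieces.append(stripped[:-1])
        elif stripped.endswith((",", ";")):
            # Continued entry: keep the punctuation, add a single space.
            pieces.append(stripped + " ")
        else:
            pieces.append(stripped + "\n")
    return "".join(pieces)


def cut_after_number(line):
    """Keep the text up to and including the first 'No. <digits>' match."""
    match = PATENT_NUMBER.search(line)
    if match is None:
        return line
    return line[:match.end()]


sample = (
    "Aaron, Solomon E., Boston, Mass. Pliers. No. 1,329,155 ;\n"
    "Jan. 27 ; v. 270 ; p. 554.\n"
    "AN Company, The. (See Nash, It. J., and Alexander, as-\n"
    "signors.)\n"
)
records = join_wrapped_lines(sample).splitlines()
trimmed = cut_after_number(records[0])
```

Genuine hyphens that happen to fall at the end of a line would be lost by this approach, so the joined output is worth checking by eye on a small sample first.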