Re: Text Processing
On Dec 21, 2:01 am, Alexander Kapps <alex.ka...@web.de> wrote:
> On 20.12.2011 22:04, Nick Dokos wrote:
>>> I have a text file containing such data ;
>>>
>>>  A           B          C
>>> ---
>>> -2.0100e-01  8.000e-02  8.000e-05
>>> -2.e-01      0.000e+00  4.800e-04
>>> -1.9900e-01  4.000e-02  1.600e-04
>>>
>>> But I only need Section B, and I need to change the notation to ;
>>>
>>> 8.000e-02 = 0.08
>>> 0.000e+00 = 0.00
>>> 4.000e-02 = 0.04
>>
>> Does it have to be python?  If not, I'd go with something similar to
>>
>>     sed '1,2d' foo.data | awk '{printf("%.2f\n", $2);}'
>
> Why sed and awk:
>
> awk 'NR>2 {printf("%.2f\n", $2);}' data.txt
>
> And in Python:
>
> f = open("data.txt")
> f.readline()    # skip header
> f.readline()    # skip header
> for line in f:
>     print "%02s" % float(line.split()[1])

@Jerome ; Your suggestion produced a floating point error; it might
need some slight modification.

@Nick ; Sorry mate, it needs to be in Python. But I noted the solution
in case I need it for another case.

@Alexander ; Works as expected.

Thank you all for the replies.
--
http://mail.python.org/mailman/listinfo/python-list
Re: Text Processing
On 12/20/2011 02:17 PM, Yigit Turgut wrote:
> Hi all,
>
> I have a text file containing such data ;
>
>  A           B          C
> ---
> -2.0100e-01  8.000e-02  8.000e-05
> -2.e-01      0.000e+00  4.800e-04
> -1.9900e-01  4.000e-02  1.600e-04
>
> But I only need Section B, and I need to change the notation to ;
>
> 8.000e-02 = 0.08
> 0.000e+00 = 0.00
> 4.000e-02 = 0.04
>
> Text file is approximately 10MB in size. I looked around to see if
> there is a quick and dirty workaround but there are lots of modules,
> lots of options.. I am confused. Which module is most suitable for
> this task ?

You probably don't need anything but sys (to parse the command options)
and os (maybe).

    open the file
    for each line:
        if one of the header lines, continue
        separate out the part you want
        print it, formatted as you like

Then just run the script with its stdout redirected, and you've got your
new file.

The details depend on what your experience with Python is, and what
version of Python you're running.

-- 
DaveA
--
http://mail.python.org/mailman/listinfo/python-list
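Dave's outline is easy to flesh out. Below is a minimal sketch of those steps; the column layout, two-line header, and sample values are taken from the thread, while the function name is made up for illustration:

```python
# Sketch of the outline above: skip headers, pull one column, reformat.
# Assumes whitespace-separated columns and a two-line header.

def extract_column(lines, col=1, skip=2):
    """Return the requested column, reformatted, skipping header lines."""
    out = []
    for i, line in enumerate(lines):
        if i < skip:                 # header lines
            continue
        fields = line.split()
        if len(fields) > col:
            out.append("%.2f" % float(fields[col]))
    return out

lines = ["A B C", "---",
         "-2.0100e-01 8.000e-02 8.000e-05",
         "-2.0000e-01 0.000e+00 4.800e-04",
         "-1.9900e-01 4.000e-02 1.600e-04"]
print(extract_column(lines))
```

Run against the sample rows from the thread, this yields the `0.08`, `0.00`, `0.04` values the poster asked for.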
Re: Text Processing
Tue, 20 Dec 2011 11:17:15 -0800 (PST)
Yigit Turgut a écrit:
> Hi all,
>
> I have a text file containing such data ;
>
>  A           B          C
> ---
> -2.0100e-01  8.000e-02  8.000e-05
> -2.e-01      0.000e+00  4.800e-04
> -1.9900e-01  4.000e-02  1.600e-04
>
> But I only need Section B, and I need to change the notation to ;
>
> 8.000e-02 = 0.08
> 0.000e+00 = 0.00
> 4.000e-02 = 0.04
>
> Text file is approximately 10MB in size. I looked around to see if
> there is a quick and dirty workaround but there are lots of modules,
> lots of options.. I am confused. Which module is most suitable for
> this task ?

You could try to do it yourself. You'd need to know what separates the
data. Tabulation character? Spaces?

Example:

Input file
----------

 A           B          C
---
-2.0100e-01  8.000e-02  8.000e-05
-2.e-01      0.000e+00  4.800e-04
-1.9900e-01  4.000e-02  1.600e-04

Python code
-----------

# Open file
with open('test1.plt', 'r') as f:
    b_values = []
    # skip as many lines as needed
    line = f.readline()
    line = f.readline()
    line = f.readline()
    while line:
        #start = line.find(u"\u0009", 0) + 1   # seek Tab
        start = line.find("    ", 0) + 4       # seek 4 spaces
        #end = line.find(u"\u0009", start)
        end = line.find("    ", start)
        b_values.append(float(line[start:end].strip()))
        line = f.readline()

print b_values

It gets trickier if the amount of spaces is not constant. I would then
try with regular expressions. Perhaps regexps would be more efficient in
any case.

-- 
Jérôme
--
http://mail.python.org/mailman/listinfo/python-list
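For the non-constant-spacing case Jérôme mentions at the end, a regex that grabs the second whitespace-separated field works regardless of how the columns are padded. A small sketch; the helper name and the two-line header are assumptions:

```python
# Grab field 2 of each data line, tolerant of tabs or variable spaces.
import re

field2 = re.compile(r"^\s*\S+\s+(\S+)")

def second_fields(lines, skip=2):
    """Return column B as floats, skipping `skip` header lines."""
    values = []
    for line in lines[skip:]:
        m = field2.match(line)
        if m:
            values.append(float(m.group(1)))
    return values

print(second_fields(["A B C", "---",
                     "-2.0100e-01   8.000e-02  8.000e-05",
                     "-2.0000e-01\t0.000e+00\t4.800e-04"]))
```

The same effect could be had with `line.split()[1]`, since `str.split()` with no argument already collapses runs of any whitespace.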
Re: Text Processing
Jérôme <jer...@jolimont.fr> wrote:
> Tue, 20 Dec 2011 11:17:15 -0800 (PST)
> Yigit Turgut a écrit:
>> Hi all,
>>
>> I have a text file containing such data ;
>>
>>  A           B          C
>> ---
>> -2.0100e-01  8.000e-02  8.000e-05
>> -2.e-01      0.000e+00  4.800e-04
>> -1.9900e-01  4.000e-02  1.600e-04
>>
>> But I only need Section B, and I need to change the notation to ;
>>
>> 8.000e-02 = 0.08
>> 0.000e+00 = 0.00
>> 4.000e-02 = 0.04
>>
>> Text file is approximately 10MB in size. I looked around to see if
>> there is a quick and dirty workaround but there are lots of modules,
>> lots of options.. I am confused. Which module is most suitable for
>> this task ?
>
> You could try to do it yourself.

Does it have to be python?  If not, I'd go with something similar to

    sed '1,2d' foo.data | awk '{printf("%.2f\n", $2);}'

Nick
--
http://mail.python.org/mailman/listinfo/python-list
Re: Text Processing
On 20.12.2011 22:04, Nick Dokos wrote:
>> I have a text file containing such data ;
>>
>>  A           B          C
>> ---
>> -2.0100e-01  8.000e-02  8.000e-05
>> -2.e-01      0.000e+00  4.800e-04
>> -1.9900e-01  4.000e-02  1.600e-04
>>
>> But I only need Section B, and I need to change the notation to ;
>>
>> 8.000e-02 = 0.08
>> 0.000e+00 = 0.00
>> 4.000e-02 = 0.04
>
> Does it have to be python?  If not, I'd go with something similar to
>
>     sed '1,2d' foo.data | awk '{printf("%.2f\n", $2);}'

Why sed and awk:

awk 'NR>2 {printf("%.2f\n", $2);}' data.txt

And in Python:

f = open("data.txt")
f.readline()    # skip header
f.readline()    # skip header
for line in f:
    print "%02s" % float(line.split()[1])
--
http://mail.python.org/mailman/listinfo/python-list
Re: emacs lisp text processing example (html5 figure/figcaption)
On Mon, Jul 4, 2011 at 12:36 AM, Xah Lee <xah...@gmail.com> wrote:
> So, a solution by regex is out.

Actually, none of the complications you listed appear to exclude
regexes.  Here's a possible (untested) solution:

<div class="img">
((?:\s*<img src="[^.]+\.(?:jpg|png|gif)" alt="[^"]+" width="[0-9]+" height="[0-9]+">)+)
\s*<p class="cpt">((?:[^<]|<(?!/p>))+)</p>
\s*</div>

and corresponding replacement string:

<figure>
\1
<figcaption>\2</figcaption>
</figure>

I don't know what dialect Emacs uses for regexes; the above is the
Python re dialect.  I assume it is translatable.  If not, then the above
should at least work with other editors, such as Komodo's Find/Replace
in Files command.

I kept the line breaks here for readability, but for completeness they
should be stripped out of the final regex.  The possibility of nested
HTML in the caption is allowed for by using a negative look-ahead
assertion to accept any tag except a closing </p>.  It would break if
you had nested <p> tags, but then that would be invalid html anyway.

Cheers,
Ian
--
http://mail.python.org/mailman/listinfo/python-list
Re: emacs lisp text processing example (html5 figure/figcaption)
On Jul 4, 12:13 pm, S.Mandl <stefanma...@web.de> wrote:
> Nice. I guess that XSLT would be another (the official) approach for
> such a task. Is there an XSLT-engine for Emacs?
>
> -- Stefan

haven't used XSLT, and don't know if there's one in emacs... it'd be
nice if someone actually gave an example...

 Xah
--
http://mail.python.org/mailman/listinfo/python-list
Re: emacs lisp text processing example (html5 figure/figcaption)
On Jul 5, 12:17 pm, Ian Kelly <ian.g.ke...@gmail.com> wrote:
> On Mon, Jul 4, 2011 at 12:36 AM, Xah Lee <xah...@gmail.com> wrote:
>> So, a solution by regex is out.
>
> Actually, none of the complications you listed appear to exclude
> regexes.  Here's a possible (untested) solution:
>
> <div class="img">
> ((?:\s*<img src="[^.]+\.(?:jpg|png|gif)" alt="[^"]+" width="[0-9]+" height="[0-9]+">)+)
> \s*<p class="cpt">((?:[^<]|<(?!/p>))+)</p>
> \s*</div>
>
> and corresponding replacement string:
>
> <figure>
> \1
> <figcaption>\2</figcaption>
> </figure>
>
> I don't know what dialect Emacs uses for regexes; the above is the
> Python re dialect.  I assume it is translatable.  If not, then the
> above should at least work with other editors, such as Komodo's
> Find/Replace in Files command.
>
> I kept the line breaks here for readability, but for completeness
> they should be stripped out of the final regex.  The possibility of
> nested HTML in the caption is allowed for by using a negative
> look-ahead assertion to accept any tag except a closing </p>.  It
> would break if you had nested <p> tags, but then that would be
> invalid html anyway.
>
> Cheers,
> Ian

that's fantastic. Thanks! I'll try it out.

 Xah
--
http://mail.python.org/mailman/listinfo/python-list
Re: emacs lisp text processing example (html5 figure/figcaption)
On Jul 5, 12:17 pm, Ian Kelly <ian.g.ke...@gmail.com> wrote:
> On Mon, Jul 4, 2011 at 12:36 AM, Xah Lee <xah...@gmail.com> wrote:
>> So, a solution by regex is out.
>
> Actually, none of the complications you listed appear to exclude
> regexes.  Here's a possible (untested) solution:
>
> <div class="img">
> ((?:\s*<img src="[^.]+\.(?:jpg|png|gif)" alt="[^"]+" width="[0-9]+" height="[0-9]+">)+)
> \s*<p class="cpt">((?:[^<]|<(?!/p>))+)</p>
> \s*</div>
>
> and corresponding replacement string:
>
> <figure>
> \1
> <figcaption>\2</figcaption>
> </figure>
>
> I don't know what dialect Emacs uses for regexes; the above is the
> Python re dialect.  I assume it is translatable.

emacs regex supports shy groups (the 「(?:…)」), but it doesn't support
the negative assertion 「(?!…)」 though.

but in any case, i can't see how this part would work

<p class="cpt">((?:[^<]|<(?!/p>))+)</p>

?

 Xah
--
http://mail.python.org/mailman/listinfo/python-list
Re: emacs lisp text processing example (html5 figure/figcaption)
On Tue, Jul 5, 2011 at 2:37 PM, Xah Lee <xah...@gmail.com> wrote:
> but in anycase, i can't see how this part would work
>
> <p class="cpt">((?:[^<]|<(?!/p>))+)</p>

It's not that different from the pattern 「alt="[^"]+"」 earlier in the
regex.  The capture group accepts one or more characters that either
aren't '<', or that are '<' but are not immediately followed by '/p>'.
Thus it stops capturing when it sees exactly '</p>' without consuming
the '<'.

Using my regex with the example that you posted earlier demonstrates
that it works:

>>> import re
>>> s = '''<div class="img">
... <img src="jamie_cat.jpg" alt="jamie's cat" width="167" height="106">
... <p class="cpt">jamie's cat! Her blog is <a href="http://example.com/jamie/">http://example.com/jamie/</a></p>
... </div>'''
>>> print re.sub(pattern, replace, s)
<figure>
<img src="jamie_cat.jpg" alt="jamie's cat" width="167" height="106">
<figcaption>jamie's cat! Her blog is <a href="http://example.com/jamie/">http://example.com/jamie/</a></figcaption>
</figure>

Cheers,
Ian
--
http://mail.python.org/mailman/listinfo/python-list
Re: emacs lisp text processing example (html5 figure/figcaption)
> haven't used XSLT, and don't know if there's one in emacs... it'd be
> nice if someone actually gave an example...

Hi Xah, actually I have to correct myself. HTML is not XML. If it were,
you could use a stylesheet like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="p[@class='cpt']">
    <figcaption>
      <xsl:value-of select="."/>
    </figcaption>
  </xsl:template>

  <xsl:template match="div[@class='img']">
    <figure>
      <xsl:apply-templates select="@*|node()"/>
    </figure>
  </xsl:template>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

which applied to this document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<doc>
<h1>Just having fun</h1>with all the
<div class="img">
<img src="cat1.jpg" alt="my cat" width="200" height="200"/>
<img src="cat2.jpg" alt="my cat" width="200" height="200"/>
<p class="cpt">my 2 cats</p>
</div>
cats here:
<h1>Just fooling around</h1>
<div class="img">
<img src="jamie_cat.jpg" alt="jamie's cat" width="167" height="106"/>
<p class="cpt">jamie's cat! Her blog is <a href="http://example.com/jamie/">http://example.com/jamie/</a></p>
</div>
</doc>

would yield:

<?xml version="1.0"?>
<doc>
<h1>Just having fun</h1>with all the
<figure class="img">
<img src="cat1.jpg" alt="my cat" width="200" height="200"/>
<img src="cat2.jpg" alt="my cat" width="200" height="200"/>
<figcaption>my 2 cats</figcaption>
</figure>
cats here:
<h1>Just fooling around</h1>
<figure class="img">
<img src="jamie_cat.jpg" alt="jamie's cat" width="167" height="106"/>
<figcaption>jamie's cat! Her blog is http://example.com/jamie/</figcaption>
</figure>
</doc>

But well, as you don't have XML as input ... there really was no point
to my remark.

Best,
Stefan
--
http://mail.python.org/mailman/listinfo/python-list
emacs lisp text processing example (html5 figure/figcaption)
OMG, emacs lisp beats perl/python again!

Hiya all, another little emacs lisp tutorial from the tiny Xah's Edu
Corner.

〈Emacs Lisp: Processing HTML: Transform Tags to HTML5 “figure” and
“figcaption” Tags〉
xahlee.org/emacs/elisp_batch_html5_tag_transform.html

plain text version follows.

--------------------------------------------------
Emacs Lisp: Processing HTML: Transform Tags to HTML5 “figure” and
“figcaption” Tags

Xah Lee, 2011-07-03

Another triumph of using elisp for text processing over perl/python.

The Problem
--------------------------------------------------

Summary

I want to batch transform the image tags in 5 thousand html files to
use HTML5's new “figure” and “figcaption” tags. I want to be able to
view each change interactively, while optionally giving it a “go ahead”
to do the whole job in batch. Interactive eye-ball verification on many
cases lets me be reasonably sure the transform is done correctly. Yet i
don't want to spend days to think/write/test a mathematically correct
program that otherwise can be finished in 30 min with human
interaction.

Detail

HTML5 has the following new tags: “figure” and “figcaption”. They are
used like this:

<figure>
<img src="cat.jpg" alt="my cat" width="167" height="106">
<figcaption>my cat!</figcaption>
</figure>

(For detail, see: HTML5 “figure” & “figurecaption” Tags Browser
Support)

On my website, i used a similar structure. They look like this:

<div class="img">
<img src="cat.jpg" alt="my cat" width="167" height="106">
<p class="cpt">my cat!</p>
</div>

So, i want to replace them with the HTML5's new tags. This can be done
with a regex. Here's the “find” regex:

<div class="img"> ?<img src="\([^.]+?\)\.jpg" alt="\([^"]+?\)" width="\([0-9]+?\)" height="\([0-9]+?\)"> ?<p class="cpt">\([^<]+?\)</p> ?</div>

Here's the replacement string:

<figure> <img src="\1.jpg" alt="\2" width="\3" height="\4"> <figcaption>\5</figcaption> </figure>

Then, you can use “find-file” and dired's
“dired-do-query-replace-regexp” to work on your 5 thousand pages. Nice.
(See: Emacs: Interactively Find & Replace String Patterns on Multiple
Files.)

However, the problem here is more complicated.
The image file may be jpg or png or gif. Also, there may be more than
one image per group. Also, the caption part may also contain
complicated html. Here's some examples:

<div class="img">
<img src="cat1.jpg" alt="my cat" width="200" height="200">
<img src="cat2.jpg" alt="my cat" width="200" height="200">
<p class="cpt">my 2 cats</p>
</div>

<div class="img">
<img src="jamie_cat.jpg" alt="jamie's cat" width="167" height="106">
<p class="cpt">jamie's cat! Her blog is <a href="http://example.com/jamie/">http://example.com/jamie/</a></p>
</div>

So, a solution by regex is out.

Solution

The solution is pretty simple. Here's the major steps:

* Use “find-lisp-find-files” to traverse a dir.
* For each file, open it.
* Search for the string <div class="img">
* Use “sgml-skip-tag-forward” to jump to its closing tag.
* Save the positions of these tag begin/end positions.
* Ask user if she wants to replace. If so, do it. (using
  “delete-region” and “insert”)
* Repeat.

Here's the code:

;; -*- coding: utf-8 -*-
;; 2011-07-03
;; replace image tags to use html5's “figure” and “figcaption” tags.

;; Example. This:
;; <div class="img">…</div>
;; should become this
;; <figure>…</figure>
;; do this for all files in a dir.

;; rough steps:
;; find the <div class="img">
;; use sgml-skip-tag-forward to move to the ending tag.
;; save their positions.

(defun my-process-file (fpath)
  "process the file at fullpath FPATH ..."
  (let (mybuff p1 p2 p3 p4)
    (setq mybuff (find-file fpath))
    (widen)
    (goto-char 0) ;; in case buffer already open
    (while (search-forward "<div class=\"img\">" nil t)
      (progn
        (setq p2 (point))
        (backward-char 17) ; beginning of “div” tag
        (setq p1 (point))
        (forward-char 1)
        (sgml-skip-tag-forward 1) ; move to the closing tag
        (setq p4 (point))
        (backward-char 6) ; beginning of the closing div tag
        (setq p3 (point))
        (narrow-to-region p1 p4)
        (when (y-or-n-p "replace?")
          (progn
            (delete-region p3 p4)
            (goto-char p3)
            (insert "</figure>")
            (delete-region p1 p2)
            (goto-char p1)
            (insert "<figure>")
            (widen)))))
    (when (not (buffer-modified-p mybuff))
      (kill-buffer mybuff))))

(require 'find-lisp)

(let (outputBuffer)
  (setq outputBuffer "*xah img/figure replace output*")
  (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file
          (find-lisp-find-files "~/web/xahlee_org/emacs/" "\\.html$"))
    (princ "Done deal!")))

Seems pretty simple, right?

The “p1” and “p2” variables are the positions of start/end of <div
class="img">. The “p3” and “p4” are the start/end of its closing tag
</div>.

We also used a little trick with “widen” and “narrow-to-region”. It
lets me see just the part that i'm interested in. It narrows to the
beginning/end of the div.img. This makes eye-balling a bit easier.

The real time
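For comparison, the same skeleton (find the opening tag, skip to its matching close, rename both ends) can be sketched in Python. This is not Xah's code, just an illustration; it assumes well-formed, lowercase tags where `<div` only ever begins a real tag, and it is non-interactive:

```python
# Rename <div class="img">…<p class="cpt">…</p></div> groups to
# <figure>…<figcaption>…</figcaption></figure>.  The nesting-depth scan
# plays the role of sgml-skip-tag-forward above.

def to_figure(html):
    open_tag = '<div class="img">'
    out, pos = [], 0
    while True:
        start = html.find(open_tag, pos)
        if start == -1:
            out.append(html[pos:])
            return "".join(out)
        # find the matching </div>, allowing nested divs inside
        depth, i = 1, start + len(open_tag)
        while depth:
            nxt_open = html.find("<div", i)
            nxt_close = html.find("</div>", i)
            if nxt_open != -1 and nxt_open < nxt_close:
                depth += 1
                i = nxt_open + 4
            else:
                depth -= 1
                i = nxt_close + 6
        body = html[start + len(open_tag):i - 6]
        body = body.replace('<p class="cpt">', "<figcaption>", 1)
        # rename only the caption's own closing </p> (the last one)
        k = body.rfind("</p>")
        if k != -1:
            body = body[:k] + "</figcaption>" + body[k + 4:]
        out.append(html[pos:start] + "<figure>" + body + "</figure>")
        pos = i

print(to_figure('<div class="img"><img src="cat.jpg"><p class="cpt">my cat</p></div>'))
```

Against the multi-image and nested-link examples from the article, this handles the cases the single regex could not, at the cost of trusting the input to be well-formed.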
Re: emacs lisp text processing example (html5 figure/figcaption)
Nice. I guess that XSLT would be another (the official) approach for such a task. Is there an XSLT-engine for Emacs? -- Stefan -- http://mail.python.org/mailman/listinfo/python-list
Is text processing with dicts a good use case for Python cross-compilers like Cython/Pyrex or ShedSkin?
Is text processing with dicts a good use case for Python
cross-compilers like Cython/Pyrex or ShedSkin? (I've read the cross
compiler claims about massive increases in pure numeric performance.)

I have 3 use cases I'm considering for Python-to-C++ cross-compilers
for generating 32-bit Python extension modules for Python 2.7 for
Windows.

1. Parsing UTF-8 files (basic Python with lots of string processing and
dict lookups)

2. Generating UTF-8 files from nested list/dict structures

3. Parsing large ASCII CSV-like files and using dicts to calculate
simple statistics like running totals, min, max, etc.

Are any of these text processing scenarios good use cases for tools
like Cython, Pyrex, or ShedSkin? Are any of these specifically bad use
cases for these tools?

We've tried Psyco and it has sped up some of our parsing utilities by
200%. But Psyco doesn't support Python 2.7 yet and we're committed to
using Python 2.7 moving forward.

Malcolm
--
http://mail.python.org/mailman/listinfo/python-list
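For reference, use case 3 in plain CPython might look like the sketch below; it is the kind of baseline worth timing before and after compiling. The column layout (key in field 0, value in field 1) is an assumption:

```python
# Per-key running total / min / max / count over CSV-like rows,
# using only builtin dicts and lists.

def accumulate(lines, sep=","):
    stats = {}
    for line in lines:
        fields = line.split(sep)
        key, value = fields[0], float(fields[1])
        s = stats.get(key)
        if s is None:
            stats[key] = [value, value, value, 1]  # total, min, max, count
        else:
            s[0] += value
            if value < s[1]:
                s[1] = value
            if value > s[2]:
                s[2] = value
            s[3] += 1
    return stats

print(accumulate(["a,1.0", "b,2.5", "a,3.0"]))
```

Since the hot loop is `str.split`, `float`, and dict access, this is also a reasonable probe of how much headroom a compiler has on code that already spends its time in C-implemented builtins.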
Re: Is text processing with dicts a good use case for Python cross-compilers like Cython/Pyrex or ShedSkin?
pyt...@bdurham.com, 16.12.2010 21:03:
> Is text processing with dicts a good use case for Python
> cross-compilers like Cython/Pyrex or ShedSkin? (I've read the cross
> compiler claims about massive increases in pure numeric performance).

Cython is generally a good choice for string processing, simply because
it can drop a lot of code into plain C, such as character iteration and
comparison. Depending on what kind of operations you do, you can get
speed-ups of 100x or more for that.

http://docs.cython.org/src/tutorial/strings.html

However, when it comes to dict lookups, it uses CPython's own dicts
which are heavily optimised for string lookups already. So the speedup
in that area will likely stay below 30%. Similarly, encoding and
decoding use Python's codecs, so don't expect a major difference there.

> I have 3 use cases I'm considering for Python-to-C++ cross-compilers
> for generating 32-bit Python extension modules for Python 2.7 for
> Windows.
>
> 1. Parsing UTF-8 files (basic Python with lots of string processing
> and dict lookups)

Parsing sounds like something that could easily benefit from Cython
compilation.

> 2. Generating UTF-8 files from nested list/dict structures

That should be much faster in Cython, too, simply because iteration on
builtin types is much faster than in Python.

> 3. Parsing large ASCII CSV-like files and using dict's to calculate
> simple statistics like running totals, min, max, etc.

Again, parsing will be much faster, especially when reading from raw C
files (which would also enable freeing the GIL, in case you want to use
multi-threading). The rest may not win that much.

A nice feature of Cython is that you do not have to go low-level right
away. You can use all the niceness of Python, and only push the code
closer to C level where your benchmarks point you. And if you really
have to go all the way down to C, it's just a declaration away.

> Are any of these text processing scenarios good use cases for tools
> like Cython, Pyrex, or ShedSkin?
> Are any of these specifically bad use cases for these tools?

Pyrex isn't worth trying here, simply because you'd have to invest a
lot more work to make it as fast as what Cython gives you anyway.

ShedSkin may be worth a try, depending on how well you get your
ShedSkin module integrated with CPython. (It seems that it has support
for building extension modules by now, but I have no idea how well that
is fleshed out.)

> We've tried Psyco and it has sped up some of our parsing utilities by
> 200%. But Psyco doesn't support Python 2.7 yet and we're committed to
> using Python 2.7 moving forward.

If 3x is not enough for you, I strongly suggest you try Cython. The C
code that it generates compiles nicely in all major Python versions,
currently from 2.3 to 3.2.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list
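One way to act on the "where your benchmarks point you" advice quoted above is to profile before porting anything. A small Python 3 sketch with cProfile, where `parse` is just a stand-in for the real hot spot:

```python
# Profile a candidate hot spot before deciding what to compile.
import cProfile
import io
import pstats

def parse(lines):
    # stand-in for the real parsing function
    return [line.split(",")[0] for line in lines]

profiler = cProfile.Profile()
profiler.enable()
keys = parse(["k,%d" % i for i in range(1000)])
profiler.disable()

# summarise the five most expensive calls by cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

If the report shows the time going into C-implemented builtins (dict lookups, `str.split`), the compilers discussed here will have less to bite on than if it goes into pure-Python loops.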
Simple Text Processing
New to Python. I can solve the problem in perl by using split() to an array. Can't figure it out in Python. I'm reading variable lines of text. I want to use the first number I find. The problem is the lines are variable. Input example: this is a number: 1 here are some numbers 1 2 3 4 In both lines I am only interested in the 1. I can't figure out how to use split() as it appears to make me know how many space separated words are in the line. I do not know this. I use: a,b,c,e = split() to get the first line in the example. The second line causes a runtime exception. Can I use split for this? Is there another simple way to break the words into an array that I can loop over? Thanks. Andy -- http://mail.python.org/mailman/listinfo/python-list
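One way to sidestep the fixed-arity unpacking problem in the question above: split the line and take the first token that parses as a number. The helper name is made up for illustration:

```python
# Return the first whitespace-separated token that parses as a number,
# regardless of how many words the line contains.

def first_number(line):
    for token in line.split():
        try:
            return float(token)
        except ValueError:
            pass
    return None

print(first_number("this is a number: 1"))
print(first_number("here are some numbers 1 2 3 4"))
```

Both example lines from the question yield `1.0`, and a line with no numeric token yields `None` instead of raising.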
Re: Simple Text Processing
On Thu, Sep 10, 2009 at 11:36 AM, AJAskey <aske...@gmail.com> wrote:
> New to Python. I can solve the problem in perl by using split() to an
> array. Can't figure it out in Python.
>
> I'm reading variable lines of text. I want to use the first number I
> find. The problem is the lines are variable.
>
> Input example:
>
> this is a number: 1
> here are some numbers 1 2 3 4
>
> In both lines I am only interested in the 1. I can't figure out how to
> use split() as it appears to make me know how many space separated
> words are in the line. I do not know this.
>
> I use: a,b,c,e = split() to get the first line in the example. The
> second line causes a runtime exception.
>
> Can I use split for this? Is there another simple way to break the
> words into an array that I can loop over?

>>> line = "here are some numbers 1 2 3 4"
>>> a = line.split()
>>> a
['here', 'are', 'some', 'numbers', '1', '2', '3', '4']
>>> # Python 3 only
... a, b, c, d, *e = line.split()
>>> e
['1', '2', '3', '4']

> Thanks.
> Andy
--
http://mail.python.org/mailman/listinfo/python-list
Re: Simple Text Processing
Never mind. I guess I had been trying to make it more difficult than it
is. As a note, I can work on something for 10 hours and not figure it
out. But the second I post to a group, then I immediately figure it out
myself. Strange snake this Python...

Example for anyone else interested:

line = "this is a line"
print line
a = line.split()
print a
print a[0]
print a[1]
print a[2]
print a[3]

--- OUTPUT:

this is a line
['this', 'is', 'a', 'line']
this
is
a
line

On Sep 10, 11:36 am, AJAskey <aske...@gmail.com> wrote:
> New to Python. I can solve the problem in perl by using split() to an
> array. Can't figure it out in Python.
>
> I'm reading variable lines of text. I want to use the first number I
> find. The problem is the lines are variable.
>
> Input example:
>
> this is a number: 1
> here are some numbers 1 2 3 4
>
> In both lines I am only interested in the 1. I can't figure out how to
> use split() as it appears to make me know how many space separated
> words are in the line. I do not know this.
>
> I use: a,b,c,e = split() to get the first line in the example. The
> second line causes a runtime exception.
>
> Can I use split for this? Is there another simple way to break the
> words into an array that I can loop over?
>
> Thanks.
> Andy
--
http://mail.python.org/mailman/listinfo/python-list
Re: text processing SOLVED
Thanks Black Jack. Working.
--
http://mail.python.org/mailman/listinfo/python-list
Re: text processing
On Thu, 25 Sep 2008 15:51:28 +0100, [EMAIL PROTECTED] wrote:
> I have string like follow
>
> 12560/ABC,12567/BC,123,567,890/JK
>
> I want above string to group like as follow
>
> (12560,ABC)
> (12567,BC)
> (123,567,890,JK)
>
> i try regular expression i am able to get first two not the third one.
> can regular expression given data in different groups

Without regular expressions:

def group(string):
    result = list()
    for item in string.split(','):
        if '/' in item:
            result.extend(item.split('/'))
            yield tuple(result)
            result = list()
        else:
            result.append(item)

def main():
    string = '12560/ABC,12567/BC,123,567,890/JK'
    print list(group(string))

Ciao,
        Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list
Re: text processing
You can do it with regexps too:

import re

to_watch = re.compile(r"(?P<number>\d+)[/](?P<letter>[A-Z]+)")
final_list = to_watch.findall("12560/ABC,12567/BC,123,567,890/JK")
for number, word in final_list:
    print "number:%s -- word: %s" % (number, word)

the output is:

number:12560 -- word: ABC
number:12567 -- word: BC
number:890 -- word: JK

See you,
Kib².
--
http://mail.python.org/mailman/listinfo/python-list
Re: text processing
On Sep 25, 6:34 pm, Marc 'BlackJack' Rintsch [EMAIL PROTECTED] wrote:
> On Thu, 25 Sep 2008 15:51:28 +0100, [EMAIL PROTECTED] wrote:
>> I have string like follow
>>
>> 12560/ABC,12567/BC,123,567,890/JK
>>
>> I want above string to group like as follow
>>
>> (12560,ABC)
>> (12567,BC)
>> (123,567,890,JK)
>>
>> i try regular expression i am able to get first two not the third
>> one. can regular expression given data in different groups
>
> Without regular expressions:
>
> def group(string):
>     result = list()
>     for item in string.split(','):
>         if '/' in item:
>             result.extend(item.split('/'))
>             yield tuple(result)
>             result = list()
>         else:
>             result.append(item)
>
> def main():
>     string = '12560/ABC,12567/BC,123,567,890/JK'
>     print list(group(string))

How about:

>>> string = "12560/ABC,12567/BC,123,567,890/JK"
>>> r = re.findall(r"(\d+(?:,\d+)*/\w+)", string)
>>> r
['12560/ABC', '12567/BC', '123,567,890/JK']
>>> [tuple(x.replace(",", "/").split("/")) for x in r]
[('12560', 'ABC'), ('12567', 'BC'), ('123', '567', '890', 'JK')]
--
http://mail.python.org/mailman/listinfo/python-list
Re: text processing
On Sep 25, 9:51 am, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
> I have string like follow
>
> 12560/ABC,12567/BC,123,567,890/JK
>
> I want above string to group like as follow
>
> (12560,ABC)
> (12567,BC)
> (123,567,890,JK)
>
> i try regular expression i am able to get first two not the third one.
> can regular expression given data in different groups

Looks like each item is:
- a list of 1 or more integers, in a comma-delimited list
- a slash
- a word composed of alpha characters

And the whole thing is a list of items in a comma-delimited list.

Now to implement that in pyparsing:

>>> data = "12560/ABC,12567/BC,123,567,890/JK"
>>> from pyparsing import Suppress, delimitedList, Word, alphas, nums, Group
>>> SLASH = Suppress('/')
>>> dataitem = delimitedList(Word(nums)) + SLASH + Word(alphas)
>>> dataformat = delimitedList(Group(dataitem))
>>> map(tuple, dataformat.parseString(data))
[('12560', 'ABC'), ('12567', 'BC'), ('123', '567', '890', 'JK')]

Wah-lah! (as one of my wife's 1st graders announced in one of his
school papers)

-- Paul
--
http://mail.python.org/mailman/listinfo/python-list
emacs lisp as text processing language...
Text Processing with Emacs Lisp

Xah Lee, 2007-10-29

This page gives an outline of how to use emacs lisp to do text
processing, using a specific real-world problem as example. If you
don't know elisp, first take a gander at Emacs Lisp Basics. HTML
version with links and colors is at:
http://xahlee.org/emacs/elisp_text_processing.html

Following this post as a separate post, is some relevant (i hope)
remark about Perl and Python.

----------------------------------------
THE PROBLEM

Summary

I want to write an elisp program, that processes a list of given files.
Each file is a HTML file. For each file, i want to remove the link to
itself, in its page navigation bar.

More specifically, each file has a page navigation bar in this format:

<div class="pages">Goto Page: <a href="1.html">1</a>, <a
href="2.html">2</a>, <a href="3.html">3</a>, <a href="4.html">4</a>,
...</div>

where the file names and link texts are all arbitrary. (not as 1, 2, 3
shown here.) The link to itself needs to be removed.

Detail

My website has over 3 thousand files; many of the pages are in a
series. For example, i have a article on Algorithmic Mathematical Art,
which is broken into 3 HTML pages. So, at the bottom of each page, i
have a page navigation bar with code like this:

<div class="pages">Goto Page: <a
href="20040113_cmaci_larcu.html">1</a>, <a
href="cmaci_larcu2.html">2</a>, <a href="cmaci_larcu3.html">3</a></div>

In a browser, it would look like this:

[image: page tag]

Note that the link to the page itself really shouldn't be a link.
There are a total of 134 pages scattered about in various directories
that have this page navigation bar. I need some automated way to
process these files and remove the self-link.

I've been programing in perl professionally from 1998 to 2002 full
time. Typically, for this task in perl (or Python), i'd open each file,
read in the file, then use regex to do the replacement, then write out
the file. For replacement that spans over several lines, the regex
needs to act on the whole file (as opposed to one line at a time).
The regex can become quite complex or reach its limit. For a more
robust solution, a XML/HTML parser package can be used to read in the
file into a structured representation, then process that. Using a HTML
parser is a bit involved. Then, as usual, one may need to create
backups of the original files, and also deal with maintaining the
file's meta info such as keeping the same permission bits.

In summary, if the particular text-processing required is not simple,
then the coding gets fairly complex quickly, even if the job is trivial
in principle.

With emacs lisp, the task is vastly simplified, because emacs reads in
a file into its buffer representation. With buffers, one can move a
pointer back and forth, search and delete or insert text arbitrarily,
with the entire emacs lisp's suite of functions designed for processing
text, as well as the entire emacs environment that automatically deals
with maintaining files. (symbolic links, hard links, auto-backup
system, file meta-info maintenance, file locking, remote files... etc).

We proceed to write the elisp code to solve this problem.

----------------------------------------
SOLUTION

Here are the steps we need to do for each file:

* open the file in a buffer
* move cursor to the page navigation text
* move cursor to the file name
* run sgml-delete-tag (removes the link)
* save file
* close buffer

We begin by writing a test code to process a single file.

(defun xx ()
  "temp. experimental code"
  (interactive)
  (let (fpath fname mybuffer)
    (setq fpath "/Users/xah/test1.html")
    (setq fname (file-name-nondirectory fpath))
    (setq mybuffer (find-file fpath))
    (search-forward "<div class=\"pages\">Goto Page:")
    (search-forward fname)
    (sgml-delete-tag 1)
    (save-buffer)
    (kill-buffer mybuffer)))

First of all, create files test1.html, test2.html, test3.html in a temp
directory for testing this code.
Each file will contain this page navigation line:

<div class="pages">Goto Page: <a href="test1.html">some1</a>, <a
href="test2.html">another</a>, <a href="test3.html">xyz3</a></div>

Note that in actual files, the page-nav string may not be in a single
line.

The elisp code above is fairly simple and self-explanatory. The file
opening function find-file is found from elisp doc section “Files”. The
cursor moving function search-forward is in “Searching and Matching”,
the save or close buffer functions are in section “Buffer”.

Reference: Elisp Manual: Files.
Reference: Elisp Manual: Buffers.
Reference: Elisp Manual: Searching-and-Matching.

The interesting part is calling the function sgml-delete-tag. It is a
function loaded by html-mode (which is automatically loaded when a html
file is opened). What sgml-delete-tag does is to delete the tag that
encloses the cursor (both the opening and closing tags will be
deleted). The cursor can be anywhere from the beginning angle bracket
of the opening tag to the ending angle bracket of the closing tag. This
sgml-delete-tag function
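For comparison with the elisp approach, the self-link removal can be sketched in Python with a regex. This is only an illustration of the perl/python route the article mentions, not the author's code; it ignores backups and permission bits, and the function name is made up:

```python
# Unlink a page's own entry in its navigation bar:
# <a href="own_name">text</a>  ->  text
import re

def remove_self_link(html, own_name):
    pattern = r'<a href="%s">([^<]*)</a>' % re.escape(own_name)
    return re.sub(pattern, r"\1", html)

nav = ('<div class="pages">Goto Page: <a href="test1.html">some1</a>, '
       '<a href="test2.html">another</a>, <a href="test3.html">xyz3</a></div>')
print(remove_self_link(nav, "test2.html"))
```

A real batch run would wrap this in a loop over files, which is exactly the file-maintenance bookkeeping the article argues emacs handles for free.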
Re: emacs lisp as text processing language...
... continued from previous post. PS I'm cross-posting this post to the perl and python groups because I find it a little-known fact that emacs lisp's power in the area of text processing goes far beyond Perl (or Python). ... I have worked as a professional Perl programmer since 1998. I started to study elisp as a hobby in 2005. (I have used emacs daily since 1998.) It is only today, while I was studying elisp's file and buffer related functions, that I realized how elisp can be used as a general text processing language, and in fact is a dedicated language for this task, with powers quite beyond Perl (or Python, PHP, Ruby, Java, C, etc.). This realization surprised me, because it is well known that Perl is the de facto language for text processing, while emacs lisp for this purpose is almost unknown (outside of elisp developers). The surprise was exacerbated by the fact that Emacs Lisp predates Perl by almost a decade. (Albeit Emacs Lisp is not suitable for writing general applications.) My study of lisp as a text processing tool today reminded me of an article I read in 2000, “Ilya Regularly Expresses”, an interview with Dr Ilya Zakharevich (author of cperl-mode.el and a major contributor to the Perl language). In the article, he mentioned something about Perl's lack of text-processing primitives that are in emacs, which I did not fully understand at the time. (I didn't know elisp at the time.) The article is at: http://www.perl.com/lpt/a/2000/09/ilya.html Here's the relevant excerpt: « Let me also mention that classifying the text handling facilities of Perl as extremely agile gives me the willies. Perl's regular expressions are indeed more convenient than in other languages. However, the lack of a lot of key text-processing ingredients makes Perl solutions for many averagely complicated tasks either extremely slow, or not easier to maintain than solutions in other languages (and in some cases both).
I wrote a (heuristic-driven) Perlish syntax parser and transformer in Emacs Lisp, and though Perl as a language is incomparably friendlier than Lisps, I would not be even able of thinking about rewriting this tool in Perl: there are just not enough text-handling primitives hardwired into Perl. I will need to code all these primitives first. And having these primitives coded in Perl, the solution would turn out to be (possibly) hundreds times slower than the built-in Emacs operations. My current conjecture on why people classify Perl as an agile text- handler (in addition to obvious traits of false advertisements) is that most of the problems to handle are more or less trivial (system maintenance-type problems). For such problems Perl indeed shines. But between having simple solutions for simple problems and having it possible to solve complicated problems, there is a principle of having moderately complicated solutions for moderately complicated problems. There is no reason for Perl to be not capable of satisfying this requirement, but currently Perl needs improvement in this regard. » Xah [EMAIL PROTECTED] ∑ http://xahlee.org/ -- http://mail.python.org/mailman/listinfo/python-list
Re: Simple Text Processing Help
[EMAIL PROTECTED] wrote: And now for something completely different... I've been reading up a bit about Python and Excel and I quickly told the program to output to Excel quite easily. However, what if the input file were a Word document? I can't seem to find much information about parsing Word files. What could I add to make the same program work for a Word file? Word files are not human-readable. You parse them using Dispatch("Word.Application"), just the way you wrote the Excel file. I believe there are some third-party modules that will read a Word file a little more directly. -- Tim Roberts, [EMAIL PROTECTED] Providenza & Boekelheide, Inc. -- http://mail.python.org/mailman/listinfo/python-list
Re: Simple Text Processing Help
patrick.waldo wrote:

> manipulation? Also, I conceptually get it, but would you mind walking me
> through
>
>     for key, group in groupby(instream, unicode.isspace):
>         if not key:
>             yield "".join(group)

itertools.groupby() splits a sequence into groups with the same key; e.g. to group names by their first letter you'd do the following:

>>> def first_letter(s):
...     return s[:1]
...
>>> for key, group in groupby(["Anne", "Andrew", "Bill", "Brett", "Alex"], first_letter):
...     print "--- %s ---" % key
...     for item in group:
...         print item
...
--- A ---
Anne
Andrew
--- B ---
Bill
Brett
--- A ---
Alex

Note that there are two groups with the same initial; groupby() considers only consecutive items in the sequence for the same group. In your case the sequence are the lines in the file, converted to unicode strings -- the key is a boolean indicating whether the line consists entirely of whitespace or not:

>>> u"\n".isspace()
True
>>> u"alpha\n".isspace()
False

but I call it slightly differently, as an unbound method:

>>> unicode.isspace(u"alpha\n")
False

This is only possible because all items in the sequence are known to be unicode instances. So far we have, using a list instead of a file:

>>> instream = [u"alpha\n", u"beta\n", u"\n", u"gamma\n", u"\n", u"\n", u"delta\n"]
>>> for key, group in groupby(instream, unicode.isspace):
...     print "--- %s ---" % key
...     for item in group:
...         print repr(item)
...
--- False ---
u'alpha\n'
u'beta\n'
--- True ---
u'\n'
--- False ---
u'gamma\n'
--- True ---
u'\n'
u'\n'
--- False ---
u'delta\n'

As you see, groups with real data alternate with groups that contain only blank lines, and the key for the latter is True, so we can skip them with

    if not key:  # it's not a separator group
        yield group

As the final refinement we join all lines of the group into a single string:

>>> u"".join([u"alpha\n", u"beta\n"])
u'alpha\nbeta\n'

and that's it.

Peter
--
http://mail.python.org/mailman/listinfo/python-list
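The walkthrough above condenses into a small runnable sketch; plain str stands in for unicode here (an assumption of mine so it also runs on Python 3, while the thread itself uses Python 2):

```python
from itertools import groupby

def records(lines):
    # Group consecutive lines by whether they are blank; keep only the
    # non-blank groups, each joined back into a single string.
    for is_blank, group in groupby(lines, str.isspace):
        if not is_blank:
            yield "".join(group)

sample = ["alpha\n", "beta\n", "\n", "gamma\n", "\n", "\n", "delta\n"]
print(list(records(sample)))  # ['alpha\nbeta\n', 'gamma\n', 'delta\n']
```

Runs of blank lines of any length act as one separator, which is exactly why groupby is a good fit here.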
Re: Simple Text Processing Help
And now for something completely different... I see a lot of COM stuff with Python for excel...and I quickly made the same program output to excel. What if the input file were a Word document? Where is there information about manipulating word documents, or what could I add to make the same program work for word? Again thanks a lot. I'll start hitting some books about this sort of text manipulation. The Excel add on:

import codecs
import re
from win32com.client import Dispatch

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r', 'utf8')
output = codecs.open(path2, 'w', 'utf8')

NR_RE = re.compile(r'^\d+-\d+-\d+$')  # pattern for EINECS number

tokens = input.read().split()

def iter_elements(tokens):
    product = []
    for tok in tokens:
        if NR_RE.match(tok) and len(product) >= 4:
            product[2:-1] = [' '.join(product[2:-1])]
            yield product
            product = []
        product.append(tok)
    yield product

xlApp = Dispatch("Excel.Application")
xlApp.Visible = 1
xlApp.Workbooks.Add()
c = 1
for element in iter_elements(tokens):
    xlApp.ActiveSheet.Cells(c,1).Value = element[0]
    xlApp.ActiveSheet.Cells(c,2).Value = element[1]
    xlApp.ActiveSheet.Cells(c,3).Value = element[2]
    xlApp.ActiveSheet.Cells(c,4).Value = element[3]
    c = c + 1
xlApp.ActiveWorkbook.Close(SaveChanges=1)
xlApp.Quit()
xlApp.Visible = 0
del xlApp

input.close()
output.close()
--
http://mail.python.org/mailman/listinfo/python-list
Re: Simple Text Processing Help
lines = open('your_file.txt').readlines()[:4]
print lines
print map(len, lines)

gave me:

['\xef\xbb\xbf200-720-769-93-2\n', 'kyselina mo\xc4\x8dov \xc3\xa1 C5H4N4O3\n', '\n', '200-001-8\t50-00-0\n']
[28, 32, 1, 18]

I think it means that I'm still at option 3. I got the line by line part. My code is a lot cleaner now:

import codecs

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r', 'utf8')
output = codecs.open(path2, 'w', 'utf8')

for line in input:
    tokens = line.strip().split()
    tokens[2:-1] = [u' '.join(tokens[2:-1])]  # this doesn't seem to combine the fields correctly
    file = u'|'.join(tokens)  # this does put '|' in between
    print file + u'\n'
    output.write(file + u'\r\n')

input.close()
output.close()

my sample input file looks like this (not organized, as you see it):

200-720-7 69-93-2
kyselina mocová C5H4N4O3

200-001-8 50-00-0
formaldehyd CH2O

200-002-3 50-01-1
guanidínium-chlorid CH5N3.ClH

etc... and after the program I get:

200-720-7|69-93-2|
kyselina|mocová||C5H4N4O3
200-001-8|50-00-0|
formaldehyd|CH2O|
200-002-3|50-01-1|
guanidínium-chlorid|CH5N3.ClH|

etc... So, I am sort of back at the start again. If I add:

    tokens = line.strip().split()
    for token in tokens:
        print token

I get all the single tokens, which I thought I could then put together, except when I did:

    for token in tokens:
        s = u'|'.join(token)
        print s

I got ?|2|0|0|-|7|2|0|-|7, etc... How can I join these together into nice neat little lines? When I try to store the tokens in a list, the tokens double and I don't know why. I can work on getting the chemical names together after...baby steps, or maybe I am just missing something obvious. The first two numbers will always be the same: three digits-three digits-one digit and then two digits-two digits-one digit. This seems to be the only pattern.
My intuition tells me that I need to add an if statement that says: if the first two numbers follow the pattern, then continue; if they don't (i.e. a chemical name was accidentally split apart) then the third entry needs to be put together. Something like:

if tokens[1] and tokens[2] startswith('pattern') == true:
    tokens[2] = join(tokens[2]:tokens[3])
    token[3] = token[4]
    del token[4]

but the code isn't right...any ideas? Again, thanks so much. I've gone to http://gnosis.cx/TPiP/ and I have a couple O'Reilly books, but they don't seem to have a straightforward example for this kind of text manipulation. Patrick

On Oct 14, 11:17 pm, John Machin [EMAIL PROTECTED] wrote: On Oct 14, 11:48 pm, [EMAIL PROTECTED] wrote: Hi all, I started Python just a little while ago and I am stuck on something that is really simple, but I just can't figure out. Essentially I need to take a text document with some chemical information in Czech and organize it into another text file. The information is always EINECS number, CAS, chemical name, and formula in tables. I need to organize them into lines with | in between. So it goes from: 200-763-1 71-73-8 nátrium-tiopentál C11H18N2O2S.Na to: 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na but if I have a chemical like: kyselina močová I get: 200-720-7|69-93-2|kyselina|močová |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál and then it is all off. How can I get Python to realize that a chemical name may have a space in it? Your input file could be in one of THREE formats:

(1) fields are separated by TAB characters (represented in Python by the escape sequence '\t', and equivalent to '\x09')
(2) fields are fixed width and padded with spaces
(3) fields are separated by a random number of whitespace characters (and can contain spaces).

What makes you sure that you have format 3?
You might like to try something like lines = open('your_file.txt').readlines()[:4] print lines print map(len, lines) This will print a *precise* representation of what is in the first four lines, plus their lengths. Please show us the output. -- http://mail.python.org/mailman/listinfo/python-list
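The pattern test the post above gropes toward can be written with two small regexes. The EINECS shape (three digits, three digits, one digit) is taken from the post; allowing 2 to 7 leading digits for CAS numbers is my assumption beyond what the thread states (the sample CAS numbers all happen to start with two digits):

```python
import re

EINECS_RE = re.compile(r'^\d{3}-\d{3}-\d$')  # e.g. 200-001-8
CAS_RE = re.compile(r'^\d{2,7}-\d{2}-\d$')   # e.g. 50-00-0

def starts_record(tok, nxt):
    # A record starts where an EINECS-shaped token is
    # immediately followed by a CAS-shaped token.
    return bool(EINECS_RE.match(tok) and CAS_RE.match(nxt))

print(starts_record('200-001-8', '50-00-0'))  # True
print(starts_record('kyselina', 'mocova'))    # False
```

Scanning the token stream with such a test is one way to find record boundaries even when a name was split across several tokens.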
Re: Simple Text Processing Help
On Mon, 15 Oct 2007 10:47:16 +0000, patrick.waldo wrote:

> my sample input file looks like this (not organized, as you see it):
>
> 200-720-7 69-93-2
> kyselina mocová C5H4N4O3
>
> 200-001-8 50-00-0
> formaldehyd CH2O
>
> 200-002-3 50-01-1
> guanidínium-chlorid CH5N3.ClH
>
> etc...

That's quite irregular so it is not that straightforward. One way is to split everything into words, start a record by taking the first two elements and then look for the start of the next record that looks like three numbers concatenated by '-' characters. Quick and dirty hack:

import codecs
import re

NR_RE = re.compile(r'^\d+-\d+-\d+$')

def iter_elements(tokens):
    tokens = iter(tokens)
    try:
        nr_a = tokens.next()
        while True:
            nr_b = tokens.next()
            items = list()
            for item in tokens:
                if NR_RE.match(item):
                    yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])
                    nr_a = item
                    break
                else:
                    items.append(item)
    except StopIteration:
        yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])

def main():
    in_file = codecs.open('test.txt', 'r', 'utf-8')
    tokens = in_file.read().split()
    in_file.close()
    for element in iter_elements(tokens):
        print '|'.join(element)

Ciao,
Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list
Re: Simple Text Processing Help
On Oct 15, 12:20 pm, Marc 'BlackJack' Rintsch [EMAIL PROTECTED] wrote:

> On Mon, 15 Oct 2007 10:47:16 +0000, patrick.waldo wrote:
>
> > my sample input file looks like this (not organized, as you see it):
> >
> > 200-720-7 69-93-2
> > kyselina mocová C5H4N4O3
> >
> > 200-001-8 50-00-0
> > formaldehyd CH2O
> >
> > 200-002-3 50-01-1
> > guanidínium-chlorid CH5N3.ClH
> >
> > etc...
>
> That's quite irregular so it is not that straightforward. One way is to
> split everything into words, start a record by taking the first two
> elements and then look for the start of the next record that looks like
> three numbers concatenated by '-' characters. Quick and dirty hack:
>
> import codecs
> import re
>
> NR_RE = re.compile(r'^\d+-\d+-\d+$')
>
> def iter_elements(tokens):
>     tokens = iter(tokens)
>     try:
>         nr_a = tokens.next()
>         while True:
>             nr_b = tokens.next()
>             items = list()
>             for item in tokens:
>                 if NR_RE.match(item):
>                     yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])
>                     nr_a = item
>                     break
>                 else:
>                     items.append(item)
>     except StopIteration:
>         yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])

Maybe this is a bit more readable?

def iter_elements(tokens):
    chem = []
    for tok in tokens:
        if NR_RE.match(tok) and len(chem) >= 4:
            chem[2:-1] = [' '.join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)
    yield chem

--
Paul Hankin
--
http://mail.python.org/mailman/listinfo/python-list
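Paul's generator can be exercised on the sample tokens. One caveat: as posted, the final yield hands back the last record with its name words still unjoined, so this sketch adds the join for the final record too (a small fix of mine, bringing it in line with Marc's version) and uses Python 3 print:

```python
import re

NR_RE = re.compile(r'^\d+-\d+-\d+$')

def iter_elements(tokens):
    chem = []
    for tok in tokens:
        # a new dash-separated number starts the next record once the
        # current one already has its minimum four fields
        if NR_RE.match(tok) and len(chem) >= 4:
            chem[2:-1] = [' '.join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)
    chem[2:-1] = [' '.join(chem[2:-1])]  # join the final record's name too
    yield chem

tokens = ("200-720-7 69-93-2 kyselina mocova C5H4N4O3 "
          "200-001-8 50-00-0 formaldehyd CH2O").split()
for chem in iter_elements(tokens):
    print('|'.join(chem))
# 200-720-7|69-93-2|kyselina mocova|C5H4N4O3
# 200-001-8|50-00-0|formaldehyd|CH2O
```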
Re: Simple Text Processing Help
patrick.waldo wrote:

> my sample input file looks like this (not organized, as you see it):
>
> 200-720-7 69-93-2
> kyselina mocová C5H4N4O3
>
> 200-001-8 50-00-0
> formaldehyd CH2O
>
> 200-002-3 50-01-1
> guanidínium-chlorid CH5N3.ClH

Assuming that the records are always separated by blank lines and only the third field in a record may contain spaces, the following might work:

import codecs
from itertools import groupby

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"

def fields(s):
    parts = s.split()
    return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]

def records(instream):
    for key, group in groupby(instream, unicode.isspace):
        if not key:
            yield "".join(group)

if __name__ == "__main__":
    outstream = codecs.open(path2, 'w', 'utf8')
    for record in records(codecs.open(path, "r", "utf8")):
        outstream.write("|".join(fields(record)))
        outstream.write("\n")

Peter
--
http://mail.python.org/mailman/listinfo/python-list
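The fields() helper is the part that absorbs a multi-word name: the first two tokens and the last are atomic, and everything in between is joined back up. On one record it behaves like this (sample record taken from the thread's data, with plain str for brevity):

```python
def fields(record):
    # first two tokens and the last are atomic; everything in
    # between is the (possibly multi-word) chemical name
    parts = record.split()
    return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]

rec = "200-720-7 69-93-2\nkyselina mocova C5H4N4O3\n"
print("|".join(fields(rec)))  # 200-720-7|69-93-2|kyselina mocova|C5H4N4O3
```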
Re: Simple Text Processing Help
Wow, thank you all. All three work. To output correctly I needed to add:

output.write("\r\n")

This is really a great help!! Because of my limited Python knowledge, I will need to try to figure out exactly how they work for future text manipulation and for my own knowledge. Could you recommend some resources for this kind of text manipulation? Also, I conceptually get it, but would you mind walking me through

for tok in tokens:
    if NR_RE.match(tok) and len(chem) >= 4:
        chem[2:-1] = [' '.join(chem[2:-1])]
        yield chem
        chem = []
    chem.append(tok)

and

for key, group in groupby(instream, unicode.isspace):
    if not key:
        yield "".join(group)

Thanks again, Patrick

On Oct 15, 2:16 pm, Peter Otten [EMAIL PROTECTED] wrote:

> patrick.waldo wrote:
>
> > my sample input file looks like this (not organized, as you see it):
> >
> > 200-720-7 69-93-2
> > kyselina mocová C5H4N4O3
> >
> > 200-001-8 50-00-0
> > formaldehyd CH2O
> >
> > 200-002-3 50-01-1
> > guanidínium-chlorid CH5N3.ClH
>
> Assuming that the records are always separated by blank lines and only the
> third field in a record may contain spaces, the following might work:
>
> import codecs
> from itertools import groupby
>
> path = "c:\\text_samples\\chem_1_utf8.txt"
> path2 = "c:\\text_samples\\chem_2.txt"
>
> def fields(s):
>     parts = s.split()
>     return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]
>
> def records(instream):
>     for key, group in groupby(instream, unicode.isspace):
>         if not key:
>             yield "".join(group)
>
> if __name__ == "__main__":
>     outstream = codecs.open(path2, 'w', 'utf8')
>     for record in records(codecs.open(path, "r", "utf8")):
>         outstream.write("|".join(fields(record)))
>         outstream.write("\n")
>
> Peter

--
http://mail.python.org/mailman/listinfo/python-list
Re: Simple Text Processing Help
On Oct 15, 10:08 pm, [EMAIL PROTECTED] wrote:

> Because of my limited Python knowledge, I will need to try to figure out
> exactly how they work for future text manipulation and for my own
> knowledge. Could you recommend some resources for this kind of text
> manipulation? Also, I conceptually get it, but would you mind walking me
> through
>
> for tok in tokens:
>     if NR_RE.match(tok) and len(chem) >= 4:
>         chem[2:-1] = [' '.join(chem[2:-1])]
>         yield chem
>         chem = []
>     chem.append(tok)

Sure: 'chem' is a list of all the data associated with one chemical. When a token (tok) arrives that is matched by NR_RE (i.e. three runs of digits separated by hyphens), it's assumed that this is the start of a new chemical if we've already got 4 pieces of data. Then, we join the name back up (as was explained in earlier posts), and 'yield chem' yields up the chemical so far; and a new chemical is started (by emptying the list). Whatever tok is, it's added to the end of the current chemical data. Add some print statements in to watch it work if you can't get it. This code uses exactly the same algorithm as Marc's code - it's just a bit clearer (or at least, I thought so). Oh, and it returns a list rather than a tuple, but that makes no difference.

--
Paul Hankin
--
http://mail.python.org/mailman/listinfo/python-list
Re: Simple Text Processing Help
On Oct 14, 8:48 am, [EMAIL PROTECTED] wrote: Hi all, I started Python just a little while ago and I am stuck on something that is really simple, but I just can't figure out. Essentially I need to take a text document with some chemical information in Czech and organize it into another text file. The information is always EINECS number, CAS, chemical name, and formula in tables. I need to organize them into lines with | in between. So it goes from: 200-763-1 71-73-8 nátrium-tiopentál C11H18N2O2S.Na to: 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na but if I have a chemical like: kyselina močová I get: 200-720-7|69-93-2|kyselina|močová |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál and then it is all off. Pyparsing might be overkill for this example, but it is a good sample for a demo. If you end up doing lots of data extraction like this, pyparsing is a useful tool. In pyparsing, you define expressions using pyparsing classes and built-in strings, then use the constructed pyparsing expression to parse the data (using parseString, scanString, searchString, or transformString). In this example, searchString is the easiest to use. After the parsing is done, the parsed fields are returned in a ParseResults object, which has some list and some dict style behavior. I've given each field a name based on your post, so that you can read the tokens right out of the results as if they were attributes of an object. This example emits your '|' delimited data, but the commented lines show how you could access the individually parsed fields, too. Learn more about pyparsing at http://pyparsing.wikispaces.com/ . 
-- Paul

# -*- coding: iso-8859-15 -*-

data = """200-720-7 69-93-2
kyselina mocová C5H4N4O3

200-001-8 50-00-0
formaldehyd CH2O

200-002-3 50-01-1
guanidínium-chlorid CH5N3.ClH"""

from pyparsing import Word, nums, OneOrMore, alphas, alphas8bit

# define expressions for each part in the input data

# a numeric id starts with a number, and is followed by
# any number of numbers or '-'s
numericId = Word(nums, nums+"-")

# a chemical name is one or more words, each made up of
# alphas (including 8-bit alphas) or '-'s
chemName = OneOrMore(Word(alphas.lower()+alphas8bit.lower()+"-"))

# when returning the chemical name, rejoin the separate
# words into a single string, with spaces
chemName.setParseAction(lambda t: " ".join(t))

# a chemical formula is a 'word' starting with an uppercase
# alpha, followed by uppercase alphas or numbers
chemFormula = Word(alphas.upper(), alphas.upper()+nums)

# put all expressions into overall form, and attach field names
entry = numericId("EINECS") + \
        numericId("CAS") + \
        chemName("name") + \
        chemFormula("formula")

# search through input data, and print out retrieved data
for chemData in entry.searchString(data):
    print "%(EINECS)s|%(CAS)s|%(name)s|%(formula)s" % chemData
    # or print each field by itself
    # print chemData.EINECS
    # print chemData.CAS
    # print chemData.name
    # print chemData.formula
    # print

prints:

200-720-7|69-93-2|kyselina mocová|C5H4N4O3
200-001-8|50-00-0|formaldehyd|CH2O
200-002-3|50-01-1|guanidínium-chlorid|CH5N3

--
http://mail.python.org/mailman/listinfo/python-list
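pyparsing is a third-party package; for readers without it, roughly the same extraction can be sketched with the stdlib re module (this mirrors the grammar above under the same assumptions: names are lowercase words that may contain spaces but no digits, and formulas start with an uppercase letter):

```python
import re

data = """200-720-7 69-93-2
kyselina mocova C5H4N4O3

200-001-8 50-00-0
formaldehyd CH2O"""

# two dash-separated numbers, then a lazily matched name (non-digits,
# may span spaces), then a formula starting with an uppercase letter
RECORD_RE = re.compile(
    r'(\d+-\d+-\d+)\s+(\d+-\d+-\d+)\s+(\D+?)\s+([A-Z][A-Za-z0-9.]*)')

for einecs, cas, name, formula in RECORD_RE.findall(data):
    print("%s|%s|%s|%s" % (einecs, cas, name, formula))
# 200-720-7|69-93-2|kyselina mocova|C5H4N4O3
# 200-001-8|50-00-0|formaldehyd|CH2O
```

The lazy `\D+?` is what keeps the name from swallowing the formula: it grows only until the uppercase-led formula token can match.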
Simple Text Processing Help
Hi all, I started Python just a little while ago and I am stuck on something that is really simple, but I just can't figure out. Essentially I need to take a text document with some chemical information in Czech and organize it into another text file. The information is always EINECS number, CAS, chemical name, and formula in tables. I need to organize them into lines with | in between. So it goes from:

200-763-1 71-73-8 nátrium-tiopentál C11H18N2O2S.Na

to:

200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na

but if I have a chemical like: kyselina močová I get:

200-720-7|69-93-2|kyselina|močová
|C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál

and then it is all off. How can I get Python to realize that a chemical name may have a space in it? Thank you, Patrick

So far I have:

#take tables in one text file and organize them into lines in another
import codecs

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r', 'utf8')
output = codecs.open(path2, 'w', 'utf8')

#read and enter into a list
chem_file = []
chem_file.append(input.read())

#split words and store them in a list
for word in chem_file:
    words = word.split()

#starting values in list
e=0   #EINECS
c=1   #CAS
ch=2  #chemical name
f=3   #formula
n=0
loop=1
x=len(words)  #counts how many words there are in the file
print '-'*100
while loop==1:
    if n<x and f<=x:
        print words[e], '|', words[c], '|', words[ch], '|', words[f], '\n'
        output.write(words[e])
        output.write('|')
        output.write(words[c])
        output.write('|')
        output.write(words[ch])
        output.write('|')
        output.write(words[f])
        output.write('\r\n')
        #increase variables by 4 to get next set
        e = e + 4
        c = c + 4
        ch = ch + 4
        f = f + 4
        # increase by 1 to repeat
        n=n+1
    else:
        loop=0

input.close()
output.close()
--
http://mail.python.org/mailman/listinfo/python-list
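The failure mode of the fixed stride of 4 in the code above can be seen directly: as soon as one name contains a space, every later field shifts by one (sample tokens taken from the post):

```python
words = "200-720-7 69-93-2 kyselina mocova C5H4N4O3 200-763-1 71-73-8".split()

# taking fields in fixed groups of four goes wrong here: the fourth
# "field" is the second half of the name, not the formula
print(words[0:4])  # ['200-720-7', '69-93-2', 'kyselina', 'mocova']
print(words[4:8])  # ['C5H4N4O3', '200-763-1', '71-73-8']
```

Every record after the first multi-word name inherits the offset, which is exactly the "all off" symptom described.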
Re: Simple Text Processing Help
On Sun, 14 Oct 2007 13:48:51 +0000, patrick.waldo wrote:

> Essentially I need to take a text document with some chemical information
> in Czech and organize it into another text file. The information is
> always EINECS number, CAS, chemical name, and formula in tables. I need
> to organize them into lines with | in between. So it goes from:
>
> 200-763-1 71-73-8 nátrium-tiopentál C11H18N2O2S.Na

Is that in *one* line in the input file or two lines like shown here?

> to:
> 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
>
> but if I have a chemical like: kyselina močová I get:
> 200-720-7|69-93-2|kyselina|močová
> |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
>
> and then it is all off. How can I get Python to realize that a chemical
> name may have a space in it?

If the two elements before and the one element after the name can't contain spaces it is easy: take the first two and the last as they are, and for the name join the elements from the third to the next-to-last with a space.

In [202]: parts = '123 456 a name with spaces 789'.split()
In [203]: parts[0]
Out[203]: '123'
In [204]: parts[1]
Out[204]: '456'
In [205]: ' '.join(parts[2:-1])
Out[205]: 'a name with spaces'
In [206]: parts[-1]
Out[206]: '789'

This works too if the name doesn't have a space in it:

In [207]: parts = '123 456 name 789'.split()
In [208]: parts[0]
Out[208]: '123'
In [209]: parts[1]
Out[209]: '456'
In [210]: ' '.join(parts[2:-1])
Out[210]: 'name'
In [211]: parts[-1]
Out[211]: '789'

> #read and enter into a list
> chem_file = []
> chem_file.append(input.read())

This reads the whole file and puts it into a list. This list will *always* just contain *one* element. So why a list at all!?

> #split words and store them in a list
> for word in chem_file:
>     words = word.split()

*If* the list would contain more than one element all would be processed but only the last is bound to `words`. You could leave out `chem_file` and the loop and simply do:

words = input.read().split()

Same effect but less chatty.
;-) The rest of the source seems to indicate that you don't really want to read in the whole input file at once but process it line by line, i.e. chemical element by chemical element. Ciao, Marc 'BlackJack' Rintsch -- http://mail.python.org/mailman/listinfo/python-list
Re: Simple Text Processing Help
On Oct 14, 2:48 pm, [EMAIL PROTECTED] wrote:

> Hi all, I started Python just a little while ago and I am stuck on
> something that is really simple, but I just can't figure out.
> Essentially I need to take a text document with some chemical
> information in Czech and organize it into another text file. The
> information is always EINECS number, CAS, chemical name, and formula in
> tables. I need to organize them into lines with | in between. So it goes
> from:
>
> 200-763-1 71-73-8 nátrium-tiopentál C11H18N2O2S.Na
>
> to:
>
> 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
>
> but if I have a chemical like: kyselina močová I get:
>
> 200-720-7|69-93-2|kyselina|močová
> |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
>
> and then it is all off. How can I get Python to realize that a chemical
> name may have a space in it?

In the original file, is every chemical on a line of its own? I assume it is here. You might use a regexp (look at the re module), or I think here you can use the fact that only chemicals have spaces in them. Then, you can split each line on whitespace (like you're doing), and join back together all the words between the 3rd (ie index 2) and the last (ie index -1) using tokens[2:-1] = [u' '.join(tokens[2:-1])]. This uses the somewhat unusual python syntax for replacing a section of a list with another list. The approach you took involves reading the whole file, and building a list of all the chemicals which you don't seem to use: I've changed it to a per-line version and removed the big lists.

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r', 'utf8')
output = codecs.open(path2, 'w', 'utf8')

for line in input:
    tokens = line.strip().split()
    tokens[2:-1] = [u' '.join(tokens[2:-1])]
    chemical = u'|'.join(tokens)
    print chemical + u'\n'
    output.write(chemical + u'\r\n')

input.close()
output.close()

Obviously, this isn't tested because I don't have your chem_1_utf8.txt file.
-- Paul Hankin -- http://mail.python.org/mailman/listinfo/python-list
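To make the slice-assignment trick above concrete, here is a small stand-alone sketch (the sample row is invented, and it is written in modern Python 3, so the u'' prefixes aren't needed):

```python
# Hypothetical row: EINECS number, CAS number, a two-word chemical
# name, and a formula, already split on whitespace.
tokens = ['200-720-7', '69-93-2', 'kyselina', 'mocova', 'C5H4N4O3']

# Replace the name words (index 2 up to, but not including, the last
# token) with a one-element list holding the joined name.  Slice
# assignment splices that list back into place.
tokens[2:-1] = [' '.join(tokens[2:-1])]

record = '|'.join(tokens)
print(record)  # 200-720-7|69-93-2|kyselina mocova|C5H4N4O3
```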
Re: Simple Text Processing Help
Thank you both for helping me out. I am still rather new to Python and so I'm probably trying to reinvent the wheel here. When I try to do Paul's response, I get:

    tokens = line.strip().split()
    []

So I am not quite sure how to read line by line. tokens = input.read().split() gets me all the information from the file. tokens[2:-1] = [u' '.join(tokens[2:-1])] works just fine, like in the example; however, how can I loop this for the entire document? Also, when I try output.write(tokens), I get:

    TypeError: coercing to Unicode: need string or buffer, list found

Any ideas?

On Oct 14, 4:25 pm, Paul Hankin [EMAIL PROTECTED] wrote: On Oct 14, 2:48 pm, [EMAIL PROTECTED] wrote: Hi all, I started Python just a little while ago and I am stuck on something that is really simple, but I just can't figure out. Essentially I need to take a text document with some chemical information in Czech and organize it into another text file. The information is always EINECS number, CAS, chemical name, and formula in tables. I need to organize them into lines with | in between. So it goes from: 200-763-1 71-73-8 nátrium-tiopentál C11H18N2O2S.Na to: 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na but if I have a chemical like: kyselina močová I get: 200-720-7|69-93-2|kyselina|močová|C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál and then it is all off. How can I get Python to realize that a chemical name may have a space in it? In the original file, is every chemical on a line of its own? I assume it is here. You might use a regexp (look at the re module), or I think here you can use the fact that only chemicals have spaces in them. Then, you can split each line on whitespace (like you're doing), and join back together all the words between the 3rd (ie index 2) and the last (ie index -1) using tokens[2:-1] = [u' '.join(tokens[2:-1])]. This uses the somewhat unusual python syntax for replacing a section of a list with another list.
The approach you took involves reading the whole file, and building a list of all the chemicals which you don't seem to use: I've changed it to a per-line version and removed the big lists.

    path = 'c:\\text_samples\\chem_1_utf8.txt'
    path2 = 'c:\\text_samples\\chem_2.txt'
    input = codecs.open(path, 'r', 'utf8')
    output = codecs.open(path2, 'w', 'utf8')
    for line in input:
        tokens = line.strip().split()
        tokens[2:-1] = [u' '.join(tokens[2:-1])]
        chemical = u'|'.join(tokens)
        print chemical + u'\n'
        output.write(chemical + u'\r\n')
    input.close()
    output.close()

Obviously, this isn't tested because I don't have your chem_1_utf8.txt file. -- Paul Hankin -- http://mail.python.org/mailman/listinfo/python-list
Re: Simple Text Processing Help
On Sun, 14 Oct 2007 16:57:06 +, patrick.waldo wrote:

> Thank you both for helping me out. I am still rather new to Python and so I'm probably trying to reinvent the wheel here. When I try to do Paul's response, I get tokens = line.strip().split() []

What is in `line`? Paul wrote this in the body of the ``for`` loop over all the lines in the file.

> So I am not quite sure how to read line by line.

That's what the ``for`` loop over a file or file-like object is doing. Maybe you should develop your script in smaller steps and do some printing to see what you get at each step. For example, after opening the input file:

    for line in input:
        print line    # prints the whole line
        tokens = line.split()
        print tokens  # prints a list with the split line

> tokens = input.read().split() gets me all the information from the file.

Right, it reads *all* of the file, not just one line.

> tokens[2:-1] = [u' '.join(tokens[2:-1])] works just fine, like in the example; however, how can I loop this for the entire document?

Don't read the whole file but line by line, just like Paul showed you.

> Also, when I try output.write(tokens), I get TypeError: coercing to Unicode: need string or buffer, list found.

`tokens` is a list but you need to write a unicode string. So you have to reassemble the parts with '|' characters in between. Also shown by Paul.

Ciao, Marc 'BlackJack' Rintsch -- http://mail.python.org/mailman/listinfo/python-list
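A tiny illustration of that last point (the token values are invented): write() wants a string, so the token list has to be joined before it is written out.

```python
tokens = ['200-763-1', '71-73-8', 'natrium-tiopental', 'C11H18N2O2S.Na']

# output.write(tokens) would raise TypeError: write() needs a string,
# not a list.  Join the fields with '|' and add a newline first.
record = '|'.join(tokens) + '\n'
```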
Re: Simple Text Processing Help
On Oct 14, 11:48 pm, [EMAIL PROTECTED] wrote:

 Hi all, I started Python just a little while ago and I am stuck on something that is really simple, but I just can't figure out. Essentially I need to take a text document with some chemical information in Czech and organize it into another text file. The information is always EINECS number, CAS, chemical name, and formula in tables. I need to organize them into lines with | in between. So it goes from: 200-763-1 71-73-8 nátrium-tiopentál C11H18N2O2S.Na to: 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na but if I have a chemical like: kyselina močová I get: 200-720-7|69-93-2|kyselina|močová|C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál and then it is all off. How can I get Python to realize that a chemical name may have a space in it?

Your input file could be in one of THREE formats:

(1) fields are separated by TAB characters (represented in Python by the escape sequence '\t', and equivalent to '\x09')
(2) fields are fixed width and padded with spaces
(3) fields are separated by a random number of whitespace characters (and can contain spaces)

What makes you sure that you have format 3? You might like to try something like:

    lines = open('your_file.txt').readlines()[:4]
    print lines
    print map(len, lines)

This will print a *precise* representation of what is in the first four lines, plus their lengths. Please show us the output. -- http://mail.python.org/mailman/listinfo/python-list
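Here is what that diagnostic looks like in practice, with an in-memory stand-in for the real file (the tab-separated sample line is an assumption for illustration):

```python
# Stand-in for open('your_file.txt').readlines()[:4]
lines = ['200-763-1\t71-73-8\tnatrium-tiopental\tC11H18N2O2S.Na\n']

# repr() makes tabs and newlines visible as \t and \n, which is how
# you tell a tab-separated file from a space-padded one at a glance.
previews = [repr(line) for line in lines]
lengths = [len(line) for line in lines]
print(previews, lengths)
```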
Re: Text processing and file creation
On Sep 7, 3:50 am, George Sakkis [EMAIL PROTECTED] wrote: On Sep 5, 5:17 pm, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: If this was a code golf challenge, I'd choose the Unix split solution and be both maintainable as well as concise :-) - Paddy. -- http://mail.python.org/mailman/listinfo/python-list
Re: Text processing and file creation
Thanks for making me aware of the (UNIX) split command (split -l 5 inFile.txt), it's short, it's fast, it's beautiful. I am still wondering how to do this efficiently in Python (being kind of new to it... and it's not for homework).

Something like this should do the job:

    def nlines(num, fileobj):
        done = [False]
        def doit():
            for i in xrange(num):
                l = fileobj.readline()
                if not l:
                    done[0] = True
                    return
                yield l
        while not done[0]:
            yield doit()

    for i, group in enumerate(nlines(5, open('bigfile.txt'))):
        out = open('chunk_%d.txt' % i, 'w')
        for line in group:
            out.write(line)

This is just one way of doing it, but not as concise as using split... Alberto -- http://mail.python.org/mailman/listinfo/python-list
Re: Text processing and file creation
[EMAIL PROTECTED] escribió:

 I am still wondering how to do this efficiently in Python (being kind of new to it... and it's not for homework).

You should post some code anyway, it would be easier to give useful advice (it would also demonstrate that you put some effort on it). Anyway, here is an option. Text-file objects are line-iterable, so you could use itertools (perhaps a bit difficult module for a newbie...):

    from itertools import islice, takewhile, repeat

    def take(it, n):
        return list(islice(it, n))

    def readnlines(fd, n):
        return takewhile(bool, (take(fd, n) for _ in repeat(None)))

    def splitfile(path, prefix, nlines, suffix_digits):
        sformat = '%%0%dd' % suffix_digits
        for index, lines in enumerate(readnlines(file(path), nlines)):
            open('%s_%s' % (prefix, sformat % index), 'w').writelines(lines)

    splitfile('/etc/services', 'out', 5, 4)

arnau -- http://mail.python.org/mailman/listinfo/python-list
Re: Text processing and file creation
Here's my solution, for what it's worth:

    #!/usr/bin/env python

    import os

    input = open('test.txt', 'r')
    counter = 0
    fileNum = 0
    fileName = ''

    def newFileName():
        global fileNum, fileName
        while os.path.exists(fileName) or fileName == '':
            fileNum += 1
            x = '%0.5d' % fileNum
            fileName = '%s.tmp' % x
        return fileName

    for line in input:
        if (fileName == '') or (counter == 5):
            if fileName:
                output.close()
            fileName = newFileName()
            counter = 0
            output = open(fileName, 'w')
        output.write(line)
        counter += 1

    output.close()

-- http://mail.python.org/mailman/listinfo/python-list
Re: Text processing and file creation
On Sep 5, 5:17 pm, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 Thanks for making me aware of the (UNIX) split command (split -l 5 inFile.txt), it's short, it's fast, it's beautiful. I am still wondering how to do this efficiently in Python (being kind of new to it... and it's not for homework). -- Martin.

If this was a code golf challenge, a decent entry (146 chars) could be:

    import itertools as it
    for i,g in it.groupby(enumerate(open('input.txt')),lambda(i,_):i/5):open("output.%d.txt"%i,'w').writelines(s for _,s in g)

or a bit less cryptically:

    import itertools as it

    for chunk, enum_lines in it.groupby(enumerate(open('input.txt')),
                                        lambda (i, line): i // 5):
        open("output.%d.txt" % chunk, 'w').writelines(line for _, line in enum_lines)

George -- http://mail.python.org/mailman/listinfo/python-list
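The same groupby idea in a self-contained, testable form (Python 3 syntax, since tuple-unpacking lambdas are gone; the in-memory list stands in for the real file):

```python
from itertools import groupby

lines = ['line %d\n' % n for n in range(12)]  # stand-in for open('input.txt')

# Group consecutive lines by index // 5: indices 0-4 become chunk 0,
# 5-9 become chunk 1, 10-11 become chunk 2.
chunks = [[line for _, line in grp]
          for _, grp in groupby(enumerate(lines), lambda pair: pair[0] // 5)]
```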
Re: Text processing and file creation
Shawn Milochik wrote:

 On 9/5/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: I have a text source file of about 20.000 lines. From this file, I like to write the first 5 lines to a new file. Close that file, grab the next 5 lines write these to a new file... grabbing 5 lines and creating new files until processing of all 20.000 lines is done. Is there an efficient way to do this in Python? In advance, thanks for your help.

Maybe (untested):

    def read5Lines(f):
        L = f.readline()
        while L:
            yield (L, f.readline(), f.readline(), f.readline(), f.readline())
            L = f.readline()

    inf = open('C:\\YourFile', 'rb')
    for fileNo, fiveLines in enumerate(read5Lines(inf)):
        out = open('C:\\OutFile' + str(fileNo), 'wb')
        out.writelines(fiveLines)
        out.close()

or something similar? (notice that in the last output file you may have a few (4 at most) blank lines) -- http://mail.python.org/mailman/listinfo/python-list
Re: Text processing and file creation
On Sep 5, 11:13 am, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: I have a text source file of about 20.000 lines.From this file, I like to write the first 5 lines to a new file. Close that file, grab the next 5 lines write these to a new file... grabbing 5 lines and creating new files until processing of all 20.000 lines is done. Is there an efficient way to do this in Python? In advance, thanks for your help. I would use a counter in a for loop using the readline method to iterate over the 20,000 line file. Reset the counter every 5 lines/ iterations and close the file. To name files with unique names, use the time module. Something like this: x = 'filename-%s.txt' % time.time() Have fun! Mike -- http://mail.python.org/mailman/listinfo/python-list
Re: Text processing and file creation
[EMAIL PROTECTED] escribió: I have a text source file of about 20.000 lines. From this file, I like to write the first 5 lines to a new file. Close that file, grab the next 5 lines write these to a new file... grabbing 5 lines and creating new files until processing of all 20.000 lines is done. Is there an efficient way to do this in Python? Perhaps you could provide some code to see how you approached it? -- http://mail.python.org/mailman/listinfo/python-list
Re: Text processing and file creation
[EMAIL PROTECTED] wrote: I would use a counter in a for loop using the readline method to iterate over the 20,000 line file. file objects are iterables themselves, so there's no need to do that by using a method. Reset the counter every 5 lines/ iterations and close the file. I'd use a generator that fetches five lines of the file per iteration and iterate over it instead of the file directly. Have fun! Definitely -- and also do your homework yourself :) Regards, Björn -- BOFH excuse #339: manager in the cable duct -- http://mail.python.org/mailman/listinfo/python-list
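One way such a generator might look (a sketch using itertools.islice; StringIO stands in for the real file, and the chunk size of 5 matches the thread's example):

```python
import io
from itertools import islice

def grouped_lines(fileobj, n=5):
    # Yield lists of up to n lines until the file is exhausted.
    while True:
        block = list(islice(fileobj, n))
        if not block:
            return
        yield block

sample = io.StringIO(''.join('row %d\n' % n for n in range(7)))
groups = list(grouped_lines(sample, 5))
```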
Re: Text processing and file creation
On 9/5/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 I have a text source file of about 20.000 lines. From this file, I like to write the first 5 lines to a new file. Close that file, grab the next 5 lines write these to a new file... grabbing 5 lines and creating new files until processing of all 20.000 lines is done. Is there an efficient way to do this in Python? In advance, thanks for your help.

I have written a working test of this. Here's the basic setup:

    open the input file
    function newFileName:
        generate a filename (starting with 0001.tmp)
        if filename exists, increment and test again (0002.tmp and so on)
        return fileName
    read a line until input file is empty:
        test to see whether I have written five lines
        if so, get a new file name, close file, and open new file
        write line to file
    close output file final time

Once you get some code running, feel free to post it and we'll help. -- http://mail.python.org/mailman/listinfo/python-list
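The newFileName step of that outline could be sketched like this (the helper name and pattern are made up for the example; a scratch directory keeps the test self-contained):

```python
import itertools
import os
import tempfile

def next_free_name(directory, pattern='%04d.tmp'):
    # Return the first numbered filename in directory that doesn't exist yet.
    for i in itertools.count(1):
        name = os.path.join(directory, pattern % i)
        if not os.path.exists(name):
            return name

# In an empty scratch directory the first free name ends in 0001.tmp.
scratch = tempfile.mkdtemp()
first = next_free_name(scratch)
```

Note that scanning from 1 each time is O(n) per call; for thousands of chunk files a simple incrementing counter (as in other replies in this thread) is cheaper.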
Re: Text processing and file creation
On Sep 5, 11:57 am, Bjoern Schliessmann usenet-[EMAIL PROTECTED] wrote:

> [EMAIL PROTECTED] wrote: I would use a counter in a for loop using the readline method to iterate over the 20,000 line file.
>
> file objects are iterables themselves, so there's no need to do that by using a method.

Very true! Darn it!

> Reset the counter every 5 lines/iterations and close the file.
>
> I'd use a generator that fetches five lines of the file per iteration and iterate over it instead of the file directly.

I still haven't figured out how to use generators, so this didn't even come to mind. I usually see something like this example for reading a file:

    f = open("somefile")
    for line in f:
        # do something

http://docs.python.org/tut/node9.html

Okay, so they didn't use readline. I wonder where I saw that.

> Have fun!
>
> Definitely -- and also do your homework yourself :)
>
> Regards, Björn -- BOFH excuse #339: manager in the cable duct

Mike -- http://mail.python.org/mailman/listinfo/python-list
Re: Text processing and file creation
[EMAIL PROTECTED] wrote: I have a text source file of about 20.000 lines. From this file, I like to write the first 5 lines to a new file. Close that file, grab the next 5 lines write these to a new file... grabbing 5 lines and creating new files until processing of all 20.000 lines is done. Is there an efficient way to do this in Python? You should use a nested loop. In advance, thanks for your help. You're welcome. -- James Stroud UCLA-DOE Institute for Genomics and Proteomics Box 951570 Los Angeles, CA 90095 http://www.jamesstroud.com/ -- http://mail.python.org/mailman/listinfo/python-list
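That nested-loop shape can be sketched minimally like this (chunks are collected into lists here instead of written to files, so the result is easy to check; the input is a stand-in for a file's line iterator):

```python
lines = iter('abcdefg')  # stand-in for iterating a file's lines

chunks = []
while True:                     # outer loop: one pass per output file
    chunk = []
    for _ in range(5):          # inner loop: up to five lines per file
        item = next(lines, None)
        if item is None:
            break
        chunk.append(item)
    if not chunk:               # input exhausted, stop
        break
    chunks.append(chunk)
```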
Re: Text processing and file creation
On Sep 5, 5:13 pm, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 I have a text source file of about 20.000 lines. From this file, I like to write the first 5 lines to a new file. Close that file, grab the next 5 lines write these to a new file... grabbing 5 lines and creating new files until processing of all 20.000 lines is done. Is there an efficient way to do this in Python? In advance, thanks for your help.

If it's on unix: use split. If it's your homework: show us what you have so far... - Paddy. -- http://mail.python.org/mailman/listinfo/python-list
Re: Text processing and file creation
On Sep 5, 1:28 pm, Paddy [EMAIL PROTECTED] wrote: On Sep 5, 5:13 pm, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: I have a text source file of about 20.000 lines.From this file, I like to write the first 5 lines to a new file. Close that file, grab the next 5 lines write these to a new file... grabbing 5 lines and creating new files until processing of all 20.000 lines is done. Is there an efficient way to do this in Python? In advance, thanks for your help. If its on unix: use split. If its your homework: show us what you have so far... - Paddy. Paddy, Thanks for making me aware of the (UNIX) split command (split -l 5 inFile.txt), it's short, it's fast, it's beautiful. I am still wondering how to do this efficiently in Python (being kind of new to it... and it's not for homework). -- Martin. I am still wondering how to do this in Python (being new to Python) -- http://mail.python.org/mailman/listinfo/python-list
Re: Text processing and file creation
On Sep 5, 5:13 pm, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 I have a text source file of about 20.000 lines. From this file, I like to write the first 5 lines to a new file. Close that file, grab the next 5 lines write these to a new file... grabbing 5 lines and creating new files until processing of all 20.000 lines is done. Is there an efficient way to do this in Python?

Sure!

 In advance, thanks for your help.

    from my_useful_functions import new_file, write_first_5_lines, \
        done_processing_file, grab_next_5_lines, another_new_file, write_these

    in_f = open('myfile')
    out_f = new_file()
    write_first_5_lines(in_f, out_f)       # write first 5 lines
    close(out_f)
    while not done_processing_file(in_f):  # until done processing
        lines = grab_next_5_lines(in_f)    # grab next 5 lines
        out_f = another_new_file()
        write_these(lines, out_f)          # write these
        close(out_f)
    print "all done!"                      # All done
    print "Now there are 4000 files in this directory..."

Python 3.0 - ready (I've used open() instead of file()) HTH -- Arnaud -- http://mail.python.org/mailman/listinfo/python-list
Re: Text processing and file creation
Arnaud Delobelle wrote:

> [...]
> print "all done!" # All done
> print "Now there are 4000 files in this directory..."
>
> Python 3.0 - ready (I've used open() instead of file())

bzzt!

    Python 3.0a1 (py3k:57844, Aug 31 2007, 16:54:27)
    ...
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print "all done!" # All done
      File "<stdin>", line 1
        print "all done!" # All done
                        ^
    SyntaxError: invalid syntax

Close, but no cigar ;-)

regards Steve -- Steve Holden +1 571 484 6266 +1 800 494 3119 Holden Web LLC/Ltd http://www.holdenweb.com Skype: holdenweb http://del.icio.us/steve.holden --- Asciimercial -- Get on the web: Blog, lens and tag the Internet Many services currently offer free registration --- Thank You for Reading - -- http://mail.python.org/mailman/listinfo/python-list
Re: Text processing and file creation
File-reading latency is mainly caused by a high frequency of read calls, so reducing the number of reads may solve your problem. You can pass a byte count to the Python file object's read() method; some large value (like 65536) can be chosen to fit your memory usage, and you can then parse lines out of the read buffer freely. Have fun!

----- Original Message ----- From: "Shawn Milochik" [EMAIL PROTECTED] To: python-list@python.org Sent: Thursday, September 06, 2007 1:03 AM Subject: Re: Text processing and file creation

 On 9/5/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: I have a text source file of about 20.000 lines. From this file, I like to write the first 5 lines to a new file. Close that file, grab the next 5 lines write these to a new file... grabbing 5 lines and creating new files until processing of all 20.000 lines is done. Is there an efficient way to do this in Python? In advance, thanks for your help.

 I have written a working test of this. Here's the basic setup: open the input file. function newFileName: generate a filename (starting with 1.tmp). If filename exists, increment and test again (0002.tmp and so on). return fileName. read a line until input file is empty: test to see whether I have written five lines. If so, get a new file name, close file, and open new file. write line to file. close output file final time.

 Once you get some code running, feel free to post it and we'll help. -- http://mail.python.org/mailman/listinfo/python-list
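A rough illustration of that buffered-read idea (StringIO stands in for a real file, and 64 KiB is an arbitrary buffer size):

```python
import io

src = io.StringIO('alpha\nbeta\ngamma\n')

pieces = []
while True:
    data = src.read(65536)   # one big read instead of many small ones
    if not data:
        break
    pieces.append(data)

# Parse lines out of the accumulated buffer; keepends=True preserves
# the newline characters, matching what file iteration would yield.
lines = ''.join(pieces).splitlines(True)
```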
Re: Text processing and file creation
On Sep 6, 12:46 am, Steve Holden [EMAIL PROTECTED] wrote: Arnaud Delobelle wrote: [...] print all done! # All done print Now there are 4000 files in this directory... Python 3.0 - ready (I've used open() instead of file()) bzzt! Python 3.0a1 (py3k:57844, Aug 31 2007, 16:54:27) ... Type help, copyright, credits or license for more information. print all done! # All done File stdin, line 1 print all done! # All done ^ SyntaxError: invalid syntax Damn! That'll teach me to make such bold claims. At least I'm unlikely to forget again now... -- Arnaud -- http://mail.python.org/mailman/listinfo/python-list
Re: On text processing
I'm in a process of rewriting a bash/awk/sed script -- that grew too big -- in Python. I can rewrite it in a simple line-by-line way but that results in ugly Python code and I'm sure there is a simple pythonic way. The bash script processed text files of the form:

    ###
    key1 value1
    key2 value2
    key3 value3
    key4 value4
    spec11 spec12 spec13 spec14
    spec21 spec22 spec23 spec24
    spec31 spec32 spec33 spec34

    key5 value5
    key6 value6
    key7 value7
    more11 more12 more13
    more21 more22 more23

    key8 value8
    ###

I guess you get the point. If a line has two entries, it is a key/value pair which should end up in a dictionary. If a key/value pair is followed by consecutive lines with more than two entries, it is a matrix that should end up in a list of lists (matrix) that can be identified by the key preceding it. The empty line after the last line of a matrix signifies that the matrix is finished and we are back to a key/value situation. Note that a matrix is always preceded by a key/value pair so that it can really be identified by the key. Any elegant solution for this?
My solution expects correctly formatted input and parses it into separate key/value and matrix holding dicts:

    from StringIO import StringIO

    fileText = '''\
    key1 value1
    key2 value2
    key3 value3
    key4 value4
    spec11 spec12 spec13 spec14
    spec21 spec22 spec23 spec24
    spec31 spec32 spec33 spec34

    key5 value5
    key6 value6
    key7 value7
    more11 more12 more13
    more21 more22 more23

    key8 value8
    '''

    infile = StringIO(fileText)
    keyvalues = {}
    matrices = {}
    for line in infile:
        fields = line.strip().split()
        if len(fields) == 2:
            keyvalues[fields[0]] = fields[1]
            lastkey = fields[0]
        elif fields:
            matrices.setdefault(lastkey, []).append(fields)

Here is the sample output:

    >>> from pprint import pprint as pp
    >>> pp(keyvalues)
    {'key1': 'value1',
     'key2': 'value2',
     'key3': 'value3',
     'key4': 'value4',
     'key5': 'value5',
     'key6': 'value6',
     'key7': 'value7',
     'key8': 'value8'}
    >>> pp(matrices)
    {'key4': [['spec11', 'spec12', 'spec13', 'spec14'],
              ['spec21', 'spec22', 'spec23', 'spec24'],
              ['spec31', 'spec32', 'spec33', 'spec34']],
     'key7': [['more11', 'more12', 'more13'],
              ['more21', 'more22', 'more23']]}

Paddy, thanks, this looks even better. Paul, pyparsing looks like overkill; even the config parser module is too complex for me for such a simple task. The text files are actually input files to a program and will never be longer than 20-30 lines, so Paddy's solution is perfectly fine. In any case it's good to know that there exists a module called pyparsing :) -- http://mail.python.org/mailman/listinfo/python-list
On text processing
Hi list, I'm in a process of rewriting a bash/awk/sed script -- that grew too big -- in Python. I can rewrite it in a simple line-by-line way but that results in ugly Python code and I'm sure there is a simple pythonic way. The bash script processed text files of the form:

    ###
    key1 value1
    key2 value2
    key3 value3
    key4 value4
    spec11 spec12 spec13 spec14
    spec21 spec22 spec23 spec24
    spec31 spec32 spec33 spec34

    key5 value5
    key6 value6
    key7 value7
    more11 more12 more13
    more21 more22 more23

    key8 value8
    ###

I guess you get the point. If a line has two entries, it is a key/value pair which should end up in a dictionary. If a key/value pair is followed by consecutive lines with more than two entries, it is a matrix that should end up in a list of lists (matrix) that can be identified by the key preceding it. The empty line after the last line of a matrix signifies that the matrix is finished and we are back to a key/value situation. Note that a matrix is always preceded by a key/value pair so that it can really be identified by the key. Any elegant solution for this? -- http://mail.python.org/mailman/listinfo/python-list
Re: On text processing
Daniel Nogradi:

 Any elegant solution for this?

This is my first try:

    ddata = {}
    inside_matrix = False
    for row in file("data.txt"):
        if row.strip():
            fields = row.split()
            if len(fields) == 2:
                inside_matrix = False
                ddata[fields[0]] = [fields[1]]
                lastkey = fields[0]
            else:
                if inside_matrix:
                    ddata[lastkey][1].append(fields)
                else:
                    ddata[lastkey].append([fields])
                    inside_matrix = True

    # This gives some output for testing only:
    for k in sorted(ddata):
        print k, ddata[k]

Input file data.txt:

    key1 value1
    key2 value2
    key3 value3
    key4 value4
    spec11 spec12 spec13 spec14
    spec21 spec22 spec23 spec24
    spec31 spec32 spec33 spec34

    key5 value5
    key6 value6
    key7 value7
    more11 more12 more13
    more21 more22 more23

    key8 value8

The output:

    key1 ['value1']
    key2 ['value2']
    key3 ['value3']
    key4 ['value4', [['spec11', 'spec12', 'spec13', 'spec14'], ['spec21', 'spec22', 'spec23', 'spec24'], ['spec31', 'spec32', 'spec33', 'spec34']]]
    key5 ['value5']
    key6 ['value6']
    key7 ['value7', [['more11', 'more12', 'more13'], ['more21', 'more22', 'more23']]]
    key8 ['value8']

If there are many simple keys, then you can avoid creating a single element list for them, but then you have to tell apart the two cases on the base of the key (while now the presence of the second element is able to tell apart the two situations). You can also use two different dicts to keep the two different kinds of data. Bye, bearophile -- http://mail.python.org/mailman/listinfo/python-list
Re: On text processing
This is my first try: ddata = {} inside_matrix = False for row in file(data.txt): if row.strip(): fields = row.split() if len(fields) == 2: inside_matrix = False ddata[fields[0]] = [fields[1]] lastkey = fields[0] else: if inside_matrix: ddata[lastkey][1].append(fields) else: ddata[lastkey].append([fields]) inside_matrix = True # This gives some output for testing only: for k in sorted(ddata): print k, ddata[k] Input file data.txt: key1value1 key2value2 key3value3 key4value4 spec11 spec12 spec13 spec14 spec21 spec22 spec23 spec24 spec31 spec32 spec33 spec34 key5value5 key6value6 key7value7 more11 more12 more13 more21 more22 more23 key8value8 The output: key1 ['value1'] key2 ['value2'] key3 ['value3'] key4 ['value4', [['spec11', 'spec12', 'spec13', 'spec14'], ['spec21', 'spec22', 'spec23', 'spec24'], ['spec31', 'spec32', 'spec33', 'spec34']]] key5 ['value5'] key6 ['value6'] key7 ['value7', [['more11', 'more12', 'more13'], ['more21', 'more22', 'more23']]] key8 ['value8'] If there are many simple keys, then you can avoid creating a single element list for them, but then you have to tell apart the two cases on the base of the key (while now the presence of the second element is able to tell apart the two situations). You can also use two different dicts to keep the two different kinds of data. Bye, bearophile Thanks very much, it's indeed quite simple. I was lost in the itertools documentation :) -- http://mail.python.org/mailman/listinfo/python-list
Re: On text processing
On Mar 23, 10:30 pm, Daniel Nogradi [EMAIL PROTECTED] wrote:

 Hi list, I'm in a process of rewriting a bash/awk/sed script -- that grew too big -- in Python. I can rewrite it in a simple line-by-line way but that results in ugly Python code and I'm sure there is a simple pythonic way. The bash script processed text files of the form:

     ###
     key1 value1
     key2 value2
     key3 value3
     key4 value4
     spec11 spec12 spec13 spec14
     spec21 spec22 spec23 spec24
     spec31 spec32 spec33 spec34

     key5 value5
     key6 value6
     key7 value7
     more11 more12 more13
     more21 more22 more23

     key8 value8
     ###

 I guess you get the point. If a line has two entries, it is a key/value pair which should end up in a dictionary. If a key/value pair is followed by consecutive lines with more than two entries, it is a matrix that should end up in a list of lists (matrix) that can be identified by the key preceding it. The empty line after the last line of a matrix signifies that the matrix is finished and we are back to a key/value situation. Note that a matrix is always preceded by a key/value pair so that it can really be identified by the key. Any elegant solution for this?
My solution expects correctly formatted input and parses it into separate key/value and matrix holding dicts:

    from StringIO import StringIO

    fileText = '''\
    key1 value1
    key2 value2
    key3 value3
    key4 value4
    spec11 spec12 spec13 spec14
    spec21 spec22 spec23 spec24
    spec31 spec32 spec33 spec34

    key5 value5
    key6 value6
    key7 value7
    more11 more12 more13
    more21 more22 more23

    key8 value8
    '''

    infile = StringIO(fileText)
    keyvalues = {}
    matrices = {}
    for line in infile:
        fields = line.strip().split()
        if len(fields) == 2:
            keyvalues[fields[0]] = fields[1]
            lastkey = fields[0]
        elif fields:
            matrices.setdefault(lastkey, []).append(fields)

Here is the sample output:

    >>> from pprint import pprint as pp
    >>> pp(keyvalues)
    {'key1': 'value1',
     'key2': 'value2',
     'key3': 'value3',
     'key4': 'value4',
     'key5': 'value5',
     'key6': 'value6',
     'key7': 'value7',
     'key8': 'value8'}
    >>> pp(matrices)
    {'key4': [['spec11', 'spec12', 'spec13', 'spec14'],
              ['spec21', 'spec22', 'spec23', 'spec24'],
              ['spec31', 'spec32', 'spec33', 'spec34']],
     'key7': [['more11', 'more12', 'more13'],
              ['more21', 'more22', 'more23']]}

- Paddy. -- http://mail.python.org/mailman/listinfo/python-list
Re: On text processing
On Mar 23, 5:30 pm, Daniel Nogradi [EMAIL PROTECTED] wrote:

 Hi list, I'm in a process of rewriting a bash/awk/sed script -- that grew too big -- in Python. I can rewrite it in a simple line-by-line way but that results in ugly Python code and I'm sure there is a simple pythonic way. The bash script processed text files of the form... Any elegant solution for this?

Is a parser overkill? Here's how you might use pyparsing for this problem. I just wanted to show that pyparsing's returned results can be structured as more than just lists of tokens. Using pyparsing's Dict class (or the dictOf helper that simplifies using Dict), you can return results that can be accessed like a nested list, like a dict, or like an instance with named attributes (see the last line of the example). You can adjust the syntax definition of keys and values to fit your actual data; for instance, if the matrices are actually integers, then define the matrixRow as:

    matrixRow = Group( OneOrMore( Word(nums) ) ) + eol

-- Paul

    from pyparsing import ParserElement, LineEnd, Word, alphas, alphanums, \
        Group, ZeroOrMore, OneOrMore, Optional, dictOf

    data = """\
    key1 value1
    key2 value2
    key3 value3
    key4 value4
    spec11 spec12 spec13 spec14
    spec21 spec22 spec23 spec24
    spec31 spec32 spec33 spec34

    key5 value5
    key6 value6
    key7 value7
    more11 more12 more13
    more21 more22 more23

    key8 value8
    """

    # retain significant newlines (pyparsing reads over whitespace by default)
    ParserElement.setDefaultWhitespaceChars(" \t")

    eol = LineEnd().suppress()
    elem = Word(alphas, alphanums)
    key = elem
    matrixRow = Group( elem + elem + OneOrMore(elem) ) + eol
    matrix = Group( OneOrMore( matrixRow ) ) + eol
    value = elem + eol + Optional( matrix ) + ZeroOrMore(eol)
    parser = dictOf(key, value)

    # parse the data
    results = parser.parseString(data)

    # access the results
    # - like a dict
    # - like a list
    # - like an instance with keys for attributes
    print results.keys()
    print
    for k in sorted(results.keys()):
        print k,
        if isinstance(results[k], basestring):
            print results[k]
        else:
            print results[k][0]
            for row in results[k][1]:
                print "   ", " ".join(row)
        print
    print results.key3

Prints out:

    ['key8', 'key3', 'key2', 'key1', 'key7', 'key6', 'key5', 'key4']

    key1 value1
    key2 value2
    key3 value3
    key4 value4
        spec11 spec12 spec13 spec14
        spec21 spec22 spec23 spec24
        spec31 spec32 spec33 spec34
    key5 value5
    key6 value6
    key7 value7
        more11 more12 more13
        more21 more22 more23
    key8 value8

    value3

-- http://mail.python.org/mailman/listinfo/python-list
Suitability for long-running text processing?
I have a pair of python programs that parse and index files on my computer to make them searchable. The problem that I have is that they continually grow until my system is out of memory, and then things get ugly. I remember, when I was first learning python, reading that the python interpreter doesn't gc small strings, but I assumed that was outdated and sort of forgot about it. Unfortunately, it seems this is still the case. A sample program (to type/copy and paste into the python REPL):

a = []
for i in xrange(33, 127):
    for j in xrange(33, 127):
        for k in xrange(33, 127):
            for l in xrange(33, 127):
                a.append(chr(i)+chr(j)+chr(k)+chr(l))

del(a)
import gc
gc.collect()

The loop is deep enough that I always interrupt it once python's size is around 250 MB. Once the gc.collect() call is finished, python's size has not changed a bit. Even though there are no locals, no references at all to all the strings that were created, python will not reduce its size. This example is obviously artificial, but I am getting the exact same behaviour in my real programs. Is there some way to convince python to get rid of all the data that is no longer referenced, or do I need to use a different language? This has been tried under python 2.4.3 in gentoo linux and python 2.3 under OS X.3. Any suggestions/workarounds would be much appreciated. -- http://mail.python.org/mailman/listinfo/python-list
Re: Suitability for long-running text processing?
After reading http://www.python.org/doc/faq/general/#how-does-python-manage-memory, I tried modifying this program as below:

a = []
for i in xrange(33, 127):
    for j in xrange(33, 127):
        for k in xrange(33, 127):
            for l in xrange(33, 127):
                a.append(chr(i)+chr(j)+chr(k)+chr(l))

import sys
sys.exc_clear()
sys.exc_traceback = sys.last_traceback = None
del(a)
import gc
gc.collect()

And it still never frees up its memory. -- http://mail.python.org/mailman/listinfo/python-list
Re: Suitability for long-running text processing?
On 1/8/07, tsuraan [EMAIL PROTECTED] wrote: [snip] The loop is deep enough that I always interrupt it once python's size is around 250 MB. Once the gc.collect() call is finished, python's size has not changed a bit. [snip] This has been tried under python 2.4.3 in gentoo linux and python 2.3 under OS X.3. Any suggestions/work arounds would be much appreciated.

I just tried on my system:

(Python is using 2.9 MiB)
a = ['a' * (1 << 20) for i in xrange(300)]
(Python is using 304.1 MiB)
del a
(Python is using 2.9 MiB -- as before)

And I didn't even need to tell the garbage collector to do its job. Some info:

$ cat /etc/issue
Ubuntu 6.10 \n \l

$ uname -r
2.6.19-ck2

$ python -V
Python 2.4.4c1

-- Felipe. -- http://mail.python.org/mailman/listinfo/python-list
Re: Suitability for long-running text processing?
I just tried on my system:

(Python is using 2.9 MiB)
a = ['a' * (1 << 20) for i in xrange(300)]
(Python is using 304.1 MiB)
del a
(Python is using 2.9 MiB -- as before)

And I didn't even need to tell the garbage collector to do its job. Some info:

It looks like the big difference between our two programs is that you have one huge string repeated 300 times, whereas I have thousands of four-character strings. Are small strings ever collected by python? -- http://mail.python.org/mailman/listinfo/python-list
Re: Suitability for long-running text processing?
On 1/8/07, tsuraan [EMAIL PROTECTED] wrote: I just tried on my system (Python is using 2.9 MiB) a = ['a' * (1 << 20) for i in xrange(300)] (Python is using 304.1 MiB) del a (Python is using 2.9 MiB -- as before) And I didn't even need to tell the garbage collector to do its job. Some info: It looks like the big difference between our two programs is that you have one huge string repeated 300 times, whereas I have thousands of four-character strings. Are small strings ever collected by python?

In my test there are 300 strings of 1 MiB, not a huge string repeated. However:

$ python
Python 2.4.4c1 (#2, Oct 11 2006, 21:51:02)
[GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
# Python is using 2.7 MiB
>>> a = ['1234' for i in xrange(10 << 20)]
# Python is using 42.9 MiB
>>> del a
# Python is using 2.9 MiB

With 10,485,760 strings of 4 chars, it still works as expected. -- Felipe. -- http://mail.python.org/mailman/listinfo/python-list
Re: Suitability for long-running text processing?
On 1/8/07, Felipe Almeida Lessa [EMAIL PROTECTED] wrote: On 1/8/07, tsuraan [EMAIL PROTECTED] wrote: I just tried on my system (Python is using 2.9 MiB) a = ['a' * (1 << 20) for i in xrange(300)] (Python is using 304.1 MiB) del a (Python is using 2.9 MiB -- as before) And I didn't even need to tell the garbage collector to do its job. Some info: It looks like the big difference between our two programs is that you have one huge string repeated 300 times, whereas I have thousands of four-character strings. Are small strings ever collected by python? In my test there are 300 strings of 1 MiB, not a huge string repeated. However:

$ python
Python 2.4.4c1 (#2, Oct 11 2006, 21:51:02)
[GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
# Python is using 2.7 MiB
>>> a = ['1234' for i in xrange(10 << 20)]
# Python is using 42.9 MiB
>>> del a
# Python is using 2.9 MiB

With 10,485,760 strings of 4 chars, it still works as expected. -- Felipe.

Have you actually run the OP's code? It has clearly different behavior than what you are posting, and the OP's code, to me at least, seems much more representative of real-world code. In your second case, you have the *same* string 10,485,760 times; in the OP's case each string is different. My first thought was that interned strings were causing the growth, but that doesn't seem to be the case. Regardless, what you're posting is clearly different, and has different behavior, than what he posted. If you don't see the memory leak when you run the code he posted (the *same* code), that'd be important information. -- http://mail.python.org/mailman/listinfo/python-list
Re: Suitability for long-running text processing?
$ python
Python 2.4.4c1 (#2, Oct 11 2006, 21:51:02)
[GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
# Python is using 2.7 MiB
>>> a = ['1234' for i in xrange(10 << 20)]
# Python is using 42.9 MiB
>>> del a
# Python is using 2.9 MiB

With 10,485,760 strings of 4 chars, it still works as expected.

Have you tried running the code I posted? Is there any explanation as to why the code I posted fails to ever be cleaned up? In your specific example, you have a huge array of pointers to a single string. Try doing a[0] is a[1]. You'll get True. Try a[0] is '1'+'2'+'3'+'4'. You'll get False. Every element of a is a pointer to the exact same string. When you delete a, you're getting rid of a huge array of pointers, but probably not actually losing the four-byte (plus gc overhead) string '1234'. So, does anybody know how to get python to free up _all_ of its allocated strings? -- http://mail.python.org/mailman/listinfo/python-list
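The identity checks described above can be reproduced directly. A small sketch follows; the results are CPython implementation details, not language guarantees:

```python
# A literal in a comprehension is loaded from the code object's
# constants, so every list slot references the same string object.
shared = ['1234' for _ in range(1000)]
print(all(s is shared[0] for s in shared))

# Strings built at run time are separate objects, even when equal,
# so a list like this really does hold thousands of distinct strings.
built = [''.join(chr(c) for c in (49, 50, 51, 52)) for _ in range(1000)]
print(built[0] == '1234', built[0] is built[1])
```

This is exactly the difference between Felipe's test and the original program: one shared object versus many distinct ones.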
Re: Suitability for long-running text processing?
My first thought was that interned strings were causing the growth, but that doesn't seem to be the case. Interned strings, as of 2.3, are no longer immortal, right? The intern doc says you have to keep a reference around to the string now, anyhow. I really wish I could find that thing I read a year and a half ago about python never collecting small strings, but I just can't find it anymore. Maybe it's time for me to go source diving... -- http://mail.python.org/mailman/listinfo/python-list
Re: Suitability for long-running text processing?
On 1/8/07, tsuraan [EMAIL PROTECTED] wrote: My first thought was that interned strings were causing the growth, but that doesn't seem to be the case. Interned strings, as of 2.3, are no longer immortal, right? The intern doc says you have to keep a reference around to the string now, anyhow. I really wish I could find that thing I read a year and a half ago about python never collecting small strings, but I just can't find it anymore. Maybe it's time for me to go source diving... I remember something about it coming up in some of the discussions of free lists and better behavior in this regard in 2.5, but I don't remember the details. Interned strings aren't supposed to be immortal, these strings shouldn't be automatically interned anyway (and my brief testing seemed to bear that out) and calling _Py_ReleaseInternedStrings didn't recover any memory, so I'm pretty sure interning is not the culprit. -- http://mail.python.org/mailman/listinfo/python-list
Re: Suitability for long-running text processing?
I remember something about it coming up in some of the discussions of free lists and better behavior in this regard in 2.5, but I don't remember the details. Under Python 2.5, my original code posting no longer exhibits the bug - upon calling del(a), python's size shrinks back to ~4 MB, which is its starting size. I guess I'll see how painful it is to migrate a gentoo system to 2.5... Thanks for the hint :) -- http://mail.python.org/mailman/listinfo/python-list
Beginner question on text processing
I am beginning to use python primarily to organize data into formats needed for input into some statistical packages. I do not have much programming experience outside of LaTeX and R, so some of this is a bit new. I am attempting to write a program that reads in a text file that contains some values and it would then output a new file that has manipulated this original text file in some manner. To illustrate, assume I have a text file, call it test.txt, with the following information:

X11 .32
X22 .45

My goal in the python program is to manipulate this file such that a new file would be created that looks like:

X11 IPB = .32
X22 IPB = .45

Here is what I have accomplished so far.

# Python code below for sample program called 'test.py'
# Read in a file with the item parameters
filename = raw_input("Please enter the file you want to open: ")
params = open(filename, 'r')
for i in params:
    print 'IPB = ', i
# end code

This obviously results in the following:

IPB = X11 .32
IPB = X22 .45

So, my questions may be trivial, but: 1) How do I print the 'IPB = ' before the numbers? 2) Is there a better way to prompt the user to open the desired file rather than the way I have it above? For example, is there a built-in function that would open a windows dialogue box such that a user who does not know about path names can use windows to look for the file and click on it. 3) Last, what is the best way to have the output saved as a new file called 'test2.txt'. The only way I know how to do this now is to do something like: python test.py > test2.txt

Thank you for any help -- http://mail.python.org/mailman/listinfo/python-list
Re: Beginner question on text processing
Harold> To illustrate, assume I have a text file, call it test.txt, with
Harold> the following information:
Harold> X11 .32
Harold> X22 .45
Harold> My goal in the python program is to manipulate this file such
Harold> that a new file would be created that looks like:
Harold> X11 IPB = .32
Harold> X22 IPB = .45
...

This is a problem with a number of different solutions. Here's one way to do it:

for line in open(filename, "r"):
    fields = line.split()
    print fields[0], "IPB =", fields[1]

Skip -- http://mail.python.org/mailman/listinfo/python-list
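Skip's loop can be extended into a complete sketch that also covers question 3 by writing straight to test2.txt instead of relying on shell redirection. This version targets Python 2.7+/3; the file names follow the example, and the sample input is created inline so the sketch is self-contained:

```python
# Create the sample input from the post so the sketch runs as-is.
with open("test.txt", "w") as f:
    f.write("X11 .32\nX22 .45\n")

# Read test.txt, insert the "IPB = " label, and write test2.txt.
with open("test.txt") as src, open("test2.txt", "w") as dst:
    for line in src:
        fields = line.split()
        if len(fields) == 2:                # skip blank or malformed lines
            dst.write("%s IPB = %s\n" % (fields[0], fields[1]))
```

After running it, test2.txt contains "X11 IPB = .32" and "X22 IPB = .45" on separate lines.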
fast text processing
(I tried to post this yesterday but I think my ISP ate it. Apologies if this is a double-post.)

Is it possible to do very fast string processing in python? My bioinformatics application needs to scan very large ASCII files (80GB+), compare adjacent lines, and conditionally do some further processing. I believe the disk i/o is the main bottleneck so for now that's what I'm optimizing. What I have now is roughly as follows (on python 2.3.5).

filehandle = open(data,'r',buffering=1000)
lastLine = filehandle.readline()
for currentLine in filehandle.readlines():
    lastTokens = lastLine.strip().split(delimiter)
    currentTokens = currentLine.strip().split(delimiter)
    lastGeno = extract(lastTokens[0])
    currentGeno = extract(currentTokens[0])
    # prepare for next iteration
    lastLine = currentLine
    if lastGeno == currentGeno:
        table.markEquivalent(int(lastTokens[1]),int(currentTokens[1]))

So on every iteration I'm processing mutable strings -- this seems wrong. What's the best way to speed this up? Can I switch to some fast byte-oriented immutable string library? Are there optimizing compilers? Are there better ways to prep the file handle? Perhaps this is a job for C, but I am of that soft generation which fears memory management. I'd need to learn how to do buffered reading in C, how to wrap the C in python, and how to let the C call back into python to call markEquivalent(). It sounds painful. I _have_ done some benchmark comparisons of only the underlying line-based file reading against a Common Lisp version, but I doubt I'm using the optimal construct in either language so I hesitate to trust my results, and anyway the interlanguage bridge will be even more obscure in that case. Much obliged for any help, Alexis -- http://mail.python.org/mailman/listinfo/python-list
Re: fast text processing
Alexis Gallagher wrote: (I tried to post this yesterday but I think my ISP ate it. Apologies if this is a double-post.) Is it possible to do very fast string processing in python? My bioinformatics application needs to scan very large ASCII files (80GB+), compare adjacent lines, and conditionally do some further processing. I believe the disk i/o is the main bottleneck so for now that's what I'm optimizing. What I have now is roughly as follows (on python 2.3.5).

filehandle = open(data,'r',buffering=1000)

This buffer size seems, shall we say, unadventurous? It's likely to slow things down considerably, since the filesystem is probably going to naturally want to use a rather larger value. I'd suggest a 64k minimum.

lastLine = filehandle.readline()

I'd suggest

    lastTokens = filehandle.readline().strip().split(delimiter)

here. You have no need of the line other than to split it into tokens.

for currentLine in filehandle.readlines():

Note that this is going to read the whole file into (virtual) memory before entering the loop. I somehow suspect you'd rather avoid this if you could. I further suspect your testing has been with smaller files than 80GB ;-). You might want to consider

    for currentLine in filehandle:

as an alternative. This uses the file's generator properties to produce the next line each time round the loop.

lastTokens = lastLine.strip().split(delimiter)

The line above goes away if you adopt the loop initialization suggestion above. Otherwise you are repeating the splitting of each line twice, once as the current line then again as the last line.

currentTokens = currentLine.strip().split(delimiter)
lastGeno = extract(lastTokens[0])
currentGeno = extract(currentTokens[0])

If the extract() operation is stateless (in other words, if it always produces the same output for a given input) then again you are unnecessarily repeating yourself here by running extract() on the same data as the current first token and the last first token (if you see what I mean).

I might also observe that you seem to expect only two tokens per line. If this is invariably the case then you might want to consider writing an unpacking assignment instead, such as

    cToken0, cToken1 = currentLine.strip().split(delimiter)

to save the indexing. Not a big deal, though, and it *will* break if you have more than one delimiter in a line, as the interpreter won't then know what to do with the third and subsequent elements.

# prepare for next iteration
lastLine = currentLine

Of course now you are going to try and strip the delimiter off the line and split it again when you loop around again. You should now just be able to say

    lastTokens = currentTokens

instead.

if lastGeno == currentGeno:
    table.markEquivalent(int(lastTokens[1]),int(currentTokens[1]))

So on every iteration I'm processing mutable strings -- this seems wrong. What's the best way to speed this up? Can I switch to some fast byte-oriented immutable string library? Are there optimizing compilers? Are there better ways to prep the file handle?

I'm sorry but I am not sure where the mutable strings come in. Python strings are immutable anyway. Well-known for it. It might be a slight problem that you are creating a second terminator-less copy of each line, but that's probably inevitable. Of course you leave us in the dark about the nature of table.markEquivalent as well. Depending on the efficiency of the extract() operation you might want to consider short-circuiting the loop if the two tokens have already been marked as equivalent. That might be a big win or not depending on relative efficiency.

Perhaps this is a job for C, but I am of that soft generation which fears memory management. I'd need to learn how to do buffered reading in C, how to wrap the C in python, and how to let the C call back into python to call markEquivalent(). It sounds painful.
I _have_ done some benchmark comparisons of only the underlying line-based file reading against a Common Lisp version, but I doubt I'm using the optimal construct in either language so I hesitate to trust my results, and anyway the interlanguage bridge will be even more obscure in that case.

Probably the biggest gain will be in simply not reading the whole file into memory by calling its .readlines() method. Summarising, consider something more like:

filehandle = open(data,'r',buffering=64*1024)
# You could also try just leaving the buffering spec out
lastTokens = filehandle.readline().strip().split(delim)
lastGeno = extract(lastTokens[0])
for currentLine in filehandle:
    currentTokens = currentLine.strip().split(delim)
    currentGeno = extract(currentTokens[0])
    if lastGeno == currentGeno:
        table.markEquivalent(int(lastTokens[1]), int(currentTokens[1]))
    lastGeno = currentGeno
    lastTokens = currentTokens

Much obliged for any help,
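The adjacent-line pattern Steve outlines can also be written so each line is split and extract()ed exactly once, by pairing each record with its successor via itertools.tee. This is a self-contained sketch: extract(), the tab delimiter, and the sample records are invented stand-ins for the poster's real data and equivalence table:

```python
from itertools import tee
import io

def extract(token):
    # hypothetical stand-in for the poster's extract()
    return token.split(':')[0]

# in the real program this would be the open file object
data = io.StringIO("g1:a\t1\ng1:b\t2\ng2:c\t3\n")

records = (line.rstrip('\n').split('\t') for line in data)
a, b = tee(records)
next(b, None)                     # advance one copy by a single record

equivalent = []
for last, current in zip(a, b):   # pairs: (rec0, rec1), (rec1, rec2), ...
    if extract(last[0]) == extract(current[0]):
        equivalent.append((int(last[1]), int(current[1])))

print(equivalent)                 # the pairs that would be marked equivalent
```

Because everything is generator-based, only two records are held at a time, so the 80GB file is never read into memory at once.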
Re: fast text processing
Maybe this code will be faster? (If it even does the same thing: largely untested)

filehandle = open(data,'r',buffering=1000)
fileIter = iter(filehandle)
lastLine = fileIter.next()
lastTokens = lastLine.strip().split(delimiter)
lastGeno = extract(lastTokens[0])
for currentLine in fileIter:
    currentTokens = currentLine.strip().split(delimiter)
    currentGeno = extract(currentTokens[0])
    if lastGeno == currentGeno:
        table.markEquivalent(int(lastTokens[1]),int(currentTokens[1]))
    # prepare for next iteration
    lastLine = currentLine
    lastTokens = currentTokens
    lastGeno = currentGeno

I'd be tempted to try a bigger file buffer too, personally. -- Ben Sizer -- http://mail.python.org/mailman/listinfo/python-list
Re: fast text processing
Steve, First, many thanks!

Steve Holden wrote: Alexis Gallagher wrote: filehandle = open(data,'r',buffering=1000)

This buffer size seems, shall we say, unadventurous? It's likely to slow things down considerably, since the filesystem is probably going to naturally want to use a rather larger value. I'd suggest a 64k minimum.

Good to know. I should have dug into the docs deeper. Somehow I thought it listed lines not bytes.

for currentLine in filehandle.readlines():

Note that this is going to read the whole file into (virtual) memory before entering the loop. I somehow suspect you'd rather avoid this if you could. I further suspect your testing has been with smaller files than 80GB ;-). You might want to consider

Oops! Thanks again. I thought that readlines() was the generator form, based on the docstring comments about the deprecation of xreadlines().

So on every iteration I'm processing mutable strings -- this seems wrong. What's the best way to speed this up? Can I switch to some fast byte-oriented immutable string library? Are there optimizing compilers? Are there better ways to prep the file handle?

I'm sorry but I am not sure where the mutable strings come in. Python strings are immutable anyway. Well-known for it.

I misspoke. I think I was mixing this up with the issue of object-creation overhead for all of the string handling in general. Is this a bottleneck to string processing in python, or is this a hangover from my Java days? I would have thought that dumping the standard string processing libraries in favor of byte manipulation would have been one of the biggest wins.

Of course you leave us in the dark about the nature of table.markEquivalent as well.

markEquivalent() implements union-join (aka, uptrees) to generate equivalence classes. Optimising that was going to be my next task.

I feel a bit silly for missing the double-processing of everything. Thanks for pointing that out. And I will check out the biopython package.

I'm still curious if optimizing compilers are worth examining. For instance, I saw Pyrex and Psyco mentioned on earlier threads. I'm guessing that both this tokenizing and the uptree implementations sound like good candidates for one of those tools, once I shake out these algorithmic problems.

alexis -- http://mail.python.org/mailman/listinfo/python-list
Re: fast text processing
Alexis Gallagher wrote:
[snip]

When your problem is I/O bound there is almost nothing that can be done to speed it up without some sort of refactoring of the input data itself. Python reads bytes off a hard drive just as fast as any compiled language. A good test is to copy the file and measure the time. You can't make your program run any faster than a copy of the file itself without making hardware changes (e.g. RAID arrays, etc.). You might also want to take a look at the csv module. Reading lines and splitting on delimiters is almost always handled well by csv.

-Larry Bates -- http://mail.python.org/mailman/listinfo/python-list
Newbie Text Processing Question
Hi, I'm a total newbie to Python so any and all advice is greatly appreciated. I'm trying to use regular expressions to process text in an SGML file but only in one section. So the input would look like this:

<ch-part no="I"><title>RESEARCH GUIDE
<sec-main no="1.01"><title>content
<para>content
<sec-main no="2.01"><title>content
<para>content
<ch-part no="II"><title>FORMS
<sec-main no="3.01"><title>content
<sec-sub1 no="1"><title>content
<para>content
<sec-sub2 no="1"><title>content
<para>content

and the output like this:

<ch-part no="I"><title>RESEARCH GUIDE
<sec-main no="1.01"><title>content
<biblio>
<para>content
</biblio>
<sec-main no="2.01"><title>content
<biblio>
<para>content
</biblio>
<ch-part no="II"><title>FORMS
<sec-main no="3.01"><title>content
<sec-sub1 no="1"><title>content
<para>content
<sec-sub2 no="1"><title>content
<para>content

But no matter what I try I end up changing the entire file rather than just one part. Here's what I've come up with so far but I can't think of anything else.

***
import os, re

setpath = raw_input("Enter the path where the program should run: ")
print

for root, dirs, files in os.walk(setpath):
    fname = files
    for fname in files:
        inputFile = file(os.path.join(root,fname), 'r')
        line = inputFile.read()
        inputFile.close()
        chpart_pattern = re.compile(r'<ch-part no=\"[A-Z]{1,4}\"><title>(RESEARCH)', re.IGNORECASE)
        while 1:
            if chpart_pattern.search(line):
                line = re.sub(r"<sec-main no=(\"[0-9]*.[0-9]*\")><title>(.*)",
                              r"<sec-main no=\1><title>\2\n<biblio>", line)
                outputFile = file(os.path.join(root,fname), 'w')
                outputFile.write(line)
                outputFile.close()
                break
            if chpart_pattern.search(line) is None:
                print 'none'
                break

Thanks, Greg -- http://mail.python.org/mailman/listinfo/python-list
Re: Newbie Text Processing Question
That's how Python works. You read in the whole file, edit it, and write it back out. As far as I know there's no way to edit a file in place, which I'm assuming is what you're asking? And now, cue the responses telling you to use a fancy parser (XML?) for your project ;-)

-Greg

On 4 Oct 2005 20:04:39 -0700, [EMAIL PROTECTED] wrote:
[snip]

-- Gregory Piñero
Chief Innovation Officer
Blended Technologies (www.blendedtechnologies.com)
-- http://mail.python.org/mailman/listinfo/python-list
Re: Newbie Text Processing Question
You can edit a file in place, but it is not applicable to what you are doing. As soon as you insert the first <biblio>, you've shifted everything downstream by those 8 bytes. Since they map to physically located blocks on a physical drive, you will have to rewrite those blocks. If it is a big file you can do something conceptually similar to piping, where the original file is read in line by line and a new file is created:

afile = open("somefile.xml")
newfile = open("somenewfile.xml", "w")
for aline in afile:
    if tests_positive(aline):
        newfile.write(make_the_prelude(aline))
        newfile.write(aline)
        newfile.write(make_the_afterlude(aline))
    else:
        newfile.write(aline)
afile.close()
newfile.close()

James

On Tuesday 04 October 2005 20:13, Gregory Piñero wrote: That's how Python works. You read in the whole file, edit it, and write it back out. As far as I know there's no way to edit a file in place which I'm assuming is what you're asking?

-- James Stroud, UCLA-DOE Institute for Genomics and Proteomics, Box 951570, Los Angeles, CA 90095, http://www.jamesstroud.com/ -- http://mail.python.org/mailman/listinfo/python-list
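James's two-file pattern can be hardened a little: write the new copy to a temporary file in the same directory and rename it over the original only once it is complete, so an interrupted run never leaves a half-written file. A sketch (Python 3.3+ for os.replace; the file name and the tests_positive/prelude logic are invented stand-ins, with the <biblio> markers from the thread's example):

```python
import os
import tempfile

# Hypothetical file name; sample input is created here so the sketch
# is self-contained.
path = "chapter.sgml"
with open(path, "w") as f:
    f.write("<title>RESEARCH GUIDE\n<para>content\n")

# Write the new copy next to the original, then swap it in one step.
dirname = os.path.dirname(os.path.abspath(path))
fd, tmppath = tempfile.mkstemp(dir=dirname)
with os.fdopen(fd, "w") as newfile, open(path) as afile:
    for aline in afile:
        if aline.startswith("<para>"):       # tests_positive stand-in
            newfile.write("<biblio>\n")      # make_the_prelude
            newfile.write(aline)
            newfile.write("</biblio>\n")     # make_the_afterlude
        else:
            newfile.write(aline)
os.replace(tmppath, path)                    # atomic rename on POSIX

with open(path) as f:
    print(f.read())
```

Creating the temp file in the same directory matters: os.replace cannot move a file across filesystems, and keeping both copies on one device is what makes the final rename atomic.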
Re: Newbie Text Processing Question
[EMAIL PROTECTED] writes: I'm a total newbie to Python so any and all advice is greatly appreciated. Well, I've got some for you. I'm trying to use regular expressions to process text in an SGML file but only in one section. This is generally a bad idea. SGML family languages aren't easy to parse - even the ones that were designed to be easy to parse - and generally require very complex regular expessions to get right. It may be that your SGML data can be parsed by the re you use, but there are almost certainly valid SGML documents that your parser will not properly parse. In general, it's better to use a parser for the language in question. So the input would look like this: ch-part no=ItitleRESEARCH GUIDE sec-main no=1.01titlecontent paracontent sec-main no=2.01titlecontent paracontent ch-part no=IItitleFORMS sec-main no=3.01titlecontent sec-sub1 no=1titlecontent paracontent sec-sub2 no=1titlecontent paracontent This is funny-looking SGML. Are the the end tags really optional for all the tag types? But no matter what I try I end up changing the entire file rather than just one part. Other have explained why you can't do that, so I'll skip it. Here's what I've come up with so far but I can't think of anything else. *** import os, re setpath = raw_input(Enter the path where the program should run: ) print for root, dirs, files in os.walk(setpath): fname = files for fname in files: inputFile = file(os.path.join(root,fname), 'r') line = inputFile.read() inputFile.close() chpart_pattern = re.compile(r'ch-part no=\[A-Z]{1,4}\title(RESEARCH)', re.IGNORECASE) This makes a number of assumptions that are invalid about SGML in general, but may be valid for your sample text - how attributes are quoted, the lack of line breaks, which can be added without changing the content, and the format of the no attribute. while 1: if chpart_pattern.search(line): line = re.sub(rsec-main no=(\[0-9]*.[0-9]*\)title(.*), rsec-main no=\1title\2\nbiblio, line) Ditto. 
Heren's an sgmllib solution that gets does what you do above, except it writes it to standard out: #!/usr/bin/env python import sys from sgmllib import SGMLParser datain = ch-part no=ItitleRESEARCH GUIDE sec-main no=1.01titlecontent paracontent sec-main no=2.01titlecontent paracontent ch-part no=IItitleFORMS sec-main no=3.01titlecontent sec-sub1 no=1titlecontent paracontent sec-sub2 no=1titlecontent paracontent class Parser(SGMLParser): def __init__(self): # install the handlers with funny names setattr(self, start_ch-part, self.handle_ch_part) # And start with chapter 0 self.ch_num = 0 SGMLParser.__init__(self) def format_attributes(self, attributes): return ['%s=%s' % pair for pair in attributes] def unknown_starttag(self, tag, attributes): taglist = self.format_attributes(attributes) taglist.insert(0, tag) sys.stdout.write('%s' % ' '.join(taglist)) def handle_data(self, data): sys.stdout.write(data) def handle_ch_part(self, attributes): This should be called start_ch-part, but, well, you know. self.unknown_starttag('ch-part', attributes) for name, value in attributes: if name == 'no': self.ch_num = value def start_para(self, attributes): if self.ch_num == 'I': sys.stdout.write('biblio\n') self.unknown_starttag('para', attributes) parser = Parser() parser.feed(datain) parser.close() sgmllib isn't a very good SGML parser - it was written to support htmllib, and really only handles that subset of sgml well. In particular, it doesn't really understand DTDs, so can't handle the missing end tags in your example. You may be able to work around that. If you can coerce this to XML, then the xml tools in the standard library will work well. For HTML, I like BeautifulSoup, but that's mostly because it deals with all the crud on the net that is passed off as HTML. For SGML - well, I don't have a good answer. Last time I had to deal with real SGML, I used a C parser that spat out a parse tree that could be parsed properly. 
mike
--
Mike Meyer [EMAIL PROTECTED] http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information. -- http://mail.python.org/mailman/listinfo/python-list
Re: Newbie Text Processing Question
Gregory Piñero wrote:
> That's how Python works. You read in the whole file, edit it, and
> write it back out.

that's how file systems work. if file systems generally supported insert operations, Python would of course support that feature.

</F> -- http://mail.python.org/mailman/listinfo/python-list
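The read-everything, edit, write-back pattern described above can be made crash-safe by writing to a temporary file and renaming it over the original. A minimal sketch (the rewrite_file helper is hypothetical, not from the thread):

```python
import os
import tempfile

def rewrite_file(path, transform):
    """Read a whole text file, apply transform(text), and atomically replace it."""
    with open(path) as f:
        text = f.read()
    new_text = transform(text)
    # Write the new contents to a temp file in the same directory, then
    # rename it over the original: a crash mid-write never leaves the
    # target file half-written.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "w") as f:
            f.write(new_text)
        os.replace(tmp, path)
    except Exception:
        os.unlink(tmp)
        raise
```

The rename step relies on os.replace being atomic on the same filesystem, which is why the temp file is created next to the target rather than in the default temp directory.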
Re: Improving my text processing script
Even though you are using re's to try to look for specific substrings (which you sort of fake in by splitting on "Identifier", and then prepending "Identifier" to every list element, so that the re will match...), this program has quite a few holes. What if the word "Identifier" is inside one of the quoted strings? What if the actual value is "tablename10"? This will match your "tablename1" string search, but it is certainly not what you want. Did you know there are trailing blanks on your table names, which could prevent any program name from matching?

So here is an alternative approach using, as many have probably predicted by now if they've spent any time on this list, the pyparsing module. You may ask, "isn't a parser overkill for this problem?" and the answer will likely be "probably", but in the case of pyparsing, I'd answer "probably, but it is so easy, and takes care of so much junk like dealing with quoted strings and intermixed data, so who cares if it's overkill?"

So here is the 20-line pyparsing solution; insert it into your program after you have read in tlst, and read in the input data using something like data = file('plst').read(). (The first line strips the whitespace from the ends of your table names.)
    tlst = map(str.rstrip, tlst)

    from pyparsing import quotedString, LineStart, LineEnd, removeQuotes

    quotedString.setParseAction(removeQuotes)
    identLine = (LineStart() + "Identifier" + quotedString + LineEnd()).setResultsName("identifier")
    tableLine = (LineStart() + "Value" + quotedString + LineEnd()).setResultsName("tableref")
    interestingLines = (identLine | tableLine)

    thisprog = ""
    for toks, start, end in interestingLines.scanString(data):
        toktype = toks.getName()
        if toktype == 'identifier':
            thisprog = toks[1]
        elif toktype == 'tableref':
            thistable = toks[1]
            if thistable in tlst:
                print '%s,%s' % (thisprog, thistable)
            else:
                print "Not", thisprog, "- contains wrong table (" + thistable + ")"

This program will print out:

    Program1,tablename2
    Program 2,tablename2

Download pyparsing at http://pyparsing.sourceforge.net.
-- Paul -- http://mail.python.org/mailman/listinfo/python-list
Re: Improving my text processing script
Hello pruebauno,

> import re
> f=file('tlst')
> tlst=f.read().split('\n')
> f.close()

tlst = open('tlst').readlines()

> f=file('plst')
> sep=re.compile('Identifier "(.*?)"')
> plst=[]
> for elem in f.read().split('Identifier'):
>     content='Identifier'+elem
>     match=sep.search(content)
>     if match:
>         plst.append((match.group(1),content))
> f.close()

Look at re.findall, I think it'll be easier.

> flst=[]
> for table in tlst:
>     for prog,content in plst:
>         if content.find(table)>0:

if table in content:

>             flst.append('%s,%s'%(prog,table))
> flst.sort()
> for elem in flst:
>     print elem

print "\n".join(sorted(flst))

HTH.
--
Miki Tebeka [EMAIL PROTECTED] http://tebeka.bizhat.com
The only difference between children and adults is the price of the toys
-- http://mail.python.org/mailman/listinfo/python-list
Re: Improving my text processing script
Paul McGuire wrote:
> match...), this program has quite a few holes. What if the word
> "Identifier" is inside one of the quoted strings? What if the actual
> value is "tablename10"? This will match your "tablename1" string
> search, but it is certainly not what you want. Did you know there are
> trailing blanks on your table names, which could prevent any program
> name from matching?

Good point. I did not think about that. I got lucky because none of the table names had trailing blanks (google groups seems to add those), the word "Identifier" is not used inside of quoted strings anywhere, and I do not have tablename10, but I do have dba.tablename1 and that one has to match with tablename1 (and magically did).

> So here is an alternative approach using, as many have probably
> predicted by now if they've spent any time on this list, the pyparsing
> module. You may ask, "isn't a parser overkill for this problem?" and

You had to plug pyparsing! :-). Thanks for the info; I did not know something like pyparsing existed. Thanks for the code too, because looking at the module it was not totally obvious to me how to use it. I tried to run it though and it is not working for me. The following code runs but prints nothing at all:

    import pyparsing as prs

    f=file('tlst'); tlst=[ln.strip() for ln in f if ln]; f.close()
    f=file('plst'); plst=f.read(); f.close()

    prs.quotedString.setParseAction(prs.removeQuotes)
    identLine=(prs.LineStart() + 'Identifier' + prs.quotedString + prs.LineEnd()).setResultsName('prog')
    tableLine=(prs.LineStart() + 'Value' + prs.quotedString + prs.LineEnd()).setResultsName('table')
    interestingLines=(identLine | tableLine)

    for toks,start,end in interestingLines.scanString(plst):
        print toks,start,end

-- http://mail.python.org/mailman/listinfo/python-list
Re: Improving my text processing script
Miki Tebeka wrote:
> Look at re.findall, I think it'll be easier.

Minor changes aside, the interesting thing, as you pointed out, would be using re.findall. I could not figure out how to. -- http://mail.python.org/mailman/listinfo/python-list
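For what it's worth, here is a sketch of the re.findall route. The sample data is made up to mirror the plst format from the original post, and the quoting of the values is an assumption:

```python
import re

# Made-up sample mirroring the thread's plst format (quoting assumed).
plst = '''Identifier "Program1"
Name "Random Stuff"
Value "tablename2"
Identifier "Program 2"
Name "Yet more stuff"
Value "tablename2"
'''

tlst = ["tablename1", "tablename2"]

# One pass over the text: each match is an (identifier, value) tuple in
# which exactly one group is non-empty, depending on which alternative
# of the pattern matched that line.
pairs = re.findall(r'^(?:Identifier "([^"]*)"|Value "([^"]*)")', plst, re.MULTILINE)

flst = []
prog = None
for ident, value in pairs:
    if ident:
        prog = ident                      # remember the current program
    elif value in tlst:                   # a Value line naming a tracked table
        flst.append('%s,%s' % (prog, value))

print(sorted(flst))  # ['Program 2,tablename2', 'Program1,tablename2']
```

Because the two alternatives are matched line-anchored (re.MULTILINE), a stray "Identifier" inside a quoted Name string would still be a hole here, just as Paul pointed out for the split-based version.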
Re: Improving my text processing script
[EMAIL PROTECTED] wrote:
> Paul McGuire wrote:
> > match...), this program has quite a few holes.
> tried to run it though and it is not working for me. The following code
> runs but prints nothing at all:
> import pyparsing as prs

And this is the point where I have to post the real stuff, because your code works with the example I posted and not with the real thing. The identifier I am interested in is (if I understood the requirements correctly) the one after the title with the stars. So here is the real data, with some info replaced with z to protect privacy:

    * Identifier "zzz0main"
    * Identifier "zz501"
    Value "zzz_CLCL_,zz_ID"
    Name "z"
    Name "zz"
    * Identifier "3main"
    * Identifier "zzz505"
    Value "dba.zzz_CKPY__SUM"
    Name "xxx_xxx_xxx_DT"
    --
    Value "zzz__zzz_zzz"
    Name "zzz_zz_zzz"
    --
    Value "zzz_zzz_zzz_HIST"
    Name "zzz_zzz"
    --

-- http://mail.python.org/mailman/listinfo/python-list
Re: Improving my text processing script
Yes indeed, the real data often has surprising differences from the simulations! :)

It turns out that pyparsing LineStart()'s are pretty fussy. Usually, pyparsing is very forgiving about whitespace between expressions, but it turns out that LineStart *must* be followed by the next expression, with no leading whitespace. Fortunately, your syntax is really quite forgiving, in that your key-value pairs appear to always be an unquoted word (for the key) and a quoted string (for the value). So you should be able to get this working just by dropping the LineStart()'s from your expressions, that is:

    identLine = ('Identifier' + prs.quotedString + prs.LineEnd()).setResultsName('prog')
    tableLine = ('Value' + prs.quotedString + prs.LineEnd()).setResultsName('table')

See if that works any better for you.
-- Paul -- http://mail.python.org/mailman/listinfo/python-list
Improving my text processing script
I am sure there is a better way of writing this, but how?

    import re

    f=file('tlst')
    tlst=f.read().split('\n')
    f.close()

    f=file('plst')
    sep=re.compile('Identifier "(.*?)"')
    plst=[]
    for elem in f.read().split('Identifier'):
        content='Identifier'+elem
        match=sep.search(content)
        if match:
            plst.append((match.group(1),content))
    f.close()

    flst=[]
    for table in tlst:
        for prog,content in plst:
            if content.find(table)>0:
                flst.append('%s,%s'%(prog,table))
    flst.sort()
    for elem in flst:
        print elem

What would be the best way of writing this program? BTW, the find>0 is there to check in case table=='' (empty line) so I do not include everything.

tlst is of the form:

    tablename1
    tablename2
    ...

plst is of the form:

    Identifier "Program1"
    Name "Random Stuff"
    Value "tablename2"
    ...other random properties
    Name "More Random Stuff"
    Identifier "Program 2"
    Name "Yet more stuff"
    Value "tablename2"
    ...

I want to know in which programs the tables in tlst (and only those) are used. -- http://mail.python.org/mailman/listinfo/python-list
Re: text processing problem
Maurice LING wrote:
> Matt wrote:
>> I'd HIGHLY suggest purchasing the excellent "Mastering Regular
>> Expressions" by Jeff Friedl
>> (http://www.oreilly.com/catalog/regex2/index.html). Although it's
>> mostly geared towards Perl, it will answer all your questions about
>> regular expressions. If you're going to work with regexes, this is a
>> must-have.
>>
>> That being said, here's what the new regular expression should be,
>> with a bit of instruction (in the spirit of teaching someone to fish
>> after giving them a fish ;-) ):
>>
>> my_expr = re.compile(r'(\w+)\s*(\(\1\))')
>>
>> Note the \s*, in place of the single space. The \s means any
>> whitespace character (equivalent to [ \t\n\r\f\v]). The * following
>> it means 0 or more occurrences. So this will now match:
>>
>> there (there)
>> there  (there)
>> there(there)
>> there\t(there)      (tab)
>> there\t\t\t\t\t\t\t\t\t\t\t\t(there)
>>
>> etc. Hope that's helpful. Pick up the book!
>> M@
>
> Thanks again. I've read a number of tutorials on regular expressions
> but it's something that I hardly used in the past, so I've gone far too
> rusty. Before my post, I had tried
>
> my_expr = re.compile(r'(\w+) \s* (\(\1\))')
>
> instead but it doesn't work, so I'm a bit stumped.
>
> Thanks again, Maurice

Maurice,

The reason your regex failed is because you have spaces around the \s*. This translates to one space, followed by zero or more whitespace elements, followed by one space. So your regex would only match the two text elements separated by at least 2 spaces. This kind of demonstrates why regular expressions can drive you nuts. I still suggest picking up the book; not because Jeff Friedl drove a dump truck full of money up to my door, but because it specifically has a use case like yours. So you get to learn and solve your problem at the same time!

HTH,
M@ -- http://mail.python.org/mailman/listinfo/python-list
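To see the \s* fix in action, a quick sketch (Python 3 print syntax, unlike the Python 2 of the thread):

```python
import re

# \s matches any whitespace character; the * means zero or more, so the
# word and its parenthesized repeat may be separated by nothing, by one
# or more spaces, or by tabs.
my_expr = re.compile(r'(\w+)\s*(\(\1\))')

for s in ["there (there)", "there  (there)", "there(there)", "there\t(there)"]:
    print(my_expr.sub(r'\1', s))  # each prints: there
```

The broken variant r'(\w+) \s* (\(\1\))' only matches when both literal spaces are present, which is exactly the at-least-two-spaces behaviour described above.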
Re: text processing problem
Maurice LING wrote:
> I'm looking for a way to do this: I need to scan a text (paragraph or
> so) and look for occurrences of "text-x (text-x)". That is, if the
> text just before the open bracket is the same as the text in the
> brackets, then I have to delete the brackets, with the text in it.

How's this?

    import re

    bracket_re = re.compile(r'(.*?)\s*\(\1\)')

    def remove_brackets(text):
        return bracket_re.sub('\\1', text)

-- http://mail.python.org/mailman/listinfo/python-list
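A quick check of this idea; note that I've swapped the (.*?) group for \w+ here (my change, not the poster's) so that the backreference only ever captures a single word:

```python
import re

# \w+ instead of (.*?): capture one word, an optional whitespace run,
# then require the same word again in parentheses.
bracket_re = re.compile(r'(\w+)\s*\(\1\)')

def remove_brackets(text):
    return bracket_re.sub(r'\1', text)

print(remove_brackets("this is (is) a test"))     # this is a test
print(remove_brackets("there    (there) again"))  # there again
```

With \s* in the pattern, the multiple-whitespace case raised later in the thread is handled as well.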
Re: text processing problem
Maurice LING wrote:
> Hi,
>
> I'm looking for a way to do this: I need to scan a text (paragraph or
> so) and look for occurrences of "text-x (text-x)". That is, if the
> text just before the open bracket is the same as the text in the
> brackets, then I have to delete the brackets, with the text in it.
> Does anyone know any way to achieve this? The closest I've seen is
> this recipe
> (http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305306) by
> Raymond Hettinger:
>
> >>> s = 'People of [planet], take us to your leader.'
> >>> d = dict(planet='Earth')
> >>> print convert_template(s) % d
> People of Earth, take us to your leader.
>
> >>> s = 'People of <planet>, take us to your leader.'
> >>> print convert_template(s, '<', '>') % d
> People of Earth, take us to your leader.
>
> import re
> def convert_template(template, opener='[', closer=']'):
>     opener = re.escape(opener)
>     closer = re.escape(closer)
>     pattern = re.compile(opener + '([_A-Za-z][_A-Za-z0-9]*)' + closer)
>     return re.sub(pattern, r'%(\1)s', template.replace('%','%%'))
>
> Cheers
> Maurice

Try this:

    import re

    my_expr = re.compile(r'(\w+) (\(\1\))')
    s = "this is (is) a test"
    print my_expr.sub(r'\1', s)  # prints 'this is a test'

M@ -- http://mail.python.org/mailman/listinfo/python-list
Re: text processing problem
Matt wrote:
> Try this:
>
> import re
>
> my_expr = re.compile(r'(\w+) (\(\1\))')
> s = "this is (is) a test"
> print my_expr.sub(r'\1', s)  # prints 'this is a test'
>
> M@

Thank you Matt. It works out well. The only thing that gives it problems is cases such as "there  (there)", where there is more than one whitespace character between the word and the same bracketed word...

Cheers
Maurice -- http://mail.python.org/mailman/listinfo/python-list