Re: Text Processing

2011-12-22 Thread Yigit Turgut
On Dec 21, 2:01 am, Alexander Kapps alex.ka...@web.de wrote:
 On 20.12.2011 22:04, Nick Dokos wrote:

  I have a text file containing such data ;

           A                B                C
---------------------------------------
-2.0100e-01    8.000e-02    8.000e-05
-2.0000e-01    0.000e+00    4.800e-04
  -1.9900e-01    4.000e-02    1.600e-04

  But I only need Section B, and I need to change the notation to ;

  8.000e-02 = 0.08
  0.000e+00 = 0.00
  4.000e-02 = 0.04
  Does it have to be python? If not, I'd go with something similar to

      sed '1,2d' foo.data | awk '{printf("%.2f\n", $2);}'

 Why sed and awk:

 awk 'NR>2 {printf("%.2f\n", $2);}' data.txt

 And in Python:

 f = open("data.txt")
 f.readline()    # skip header
 f.readline()    # skip header
 for line in f:
      print "%.2f" % float(line.split()[1])

@Jerome ; Your suggestion produced a floating point error; it might need
some slight modification.

@Nick ; Sorry mate, it needs to be in Python. But I noted the solution in
case I need it for another task.

@Alexander ; Works as expected.

Thank you all for the replies.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text Processing

2011-12-20 Thread Dave Angel

On 12/20/2011 02:17 PM, Yigit Turgut wrote:

Hi all,

I have a text file containing such data ;

     A              B              C
---------------------------------------
-2.0100e-01    8.000e-02    8.000e-05
-2.0000e-01    0.000e+00    4.800e-04
-1.9900e-01    4.000e-02    1.600e-04

But I only need Section B, and I need to change the notation to ;

8.000e-02 = 0.08
0.000e+00 = 0.00
4.000e-02 = 0.04

Text file is approximately 10MB in size. I looked around to see if
there is a quick and dirty workaround but there are lots of modules,
lots of options.. I am confused.

Which module is most suitable for this task ?
You probably don't need anything but sys (to parse the command options) 
and os (maybe).


open the file
for each line
    if one of the header lines, continue
    separate out the part you want
    print it, formatted as you like

Then just run the script with its stdout redirected, and you've got your
new file.
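
A minimal sketch of that outline (untested; it assumes two header lines,
whitespace-separated columns, and a filename given on the command line):

import sys

with open(sys.argv[1]) as f:
    for n, line in enumerate(f):
        if n < 2:                   # one of the header lines
            continue
        parts = line.split()        # separate out the part you want
        if len(parts) >= 2:
            print "%.2f" % float(parts[1])   # formatted as you like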


The details depend on what your experience with Python is, and what 
version of Python you're running.


--

DaveA

--
http://mail.python.org/mailman/listinfo/python-list


Re: Text Processing

2011-12-20 Thread Jérôme
Tue, 20 Dec 2011 11:17:15 -0800 (PST)
Yigit Turgut wrote:

 Hi all,
 
 I have a text file containing such data ;
 
      A              B              C
 ---------------------------------------
 -2.0100e-01    8.000e-02    8.000e-05
 -2.0000e-01    0.000e+00    4.800e-04
 -1.9900e-01    4.000e-02    1.600e-04
 
 But I only need Section B, and I need to change the notation to ;
 
 8.000e-02 = 0.08
 0.000e+00 = 0.00
 4.000e-02 = 0.04
 
 Text file is approximately 10MB in size. I looked around to see if
 there is a quick and dirty workaround but there are lots of modules,
 lots of options.. I am confused.
 
 Which module is most suitable for this task ?

You could try to do it yourself.

You'd need to know what separates the data. A tab character? Spaces?

Example:

Input file
--

     A              B              C
---------------------------------------
-2.0100e-01    8.000e-02    8.000e-05
-2.0000e-01    0.000e+00    4.800e-04
-1.9900e-01    4.000e-02    1.600e-04


Python code
---

# Open file
with open('test1.plt', 'r') as f:

    b_values = []

    # skip as many lines as needed
    line = f.readline()
    line = f.readline()
    line = f.readline()

    while line:
        #start = line.find(u"\u0009", 0) + 1   # seek Tab
        start = line.find("    ", 0) + 4       # seek 4 spaces
        #end = line.find(u"\u0009", start)
        end = line.find("    ", start)
        b_values.append(float(line[start:end].strip()))
        line = f.readline()

print b_values

It gets trickier if the amount of spaces is not constant. I would then try
regular expressions. Perhaps a regexp would be more efficient in any case.
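
For instance (an untested sketch; it assumes each data line holds three
numbers in scientific notation, after two header lines):

import re

NUM_RE = re.compile(r'[-+]?\d*\.?\d+e[-+]\d+')

with open('test1.plt') as f:
    f.readline()                  # skip the header line
    f.readline()                  # skip the separator line
    b_values = [float(NUM_RE.findall(line)[1])
                for line in f if line.strip()]

print b_values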

-- 
Jérôme
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text Processing

2011-12-20 Thread Nick Dokos
Jérôme jer...@jolimont.fr wrote:

 Tue, 20 Dec 2011 11:17:15 -0800 (PST)
 Yigit Turgut wrote:
 
  Hi all,
  
  I have a text file containing such data ;
  
        A              B              C
  ---------------------------------------
  -2.0100e-01    8.000e-02    8.000e-05
  -2.0000e-01    0.000e+00    4.800e-04
  -1.9900e-01    4.000e-02    1.600e-04
  
  But I only need Section B, and I need to change the notation to ;
  
  8.000e-02 = 0.08
  0.000e+00 = 0.00
  4.000e-02 = 0.04
  
  Text file is approximately 10MB in size. I looked around to see if
  there is a quick and dirty workaround but there are lots of modules,
  lots of options.. I am confused.
  
  Which module is most suitable for this task ?
 
 You could try to do it yourself.
 

Does it have to be python? If not, I'd go with something similar to

   sed '1,2d' foo.data | awk '{printf("%.2f\n", $2);}'

Nick

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text Processing

2011-12-20 Thread Alexander Kapps

On 20.12.2011 22:04, Nick Dokos wrote:


I have a text file containing such data ;

     A              B              C
---------------------------------------
-2.0100e-01    8.000e-02    8.000e-05
-2.0000e-01    0.000e+00    4.800e-04
-1.9900e-01    4.000e-02    1.600e-04

But I only need Section B, and I need to change the notation to ;

8.000e-02 = 0.08
0.000e+00 = 0.00
4.000e-02 = 0.04



Does it have to be python? If not, I'd go with something similar to

sed '1,2d' foo.data | awk '{printf("%.2f\n", $2);}'



Why sed and awk:

awk 'NR>2 {printf("%.2f\n", $2);}' data.txt

And in Python:

f = open("data.txt")
f.readline()    # skip header
f.readline()    # skip header
for line in f:
    print "%.2f" % float(line.split()[1])
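
Same idea, but writing the result to a new file instead of printing it
(a sketch; needs Python 2.7+ for the two-manager with statement, and the
output filename is made up):

with open("data.txt") as src, open("col_b.txt", "w") as dst:
    src.readline()      # skip header
    src.readline()      # skip header
    for line in src:
        dst.write("%.2f\n" % float(line.split()[1]))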
--
http://mail.python.org/mailman/listinfo/python-list


Re: emacs lisp text processing example (html5 figure/figcaption)

2011-07-05 Thread Ian Kelly
On Mon, Jul 4, 2011 at 12:36 AM, Xah Lee xah...@gmail.com wrote:
 So, a solution by regex is out.

Actually, none of the complications you listed appear to exclude
regexes.  Here's a possible (untested) solution:

<div class="img">
((?:\s*<img src="[^.]+\.(?:jpg|png|gif)" alt="[^"]+" width="[0-9]+"
height="[0-9]+">)+)
\s*<p class="cpt">((?:[^<]|<(?!/p))+)</p>
\s*</div>

and corresponding replacement string:

<figure>
\1
<figcaption>\2</figcaption>
</figure>

I don't know what dialect Emacs uses for regexes; the above is the
Python re dialect.  I assume it is translatable.  If not, then the
above should at least work with other editors, such as Komodo's
Find/Replace in Files command.  I kept the line breaks here for
readability, but for completeness they should be stripped out of the
final regex.

The possibility of nested HTML in the caption is allowed for by using
a negative look-ahead assertion to accept any tag except a closing
</p>.  It would break if you had nested <p> tags, but then that would
be invalid html anyway.
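
One way to assemble the final pattern in Python, with the line breaks
stripped out as described (a sketch using adjacent string literals):

import re

pattern = re.compile(
    r'<div class="img">'
    r'((?:\s*<img src="[^.]+\.(?:jpg|png|gif)" alt="[^"]+"'
    r' width="[0-9]+" height="[0-9]+">)+)'
    r'\s*<p class="cpt">((?:[^<]|<(?!/p))+)</p>'
    r'\s*</div>')
replace = '<figure>\n\\1\n<figcaption>\\2</figcaption>\n</figure>'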

Cheers,
Ian
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: emacs lisp text processing example (html5 figure/figcaption)

2011-07-05 Thread Xah Lee
On Jul 4, 12:13 pm, S.Mandl stefanma...@web.de wrote:
 Nice. I guess that XSLT would be another (the official) approach for
 such a task.
 Is there an XSLT-engine for Emacs?

 -- Stefan

haven't used XSLT, and don't know if there's one in emacs...

it'd be nice if someone actually gave an example...

 Xah
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: emacs lisp text processing example (html5 figure/figcaption)

2011-07-05 Thread Xah Lee
On Jul 5, 12:17 pm, Ian Kelly ian.g.ke...@gmail.com wrote:
 On Mon, Jul 4, 2011 at 12:36 AM, Xah Lee xah...@gmail.com wrote:
  So, a solution by regex is out.

 Actually, none of the complications you listed appear to exclude
 regexes.  Here's a possible (untested) solution:

 <div class="img">
 ((?:\s*<img src="[^.]+\.(?:jpg|png|gif)" alt="[^"]+" width="[0-9]+"
 height="[0-9]+">)+)
 \s*<p class="cpt">((?:[^<]|<(?!/p))+)</p>
 \s*</div>

 and corresponding replacement string:

 <figure>
 \1
 <figcaption>\2</figcaption>
 </figure>

 I don't know what dialect Emacs uses for regexes; the above is the
 Python re dialect.  I assume it is translatable.  If not, then the
 above should at least work with other editors, such as Komodo's
 Find/Replace in Files command.  I kept the line breaks here for
 readability, but for completeness they should be stripped out of the
 final regex.

 The possibility of nested HTML in the caption is allowed for by using
 a negative look-ahead assertion to accept any tag except a closing
 </p>.  It would break if you had nested <p> tags, but then that would
 be invalid html anyway.

 Cheers,
 Ian

that's fantastic. Thanks! I'll try it out.

 Xah
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: emacs lisp text processing example (html5 figure/figcaption)

2011-07-05 Thread Xah Lee
On Jul 5, 12:17 pm, Ian Kelly ian.g.ke...@gmail.com wrote:
 On Mon, Jul 4, 2011 at 12:36 AM, Xah Lee xah...@gmail.com wrote:
  So, a solution by regex is out.

 Actually, none of the complications you listed appear to exclude
 regexes.  Here's a possible (untested) solution:

 <div class="img">
 ((?:\s*<img src="[^.]+\.(?:jpg|png|gif)" alt="[^"]+" width="[0-9]+"
 height="[0-9]+">)+)
 \s*<p class="cpt">((?:[^<]|<(?!/p))+)</p>
 \s*</div>

 and corresponding replacement string:

 <figure>
 \1
 <figcaption>\2</figcaption>
 </figure>

 I don't know what dialect Emacs uses for regexes; the above is the
 Python re dialect.  I assume it is translatable.  If not, then the
 above should at least work with other editors, such as Komodo's
 Find/Replace in Files command.  I kept the line breaks here for
 readability, but for completeness they should be stripped out of the
 final regex.

 The possibility of nested HTML in the caption is allowed for by using
 a negative look-ahead assertion to accept any tag except a closing
 </p>.  It would break if you had nested <p> tags, but then that would
 be invalid html anyway.

 Cheers,
 Ian

emacs regex supports shygroup (the 「(?:…)」) but it doesn't support the
negative assertion 「?!…」 though.

but in any case, i can't see how this part would work
<p class="cpt">((?:[^<]|<(?!/p))+)</p>

?

 Xah
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: emacs lisp text processing example (html5 figure/figcaption)

2011-07-05 Thread Ian Kelly
On Tue, Jul 5, 2011 at 2:37 PM, Xah Lee xah...@gmail.com wrote:
 but in any case, i can't see how this part would work
 <p class="cpt">((?:[^<]|<(?!/p))+)</p>

It's not that different from the pattern 「alt="[^"]+"」 earlier in the
regex.  The capture group accepts one or more characters that either
aren't '<', or that are '<' but are not immediately followed by '/p'.
Thus it stops capturing when it sees exactly '</p' without consuming
the '<'.  Using my regex with the example that you posted earlier
demonstrates that it works:

>>> import re
>>> s = '''<div class="img">
... <img src="jamie_cat.jpg" alt="jamie's cat" width="167" height="106">
... <p class="cpt">jamie's cat! Her blog is <a href="http://example.com/
... jamie/">http://example.com/jamie/</a></p>
... </div>'''
>>> print re.sub(pattern, replace, s)
<figure>
<img src="jamie_cat.jpg" alt="jamie's cat" width="167" height="106">
<figcaption>jamie's cat! Her blog is <a href="http://example.com/
jamie/">http://example.com/jamie/</a></figcaption>
</figure>

Cheers,
Ian
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: emacs lisp text processing example (html5 figure/figcaption)

2011-07-05 Thread S.Mandl
 haven't used XSLT, and don't know if there's one in emacs...

 it'd be nice if someone actually gave an example...


Hi Xah, actually I have to correct myself. HTML is not XML. If it
were, you could use a stylesheet like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="p[@class='cpt']">
  <figcaption>
    <xsl:value-of select="."/>
  </figcaption>
</xsl:template>

<xsl:template match="div[@class='img']">
  <figure>
    <xsl:apply-templates select="@*|node()"/>
  </figure>
</xsl:template>

<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>

which applied to this document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<doc>
<h1>Just having fun</h1>with all the
<div class="img">
  <img src="cat1.jpg" alt="my cat" width="200" height="200"/>
  <img src="cat2.jpg" alt="my cat" width="200" height="200"/>
  <p class="cpt">my 2 cats</p>
</div>
cats here:
<h1>Just fooling around</h1>
<div class="img">
  <img src="jamie_cat.jpg" alt="jamie's cat" width="167" height="106"/>

  <p class="cpt">jamie's cat! Her blog is <a href="http://example.com/
jamie/">http://example.com/jamie/</a></p>
</div>
</doc>

would yield:

<?xml version="1.0"?>
<doc>
<h1>Just having fun</h1>with all the
<figure class="img">
  <img src="cat1.jpg" alt="my cat" width="200" height="200"/>
  <img src="cat2.jpg" alt="my cat" width="200" height="200"/>
  <figcaption>my 2 cats</figcaption>
</figure>
cats here:
<h1>Just fooling around</h1>
<figure class="img">
  <img src="jamie_cat.jpg" alt="jamie's cat" width="167" height="106"/>

  <figcaption>jamie's cat! Her blog is http://example.com/jamie/</figcaption>
</figure>
</doc>

But well, as you don't have XML as input ... there really was no point
to my remark.
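
(If the input were XML, the stylesheet could also be applied from
Python; a minimal sketch using lxml, with made-up file names:)

from lxml import etree

transform = etree.XSLT(etree.parse("figure.xsl"))
result = transform(etree.parse("doc.xml"))
print str(result)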

Best,
Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


emacs lisp text processing example (html5 figure/figcaption)

2011-07-04 Thread Xah Lee
OMG, emacs lisp beats perl/python again!

Hiya all, another little emacs lisp tutorial from the tiny Xah's Edu
Corner.

〈Emacs Lisp: Processing HTML: Transform Tags to HTML5 “figure” and
“figcaption” Tags〉
xahlee.org/emacs/elisp_batch_html5_tag_transform.html

plain text version follows.

--

Emacs Lisp: Processing HTML: Transform Tags to HTML5 “figure” and
“figcaption” Tags

Xah Lee, 2011-07-03

Another triumph of using elisp for text processing over perl/python.


The Problem

--
Summary

I want to batch transform the image tags in 5 thousand html files to use
HTML5's new “figure” and “figcaption” tags.

I want to be able to view each change interactively, while optionally
give it a “go ahead” to do the whole job in batch.

Interactive eye-ball verification on many cases lets me be reasonably
sure the transform is done correctly. Yet i don't want to spend days
to think/write/test a mathematically correct program that otherwise
can be finished in 30 min with human interaction.

--
Detail

HTML5 has the following new tags: “figure” and “figcaption”. They are
used like this:

<figure>
<img src="cat.jpg" alt="my cat" width="167" height="106">
<figcaption>my cat!</figcaption>
</figure>

(For detail, see: HTML5 “figure” & “figcaption” Tags Browser
Support)

On my website, i used a similar structure. They look like this:

<div class="img">
<img src="cat.jpg" alt="my cat" width="167" height="106">
<p class="cpt">my cat!</p>
</div>

So, i want to replace them with HTML5's new tags. This can be done
with a regex. Here's the “find” regex:

<div class="img">
?<img src="\([^.]+?\)\.jpg" alt="\([^"]+?\)" width="\([0-9]+?\)"
height="\([0-9]+?\)">
?<p class="cpt">\([^<]+?\)</p>
?</div>

Here's the replacement string:

<figure>
<img src="\1.jpg" alt="\2" width="\3" height="\4">
<figcaption>\5</figcaption>
</figure>

Then, you can use “find-file” and dired's “dired-do-query-replace-
regexp” to work on your 5 thousand pages. Nice. (See: Emacs:
Interactively Find & Replace String Patterns on Multiple Files.)

However, the problem here is more complicated. The image file may be
jpg or png or gif. Also, there may be more than one image per group.
Also, the caption part may also contain complicated html. Here are some
examples:

<div class="img">
<img src="cat1.jpg" alt="my cat" width="200" height="200">
<img src="cat2.jpg" alt="my cat" width="200" height="200">
<p class="cpt">my 2 cats</p>
</div>

<div class="img">
<img src="jamie_cat.jpg" alt="jamie's cat" width="167" height="106">
<p class="cpt">jamie's cat! Her blog is <a href="http://example.com/
jamie/">http://example.com/jamie/</a></p>
</div>

So, a solution by regex is out.


Solution

The solution is pretty simple. Here's the major steps:

Use “find-lisp-find-files” to traverse a dir.
For each file, open it.
Search for the string <div class="img">
Use “sgml-skip-tag-forward” to jump to its closing tag.
Save these tags' begin/end positions.
Ask user if she wants to replace. If so, do it. (using
“delete-region” and “insert”)
Repeat.

Here's the code:

;; -*- coding: utf-8 -*-
;; 2011-07-03
;; replace image tags to use html5's “figure”  and “figcaption” tags.

;; Example. This:
;; <div class="img">…</div>
;; should become this
;; <figure>…</figure>

;; do this for all files in a dir.

;; rough steps:
;; find the <div class="img">
;; use sgml-skip-tag-forward to move to the ending tag.
;; save their positions.


(defun my-process-file (fpath)
  "process the file at fullpath FPATH ..."
  (let (mybuff p1 p2 p3 p4 )
    (setq mybuff (find-file fpath))

    (widen)
    (goto-char 0) ;; in case buffer already open

    (while (search-forward "<div class=\"img\">" nil t)
      (progn
        (setq p2 (point) )
        (backward-char 17) ; beginning of “div” tag
        (setq p1 (point) )

        (forward-char 1)
        (sgml-skip-tag-forward 1) ; move to the closing tag
        (setq p4 (point) )
        (backward-char 6) ; beginning of the closing div tag
        (setq p3 (point) )
        (narrow-to-region p1 p4)

        (when (y-or-n-p "replace?")
          (progn
            (delete-region p3 p4 )
            (goto-char p3)
            (insert "</figure>")

            (delete-region p1 p2 )
            (goto-char p1)
            (insert "<figure>")
            (widen) ) ) ) )

    (when (not (buffer-modified-p mybuff)) (kill-buffer mybuff) )

    ) )

(require 'find-lisp)


(let (outputBuffer)
  (setq outputBuffer "*xah img/figure replace output*" )
  (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file
          (find-lisp-find-files "~/web/xahlee_org/emacs/" "\\.html$"))
    (princ "Done deal!")
    ) )

Seems pretty simple, right?

The “p1” and “p2” variables are the positions of the start/end of <div
class="img">. The “p3” and “p4” are the start/end of its closing tag
</div>.

We also used a little trick with “widen” and “narrow-to-region”. It
lets me see just the part i'm interested in. It narrows to the
beginning/end of the div.img. This makes eye-balling a bit easier.

The real time

Re: emacs lisp text processing example (html5 figure/figcaption)

2011-07-04 Thread S.Mandl
Nice. I guess that XSLT would be another (the official) approach for
such a task.
Is there an XSLT-engine for Emacs?

-- Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


Is text processing with dicts a good use case for Python cross-compilers like Cython/Pyrex or ShedSkin?

2010-12-16 Thread python
Is text processing with dicts a good use case for Python
cross-compilers like Cython/Pyrex or ShedSkin? (I've read the
cross compiler claims about massive increases in pure numeric
performance).

I have 3 use cases I'm considering for Python-to-C++
cross-compilers for generating 32-bit Python extension modules
for Python 2.7 for Windows.

1. Parsing UTF-8 files (basic Python with lots of string
processing and dict lookups)

2. Generating UTF-8 files from nested list/dict structures

3. Parsing large ASCII CSV-like files and using dicts to
calculate simple statistics like running totals, min, max, etc.

Are any of these text processing scenarios good use cases for
tools like Cython, Pyrex, or ShedSkin? Are any of these
specifically bad use cases for these tools?

We've tried Psyco and it has sped up some of our parsing
utilities by 200%. But Psyco doesn't support Python 2.7 yet and
we're committed to using Python 2.7 moving forward.

Malcolm
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is text processing with dicts a good use case for Python cross-compilers like Cython/Pyrex or ShedSkin?

2010-12-16 Thread Stefan Behnel

pyt...@bdurham.com, 16.12.2010 21:03:

Is text processing with dicts a good use case for Python
cross-compilers like Cython/Pyrex or ShedSkin? (I've read the
cross compiler claims about massive increases in pure numeric
performance).


Cython is generally a good choice for string processing, simply because it 
can drop a lot of code into plain C, such as character iteration and 
comparison. Depending on what kind of operations you do, you can get 
speed-ups of 100x or more for that.


http://docs.cython.org/src/tutorial/strings.html

However, when it comes to dict lookups, it uses CPython's own dicts which 
are heavily optimised for string lookups already. So the speedup in that 
area will likely stay below 30%. Similarly, encoding and decoding use 
Python's codecs, so don't expect a major difference there.




I have 3 use cases I'm considering for Python-to-C++
cross-compilers for generating 32-bit Python extension modules
for Python 2.7 for Windows.

1. Parsing UTF-8 files (basic Python with lots of string
processing and dict lookups)


Parsing sounds like something that could easily benefit from Cython 
compilation.




2. Generating UTF-8 files from nested list/dict structures


That should be much faster in Cython, too, simply because iteration on 
builtin types is much faster than in Python.




3. Parsing large ASCII CSV-like files and using dict's to
calculate simple statistics like running totals, min, max, etc.


Again, parsing will be much faster, especially when reading from raw C 
files (which would also enable freeing the GIL, in case you want to use 
multi-threading). The rest may not win that much.


A nice feature of Cython is that you do not have to go low-level right 
away. You can use all the niceness of Python, and only push the code closer 
to C level where your benchmarks point you. And if you really have to go 
all the way down to C, it's just a declaration away.
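
(An illustrative sketch of that incremental style: the function below is
plain Python that Cython compiles unchanged; in a .pyx file, a single
declaration such as "cdef long total" would then push the hot loop
toward C. The CSV layout here is made up.)

def running_total(lines):
    # sum the third comma-separated field of every line
    total = 0
    for line in lines:
        total += int(line.split(',')[2])
    return total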




Are any of these text processing scenarios good use cases for
tools like Cython, Pyrex, or ShedSkin? Are any of these
specifically bad use cases for these tools?


Pyrex isn't worth trying here, simply because you'd have to invest a lot 
more work to make it as fast as what Cython gives you anyway. ShedSkin may 
be worth a try, depending on how well you get your ShedSkin module 
integrated with CPython. (It seems that it has support for building 
extension modules by now, but I have no idea how well that is fleshed out).




We've tried Psyco and it has sped up some of our parsing
utilities by 200%. But Psyco doesn't support Python 2.7 yet and
we're committed to using Python 2.7 moving forward.


If 3x is not enough for you, I strongly suggest you try Cython. The C code 
that it generates compiles nicely in all major Python versions, currently 
from 2.3 to 3.2.


Stefan

--
http://mail.python.org/mailman/listinfo/python-list


Simple Text Processing

2009-09-10 Thread AJAskey
New to Python.  I can solve the problem in perl by using split() to
an array.  Can't figure it out in Python.

I'm reading variable lines of text.  I want to use the first number I
find.  The problem is the lines are variable.

Input example:
  this is a number: 1
  here are some numbers 1 2 3 4

In both lines I am only interested in the 1.  I can't figure out how
to use split() as it appears to make me know how many space
separated words are in the line.  I do not know this.

I use:  a,b,c,e = split() to get the first line in the example.  The
second line causes a runtime exception.  Can I use split for this?
Is there another simple way to break the words into an array that I
can loop over?

Thanks.
Andy
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Simple Text Processing

2009-09-10 Thread Benjamin Kaplan
On Thu, Sep 10, 2009 at 11:36 AM, AJAskey aske...@gmail.com wrote:

 New to Python.  I can solve the problem in perl by using split() to
 an array.  Can't figure it out in Python.

 I'm reading variable lines of text.  I want to use the first number I
 find.  The problem is the lines are variable.

 Input example:
  this is a number: 1
  here are some numbers 1 2 3 4

 In both lines I am only interested in the 1.  I can't figure out how
 to use split() as it appears to make me know how many space
 separated words are in the line.  I do not know this.

 I use:  a,b,c,e = split() to get the first line in the example.  The
 second line causes a runtime exception.  Can I use split for this?
 Is there another simple way to break the words into an array that I
 can loop over?

>>> line = "here are some numbers 1 2 3 4"
>>> a = line.split()
>>> a
['here', 'are', 'some', 'numbers', '1', '2', '3', '4']
>>> #Python 3 only
... a,b,c,d,*e = line.split()
>>> e
['1', '2', '3', '4']
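
(In Python 2, where starred unpacking isn't available, a slice gives the
same tail, and a filter finds "the first number" directly; a sketch:)

>>> e = line.split()[4:]
>>> e
['1', '2', '3', '4']
>>> [tok for tok in "this is a number: 1".split() if tok.isdigit()][0]
'1'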




 Thanks.
 Andy
 --
 http://mail.python.org/mailman/listinfo/python-list

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Simple Text Processing

2009-09-10 Thread AJAskey
Never mind.  I guess I had been trying to make it more difficult than
it is.  As a note, I can work on something for 10 hours and not figure
it out.  But the second I post to a group, then I immediately figure
it out myself. Strange snake this Python...

Example for anyone else interested:

line = "this is a line"
print line
a = line.split()
print a
print a[0]
print a[1]
print a[2]
print a[3]

--
OUTPUT:

this is a line
['this', 'is', 'a', 'line']
this
is
a
line



On Sep 10, 11:36 am, AJAskey aske...@gmail.com wrote:
 New to Python.  I can solve the problem in perl by using split() to
 an array.  Can't figure it out in Python.

 I'm reading variable lines of text.  I want to use the first number I
 find.  The problem is the lines are variable.

 Input example:
   this is a number: 1
   here are some numbers 1 2 3 4

 In both lines I am only interested in the 1.  I can't figure out how
 to use split() as it appears to make me know how many space
 separated words are in the line.  I do not know this.

 I use:  a,b,c,e = split() to get the first line in the example.  The
 second line causes a runtime exception.  Can I use split for this?
 Is there another simple way to break the words into an array that I
 can loop over?

 Thanks.
 Andy

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: text processing SOLVED

2008-09-27 Thread [EMAIL PROTECTED]

Thanks BlackJack.
Working.
--
http://mail.python.org/mailman/listinfo/python-list


Re: text processing

2008-09-25 Thread Marc 'BlackJack' Rintsch
On Thu, 25 Sep 2008 15:51:28 +0100, [EMAIL PROTECTED] wrote:

 I have string like follow
 12560/ABC,12567/BC,123,567,890/JK
 
 I want above string to group like as follow (12560,ABC)
 (12567,BC)
 (123,567,890,JK)
 
 i try regular expression i am able to get first two not the third one.
 can regular expression given data in different groups

Without regular expressions:

def group(string):
    result = list()
    for item in string.split(','):
        if '/' in item:
            result.extend(item.split('/'))
            yield tuple(result)
            result = list()
        else:
            result.append(item)

def main():
    string = '12560/ABC,12567/BC,123,567,890/JK'
    print list(group(string))
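
(For reference, the expected result:)

>>> list(group('12560/ABC,12567/BC,123,567,890/JK'))
[('12560', 'ABC'), ('12567', 'BC'), ('123', '567', '890', 'JK')]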

Ciao,
Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list


Re: text processing

2008-09-25 Thread kib2

You can do it with regexps too :

--
import re
to_watch = re.compile(r"(?P<number>\d+)[/](?P<letter>[A-Z]+)")

final_list = to_watch.findall("12560/ABC,12567/BC,123,567,890/JK")

for number,word in final_list :
    print "number:%s -- word: %s" % (number,word)
--

the output is :

number:12560 -- word: ABC
number:12567 -- word: BC
number:890 -- word: JK

See you,

Kib².
--
http://mail.python.org/mailman/listinfo/python-list

Re: text processing

2008-09-25 Thread MRAB
On Sep 25, 6:34 pm, Marc 'BlackJack' Rintsch [EMAIL PROTECTED] wrote:
 On Thu, 25 Sep 2008 15:51:28 +0100, [EMAIL PROTECTED] wrote:
  I have string like follow
  12560/ABC,12567/BC,123,567,890/JK

  I want above string to group like as follow (12560,ABC)
  (12567,BC)
  (123,567,890,JK)

  i try regular expression i am able to get first two not the third one.
  can regular expression given data in different groups

 Without regular expressions:

 def group(string):
     result = list()
     for item in string.split(','):
         if '/' in item:
             result.extend(item.split('/'))
             yield tuple(result)
             result = list()
         else:
             result.append(item)

 def main():
     string = '12560/ABC,12567/BC,123,567,890/JK'
     print list(group(string))

How about:

>>> string = "12560/ABC,12567/BC,123,567,890/JK"
>>> r = re.findall(r"(\d+(?:,\d+)*/\w+)", string)
>>> r
['12560/ABC', '12567/BC', '123,567,890/JK']
>>> [tuple(x.replace(",", "/").split("/")) for x in r]
[('12560', 'ABC'), ('12567', 'BC'), ('123', '567', '890', 'JK')]
--
http://mail.python.org/mailman/listinfo/python-list


Re: text processing

2008-09-25 Thread Paul McGuire
On Sep 25, 9:51 am, [EMAIL PROTECTED] [EMAIL PROTECTED]
wrote:
 I have string like follow
 12560/ABC,12567/BC,123,567,890/JK

 I want above string to group like as follow
 (12560,ABC)
 (12567,BC)
 (123,567,890,JK)

 i try regular expression i am able to get first two not the third one.
 can regular expression given data in different groups

Looks like each item is:
- a list of 1 or more integers, in a comma-delimited list
- a slash
- a word composed of alpha characters

And the whole thing is a list of items in a comma-delimited list

Now to implement that in pyparsing:

>>> data = "12560/ABC,12567/BC,123,567,890/JK"
>>> from pyparsing import Suppress, delimitedList, Word, alphas, nums, Group
>>> SLASH = Suppress('/')
>>> dataitem = delimitedList(Word(nums)) + SLASH + Word(alphas)
>>> dataformat = delimitedList(Group(dataitem))
>>> map(tuple, dataformat.parseString(data))
[('12560', 'ABC'), ('12567', 'BC'), ('123', '567', '890', 'JK')]

Wah-lah! (as one of my wife's 1st graders announced in one of his
school papers)

-- Paul


--
http://mail.python.org/mailman/listinfo/python-list


emacs lisp as text processing language...

2007-10-29 Thread Xah Lee
Text Processing with Emacs Lisp

Xah Lee, 2007-10-29

This page gives an outline of how to use emacs lisp to do text
processing, using a specific real-world problem as example. If you
don't know elisp, first take a gander at Emacs Lisp Basics.

HTML version with links and colors is at:
http://xahlee.org/emacs/elisp_text_processing.html

Following this post as a separate post, is some relevant (i hope)
remark about Perl and Python.

-
THE PROBLEM


Summary

I want to write an elisp program that processes a list of given files.
Each file is a HTML file. For each file, i want to remove the link to
itself, in its page navigation bar. More specifically, each file has
a page navigation bar in this format:

<div class="pages">Goto Page: <a href="1.html">1</a>, <a
href="2.html">2</a>, <a href="3.html">3</a>, <a href="4.html">4</a>,
...</div>.

where the file names and link texts are all arbitrary. (not as 1, 2, 3
shown here.) The link to itself needs to be removed.


Detail

My website has over 3 thousand files; many of the pages are in a
series. For example, i have an article on Algorithmic Mathematical
Art, which is broken into 3 HTML pages. So, at the bottom of each
page, i have a page navigation bar with code like this:

<div class="pages">Goto Page: <a href="20040113_cmaci_larcu.html">1</
a>, <a href="cmaci_larcu2.html">2</a>, <a href="cmaci_larcu3.html">3</
a></div>

In a browser, it would look like this:
[image: the rendered page navigation bar]

Note that the link to the page itself really shouldn't be a link.

There are a total of 134 pages scattered about in various directories
that have this page navigation bar. I need some automated way to
process these files and remove the self-link.

I've been programming in perl professionally from 1998 to 2002 full
time. Typically, for this task in perl (or Python), i'd open each
file, read in the file, then use regex to do the replacement, then
write out the file. For replacement that span over several lines, the
regex needs to act on the whole file (as opposed to one line at a
time). The regex can become quite complex or reaching its limit. For a
more robust solution, a XML/HTML parser package can be used to read in
the file into a structured representation, then process that. Using a
HTML parser is a bit involved. Then, as usual, one may need to create
backups of the original files, and also deal with maintaining the
file's meta info such as keeping the same permission bits. In summary,
if the particular text-processing required is not simple, then the
coding gets fairly complex quickly, even if job is trivial in
principle.

With emacs lisp, the task is vastly simplified, because emacs reads in
a file into its buffer representation. With buffers, one can move a
pointer back and forth, search and delete or insert text arbitrarily,
with the entire emacs lisp suite of functions designed for
processing text, as well as the entire emacs environment that
automatically deals with file maintenance (symbolic links, hard
links, auto-backup system, file meta-info maintenance, file locking,
remote files... etc).

We proceed to write elisp code to solve this problem.

-
SOLUTION

Here are the steps we need to do for each file:

* open the file in a buffer
* move cursor to the page navigation text.
* move cursor to file name
* run sgml-delete-tag (removes the link)
* save file
* close buffer

We begin by writing a test code to process a single file.

(defun xx ()
  "temp. experimental code"
  (interactive)
  (let (fpath fname mybuffer)
    (setq fpath "/Users/xah/test1.html")
    (setq fname (file-name-nondirectory fpath))
    (setq mybuffer (find-file fpath))
    (search-forward "<div class=\"pages\">Goto Page:")
    (search-forward fname)
    (sgml-delete-tag 1)
    (save-buffer)
    (kill-buffer mybuffer)))

First of all, create files test1.html, test2.html, test3.html in a
temp directory for testing this code. Each file will contain this page
navigation line:

<div class="pages">Goto Page: <a href="test1.html">some1</a>, <a
href="test2.html">another</a>, <a href="test3.html">xyz3</a></div>

Note that in actual files, the page-nav string may not be in a single
line.

The elisp code above is fairly simple and self-explanatory. The file
opening function find-file is found in the elisp doc section “Files”.
The cursor moving function search-forward is in “Searching and
Matching”; the save and close buffer functions are in section “Buffers”.

Reference: Elisp Manual: Files.

Reference: Elisp Manual: Buffers.

Reference: Elisp Manual: Searching-and-Matching.

The interesting part is calling the function sgml-delete-tag. It is a
function loaded by html-mode (which is automatically loaded when a
html file is opened). What sgml-delete-tag does is delete the tag
that encloses the cursor (both the opening and closing tags will be
deleted). The cursor can be anywhere from the beginning angle bracket
of the opening tag to the ending angle bracket of the closing tag. This sgml-
delete-tag function

Re: emacs lisp as text processing language...

2007-10-29 Thread Xah Lee
... continued from previous post.

PS I'm cross-posting this post to perl and python groups because i
find it a little-known fact that emacs lisp's power in the
area of text processing is far beyond Perl (or Python).

... i worked as a professional perl programmer since 1998. I started to
study elisp as a hobby since 2005. (i started to use emacs daily since
1998) It is only today, while i was studying elisp's file and buffer
related functions, that i realized how elisp can be used as a general
text processing language, and in fact is a dedicated language for this
task, with powers quite beyond Perl (or Python, PHP (Ruby, java, c
etc) etc).

This realization surprised me, because it is well-known that Perl is
the de facto language for text processing, and emacs lisp for this is
almost unknown (outside of elisp developers). The surprise was
exacerbated by the fact that Emacs Lisp existed before perl by almost
a decade. (Albeit Emacs Lisp is not suitable for writing general
applications.)

My study of lisp as a text processing tool today reminds me of an
article i read in 2000: “Ilya Regularly Expresses”, an interview
with Dr Ilya Zakharevich (author of cperl-mode.el and a major
contributor to the Perl language). In the article, he mentioned
something about Perl's lack of text processing primitives that are in
emacs, which i did not fully understand at the time. (i don't know
elisp at the time)

The article is at:
http://www.perl.com/lpt/a/2000/09/ilya.html

Here's the relevant excerpt:
«
Let me also mention that classifying the text handling facilities of
Perl as extremely agile gives me the willies. Perl's regular
expressions are indeed more convenient than in other languages.
However, the lack of a lot of key text-processing ingredients makes
Perl solutions for many averagely complicated tasks either extremely
slow, or not easier to maintain than solutions in other languages (and
in some cases both).

I wrote a (heuristic-driven) Perlish syntax parser and transformer in
Emacs Lisp, and though Perl as a language is incomparably friendlier
than Lisps, I would not be even able of thinking about rewriting this
tool in Perl: there are just not enough text-handling primitives
hardwired into Perl. I will need to code all these primitives first.
And having these primitives coded in Perl, the solution would turn out
to be (possibly) hundreds times slower than the built-in Emacs
operations.

My current conjecture on why people classify Perl as an agile text-
handler (in addition to obvious traits of false advertisements) is
that most of the problems to handle are more or less trivial (system
maintenance-type problems). For such problems Perl indeed shines. But
between having simple solutions for simple problems and having it
possible to solve complicated problems, there is a principle of having
moderately complicated solutions for moderately complicated problems.
There is no reason for Perl to be not capable of satisfying this
requirement, but currently Perl needs improvement in this regard.
»

  Xah
  [EMAIL PROTECTED]
∑ http://xahlee.org/

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Simple Text Processing Help

2007-10-17 Thread Tim Roberts
[EMAIL PROTECTED] wrote:

And now for something completely different...

I've been reading up a bit about Python and Excel and I quickly told
the program to output to Excel quite easily.  However, what if the
input file were a Word document?  I can't seem to find much
information about parsing Word files.  What could I add to make the
same program work for a Word file?

Word files are not human-readable.  You parse them using
Dispatch("Word.Application"), just the way you wrote the Excel file.

I believe there are some third-party modules that will read a Word file a
little more directly.
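
(An untested sketch of the COM route, using standard Word object-model
calls; the path is made up:)

from win32com.client import Dispatch

word = Dispatch("Word.Application")
doc = word.Documents.Open(r"c:\text_samples\chem.doc")
text = doc.Content.Text          # the whole document as one string
doc.Close(False)
word.Quit()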
-- 
Tim Roberts, [EMAIL PROTECTED]
Providenza  Boekelheide, Inc.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Simple Text Processing Help

2007-10-16 Thread Peter Otten
patrick.waldo wrote:

 manipulation?  Also, I conceptually get it, but would you mind walking
 me through

 for key, group in groupby(instream, unicode.isspace):
     if not key:
         yield "".join(group)

itertools.groupby() splits a sequence into groups with the same key; e. g.
to group names by their first letter you'd do the following:

>>> def first_letter(s): return s[:1]
...
>>> for key, group in groupby(["Anne", "Andrew", "Bill", "Brett", "Alex"],
...                           first_letter):
...     print "--- %s ---" % key
...     for item in group:
...         print item
...
--- A ---
Anne
Andrew
--- B ---
Bill
Brett
--- A ---
Alex

Note that there are two groups with the same initial; groupby() considers
only consecutive items in the sequence for the same group.

In your case the sequence are the lines in the file, converted to unicode
strings -- the key is a boolean indicating whether the line consists
entirely of whitespace or not,

>>> u"\n".isspace()
True
>>> u"alpha\n".isspace()
False

but I call it slightly differently, as an unbound method:

>>> unicode.isspace(u"alpha\n")
False

This is only possible because all items in the sequence are known to be
unicode instances. So far we have, using a list instead of a file:

>>> instream = [u"alpha\n", u"beta\n", u"\n", u"gamma\n", u"\n", u"\n",
...             u"delta\n"]
>>> for key, group in groupby(instream, unicode.isspace):
...     print "--- %s ---" % key
...     for item in group:
...         print repr(item)
...
--- False ---
u'alpha\n'
u'beta\n'
--- True ---
u'\n'
--- False ---
u'gamma\n'
--- True ---
u'\n'
u'\n'
--- False ---
u'delta\n'

As you see, groups with real data alternate with groups that contain only
blank lines, and the key for the latter is True, so we can skip them with

if not key: # it's not a separator group
    yield group

As the final refinement we join all lines of the group into a single
string

>>> "".join(group)
u'alpha\nbeta\n'

and that's it.

Peter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Simple Text Processing Help

2007-10-16 Thread patrick . waldo
And now for something completely different...

I see a lot of COM stuff with Python for excel...and I quickly made
the same program output to excel.  What if the input file were a Word
document?  Where is there information about manipulating word
documents, or what could I add to make the same program work for word?

Again thanks a lot.  I'll start hitting some books about this sort of
text manipulation.

The Excel add on:

import codecs
import re
from win32com.client import Dispatch

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

NR_RE = re.compile(r'^\d+-\d+-\d+$')   # pattern for EINECS number

tokens = input.read().split()
def iter_elements(tokens):
    product = []
    for tok in tokens:
        if NR_RE.match(tok) and len(product) >= 4:
            product[2:-1] = [' '.join(product[2:-1])]
            yield product
            product = []
        product.append(tok)
    yield product

xlApp = Dispatch("Excel.Application")
xlApp.Visible = 1
xlApp.Workbooks.Add()
c = 1

for element in iter_elements(tokens):
    xlApp.ActiveSheet.Cells(c,1).Value = element[0]
    xlApp.ActiveSheet.Cells(c,2).Value = element[1]
    xlApp.ActiveSheet.Cells(c,3).Value = element[2]
    xlApp.ActiveSheet.Cells(c,4).Value = element[3]
    c = c + 1

xlApp.ActiveWorkbook.Close(SaveChanges=1)
xlApp.Quit()
xlApp.Visible = 0
del xlApp

input.close()
output.close()

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Simple Text Processing Help

2007-10-16 Thread patrick . waldo
And now for something completely different...

I've been reading up a bit about Python and Excel and I quickly told
the program to output to Excel quite easily.  However, what if the
input file were a Word document?  I can't seem to find much
information about parsing Word files.  What could I add to make the
same program work for a Word file?

Again thanks a lot.

And the Excel Add on...

import codecs
import re
from win32com.client import Dispatch

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

NR_RE = re.compile(r'^\d+-\d+-\d+$')   # pattern for EINECS number

tokens = input.read().split()
def iter_elements(tokens):
    product = []
    for tok in tokens:
        if NR_RE.match(tok) and len(product) >= 4:
            product[2:-1] = [' '.join(product[2:-1])]
            yield product
            product = []
        product.append(tok)
    yield product

xlApp = Dispatch("Excel.Application")
xlApp.Visible = 1
xlApp.Workbooks.Add()
c = 1

for element in iter_elements(tokens):
    xlApp.ActiveSheet.Cells(c,1).Value = element[0]
    xlApp.ActiveSheet.Cells(c,2).Value = element[1]
    xlApp.ActiveSheet.Cells(c,3).Value = element[2]
    xlApp.ActiveSheet.Cells(c,4).Value = element[3]
    c = c + 1

xlApp.ActiveWorkbook.Close(SaveChanges=1)
xlApp.Quit()
xlApp.Visible = 0
del xlApp

input.close()
output.close()

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Simple Text Processing Help

2007-10-15 Thread patrick . waldo
 lines = open('your_file.txt').readlines()[:4]
 print lines
 print map(len, lines)

gave me:
['\xef\xbb\xbf200-720-7        69-93-2\n', 'kyselina mo\xc4\x8dov
\xc3\xa1      C5H4N4O3\n', '\n', '200-001-8\t50-00-0\n']
[28, 32, 1, 18]

I think it means that I'm still at option 3.  I got the line by line
part.  My code is a lot cleaner now:

import codecs

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

for line in input:
    tokens = line.strip().split()
    tokens[2:-1] = [u' '.join(tokens[2:-1])]   # this doesn't seem to
                                               # combine the fields correctly
    file = u'|'.join(tokens)                   # this does put '|' in between
    print file + u'\n'
    output.write(file + u'\r\n')

input.close()
output.close()

my sample input file looks like this (not organized, as you see it):
200-720-7   69-93-2
kyselina mocová  C5H4N4O3

200-001-8   50-00-0
formaldehyd  CH2O

200-002-3
50-01-1
guanidínium-chlorid  CH5N3.ClH

etc...

and after the program I get:

200-720-7|69-93-2|
kyselina|mocová||C5H4N4O3

200-001-8|50-00-0|
formaldehyd|CH2O|

200-002-3|
50-01-1|
guanidínium-chlorid|CH5N3.ClH|

etc...
So, I am sort of back at the start again.

If I add:

tokens = line.strip().split()
for token in tokens:
    print token

I get all the single tokens, which I thought I could then put
together, except when I did:

for token in tokens:
    s = u'|'.join(token)
    print s

I got ?|2|0|0|-|7|2|0|-|7, etc...

How can I join these together into nice neat little lines?  When I try
to store the tokens in a list, the tokens double and I don't know
why.  I can work on getting the chemical names together after...baby
steps, or maybe I am just missing something obvious.  The first two
numbers will always be the same three digits-three digits-one digit
and then two digits-two digits-one digit.

My intuition tells me that I need to add an if statement that says, if
the first two numbers follow the pattern, then continue, if they don't
(ie a chemical name was accidentally split apart) then the third entry
needs to be put together.  Something like
if tokens.startswith('pattern') == true


Again, thanks so much.  I've gone to http://gnosis.cx/TPiP/ and I have
a couple O'Reilly books, but they don't seem to have a straightforward
example for this kind of text manipulation.

Patrick


On Oct 14, 11:17 pm, John Machin [EMAIL PROTECTED] wrote:
 On Oct 14, 11:48 pm, [EMAIL PROTECTED] wrote:



  Hi all,

  I started Python just a little while ago and I am stuck on something
  that is really simple, but I just can't figure out.

  Essentially I need to take a text document with some chemical
  information in Czech and organize it into another text file.  The
  information is always EINECS number, CAS, chemical name, and formula
  in tables.  I need to organize them into lines with | in between.  So
  it goes from:

  200-763-1 71-73-8
  nátrium-tiopentál   C11H18N2O2S.Na   to:

  200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na

  but if I have a chemical like: kyselina močová

  I get:
  200-720-7|69-93-2|kyselina|močová
  |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál

  and then it is all off.

  How can I get Python to realize that a chemical name may have a space
  in it?

 Your input file could be in one of THREE formats:
 (1) fields are separated by TAB characters (represented in Python by
 the escape sequence '\t', and equivalent to '\x09')
 (2) fields are fixed width and padded with spaces
 (3) fields are separated by a random number of whitespace characters
 (and can contain spaces).

 What makes you sure that you have format 3? You might like to try
 something like
 lines = open('your_file.txt').readlines()[:4]
 print lines
 print map(len, lines)
 This will print a *precise* representation of what is in the first
 four lines, plus their lengths. Please show us the output.


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Simple Text Processing Help

2007-10-15 Thread patrick . waldo
 lines = open('your_file.txt').readlines()[:4]
 print lines
 print map(len, lines)

gave me:
['\xef\xbb\xbf200-720-7        69-93-2\n', 'kyselina mo\xc4\x8dov
\xc3\xa1      C5H4N4O3\n', '\n', '200-001-8\t50-00-0\n']
[28, 32, 1, 18]

I think it means that I'm still at option 3.  I got the line by line
part.  My code is a lot cleaner now:

import codecs

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

for line in input:
    tokens = line.strip().split()
    tokens[2:-1] = [u' '.join(tokens[2:-1])]   # this doesn't seem to
                                               # combine the fields correctly
    file = u'|'.join(tokens)                   # this does put '|' in between
    print file + u'\n'
    output.write(file + u'\r\n')

input.close()
output.close()

my sample input file looks like this (not organized, as you see it):
200-720-7   69-93-2
kyselina mocová  C5H4N4O3

200-001-8   50-00-0
formaldehyd  CH2O

200-002-3
50-01-1
guanidínium-chlorid  CH5N3.ClH

etc...

and after the program I get:

200-720-7|69-93-2|
kyselina|mocová||C5H4N4O3

200-001-8|50-00-0|
formaldehyd|CH2O|

200-002-3|
50-01-1|
guanidínium-chlorid|CH5N3.ClH|

etc...
So, I am sort of back at the start again.

If I add:

tokens = line.strip().split()
for token in tokens:
    print token

I get all the single tokens, which I thought I could then put
together, except when I did:

for token in tokens:
    s = u'|'.join(token)
    print s

I got ?|2|0|0|-|7|2|0|-|7, etc...

How can I join these together into nice neat little lines?  When I try
to store the tokens in a list, the tokens double and I don't know
why.  I can work on getting the chemical names together after...baby
steps, or maybe I am just missing something obvious.  The first two
numbers will always be the same three digits-three digits-one digit
and then two digits-two digits-one digit.  This seems to be on the
only pattern.

My intuition tells me that I need to add an if statement that says, if
the first two numbers follow the pattern, then continue, if they don't
(ie a chemical name was accidentally split apart) then the third entry
needs to be put together.  Something like

if tokens[1] and tokens[2] startswith('pattern') == true
tokens[2] = join(tokens[2]:tokens[3])
token[3] = token[4]
del token[4]

but the code isn't right...any ideas?

Again, thanks so much.  I've gone to http://gnosis.cx/TPiP/ and I have
a couple O'Reilly books, but they don't seem to have a straightforward
example for this kind of text manipulation.

Patrick

On Oct 14, 11:17 pm, John Machin [EMAIL PROTECTED] wrote:
 On Oct 14, 11:48 pm, [EMAIL PROTECTED] wrote:



  Hi all,

  I started Python just a little while ago and I am stuck on something
  that is really simple, but I just can't figure out.

  Essentially I need to take a text document with some chemical
  information in Czech and organize it into another text file.  The
  information is always EINECS number, CAS, chemical name, and formula
  in tables.  I need to organize them into lines with | in between.  So
  it goes from:

  200-763-1 71-73-8
  nátrium-tiopentál   C11H18N2O2S.Na   to:

  200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na

  but if I have a chemical like: kyselina močová

  I get:
  200-720-7|69-93-2|kyselina|močová
  |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál

  and then it is all off.

  How can I get Python to realize that a chemical name may have a space
  in it?

 Your input file could be in one of THREE formats:
 (1) fields are separated by TAB characters (represented in Python by
 the escape sequence '\t', and equivalent to '\x09')
 (2) fields are fixed width and padded with spaces
 (3) fields are separated by a random number of whitespace characters
 (and can contain spaces).

 What makes you sure that you have format 3? You might like to try
 something like
 lines = open('your_file.txt').readlines()[:4]
 print lines
 print map(len, lines)
 This will print a *precise* representation of what is in the first
 four lines, plus their lengths. Please show us the output.


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Simple Text Processing Help

2007-10-15 Thread Marc 'BlackJack' Rintsch
On Mon, 15 Oct 2007 10:47:16 +, patrick.waldo wrote:

 my sample input file looks like this( not organized,as you see it):
 200-720-7   69-93-2
 kyselina mocová  C5H4N4O3
 
 200-001-8   50-00-0
 formaldehyd  CH2O
 
 200-002-3
 50-01-1
 guanidínium-chlorid  CH5N3.ClH
 
 etc...

That's quite irregular so it is not that straightforward.  One way is to
split everything into words, start a record by taking the first two
elements and then look for the start of the next record that looks like
three numbers concatenated by '-' characters.  Quick and dirty hack:

import codecs
import re

NR_RE = re.compile(r'^\d+-\d+-\d+$')

def iter_elements(tokens):
    tokens = iter(tokens)
    try:
        nr_a = tokens.next()
        while True:
            nr_b = tokens.next()
            items = list()
            for item in tokens:
                if NR_RE.match(item):
                    yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])
                    nr_a = item
                    break
                else:
                    items.append(item)
    except StopIteration:
        yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])


def main():
    in_file = codecs.open('test.txt', 'r', 'utf-8')
    tokens = in_file.read().split()
    in_file.close()
    for element in iter_elements(tokens):
        print '|'.join(element)
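
(For the sample above, this should print, roughly:

200-720-7|69-93-2|kyselina mocová|C5H4N4O3
200-001-8|50-00-0|formaldehyd|CH2O
200-002-3|50-01-1|guanidínium-chlorid|CH5N3.ClH)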

Ciao,
Marc 'BlackJack' Rintsch
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Simple Text Processing Help

2007-10-15 Thread Paul Hankin
On Oct 15, 12:20 pm, Marc 'BlackJack' Rintsch [EMAIL PROTECTED] wrote:
 On Mon, 15 Oct 2007 10:47:16 +, patrick.waldo wrote:
  my sample input file looks like this( not organized,as you see it):
  200-720-7   69-93-2
  kyselina mocová  C5H4N4O3

  200-001-8   50-00-0
  formaldehyd  CH2O

  200-002-3
  50-01-1
  guanidínium-chlorid  CH5N3.ClH

  etc...

 That's quite irregular so it is not that straightforward.  One way is to
 split everything into words, start a record by taking the first two
 elements and then look for the start of the next record that looks like
 three numbers concatenated by '-' characters.  Quick and dirty hack:

 import codecs
 import re

 NR_RE = re.compile(r'^\d+-\d+-\d+$')

 def iter_elements(tokens):
     tokens = iter(tokens)
     try:
         nr_a = tokens.next()
         while True:
             nr_b = tokens.next()
             items = list()
             for item in tokens:
                 if NR_RE.match(item):
                     yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])
                     nr_a = item
                     break
                 else:
                     items.append(item)
     except StopIteration:
         yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])

Maybe this is a bit more readable?

def iter_elements(tokens):
    chem = []
    for tok in tokens:
        if NR_RE.match(tok) and len(chem) >= 4:
            chem[2:-1] = [' '.join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)
    yield chem

--
Paul Hankin

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Simple Text Processing Help

2007-10-15 Thread Peter Otten
patrick.waldo wrote:

 my sample input file looks like this( not organized,as you see it):
 200-720-7   69-93-2
 kyselina mocová  C5H4N4O3
 
 200-001-8   50-00-0
 formaldehyd  CH2O
 
 200-002-3
 50-01-1
 guanidínium-chlorid  CH5N3.ClH

Assuming that the records are always separated by blank lines and only the
third field in a record may contain spaces the following might work:

import codecs
from itertools import groupby

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"

def fields(s):
    parts = s.split()
    return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]

def records(instream):
    for key, group in groupby(instream, unicode.isspace):
        if not key:
            yield "".join(group)

if __name__ == "__main__":
    outstream = codecs.open(path2, 'w', 'utf8')
    for record in records(codecs.open(path, "r", "utf8")):
        outstream.write("|".join(fields(record)))
        outstream.write("\n")

Peter
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Simple Text Processing Help

2007-10-15 Thread patrick . waldo
Wow, thank you all.  All three work. To output correctly I needed to
add:

output.write("\r\n")

This is really a great help!!

Because of my limited Python knowledge, I will need to try to figure
out exactly how they work for future text manipulation and for my own
knowledge.  Could you recommend some resources for this kind of text
manipulation?  Also, I conceptually get it, but would you mind walking
me through

 for tok in tokens:
     if NR_RE.match(tok) and len(chem) >= 4:
         chem[2:-1] = [' '.join(chem[2:-1])]
         yield chem
         chem = []
     chem.append(tok)

and

 for key, group in groupby(instream, unicode.isspace):
     if not key:
         yield "".join(group)


Thanks again,
Patrick



On Oct 15, 2:16 pm, Peter Otten [EMAIL PROTECTED] wrote:
 patrick.waldo wrote:
  my sample input file looks like this( not organized,as you see it):
  200-720-7   69-93-2
  kyselina mocová  C5H4N4O3

  200-001-8   50-00-0
  formaldehyd  CH2O

  200-002-3
  50-01-1
  guanidínium-chlorid  CH5N3.ClH

 Assuming that the records are always separated by blank lines and only the
 third field in a record may contain spaces the following might work:

 import codecs
 from itertools import groupby

 path = "c:\\text_samples\\chem_1_utf8.txt"
 path2 = "c:\\text_samples\\chem_2.txt"

 def fields(s):
     parts = s.split()
     return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]

 def records(instream):
     for key, group in groupby(instream, unicode.isspace):
         if not key:
             yield "".join(group)

 if __name__ == "__main__":
     outstream = codecs.open(path2, 'w', 'utf8')
     for record in records(codecs.open(path, "r", "utf8")):
         outstream.write("|".join(fields(record)))
         outstream.write("\n")

 Peter


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Simple Text Processing Help

2007-10-15 Thread Paul Hankin
On Oct 15, 10:08 pm, [EMAIL PROTECTED] wrote:
 Because of my limited Python knowledge, I will need to try to figure
 out exactly how they work for future text manipulation and for my own
 knowledge.  Could you recommend some resources for this kind of text
 manipulation?  Also, I conceptually get it, but would you mind walking
 me through

  for tok in tokens:
      if NR_RE.match(tok) and len(chem) >= 4:
          chem[2:-1] = [' '.join(chem[2:-1])]
          yield chem
          chem = []
      chem.append(tok)

Sure: 'chem' is a list of all the data associated with one chemical.
When a token (tok) arrives that is matched by NR_RE (ie 3 lots of
digits separated by dots), it's assumed that this is the start of a
new chemical if we've already got 4 pieces of data. Then, we join the
name back up (as was explained in earlier posts), and 'yield chem'
yields up the chemical so far; and a new chemical is started (by
emptying the list). Whatever tok is, it's added to the end of the
current chemical data. Add some print statements in to watch it work
if you can't get it.
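For reference, NR_RE is defined earlier in the thread in Marc's code, which is
not reproduced here; below is a self-contained sketch you can run and watch.
The regex pattern is my own assumption (dash-separated digit groups such as
200-720-7), not necessarily Marc's exact expression:

import re

# assumed pattern; the real NR_RE lives in Marc's code
NR_RE = re.compile(r'\d+-\d+(-\d+)?$')

def iter_elements(tokens):
    chem = []
    for tok in tokens:
        if NR_RE.match(tok) and len(chem) >= 4:
            chem[2:-1] = [' '.join(chem[2:-1])]   # rejoin the multi-word name
            yield chem
            chem = []
        chem.append(tok)
    yield chem

tokens = ("200-720-7 69-93-2 kyselina mocova C5H4N4O3 "
          "200-001-8 50-00-0 formaldehyd CH2O").split()
for chem in iter_elements(tokens):
    print '|'.join(chem)

# prints:
# 200-720-7|69-93-2|kyselina mocova|C5H4N4O3
# 200-001-8|50-00-0|formaldehyd|CH2O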

This code uses exactly the same algorithm as Marc's code - it's just a
bit clearer (or at least, I thought so). Oh, and it returns a list
rather than a tuple, but that makes no difference.

--
Paul Hankin

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Simple Text Processing Help

2007-10-15 Thread Paul McGuire
On Oct 14, 8:48 am, [EMAIL PROTECTED] wrote:
 Hi all,

 I started Python just a little while ago and I am stuck on something
 that is really simple, but I just can't figure out.

 Essentially I need to take a text document with some chemical
 information in Czech and organize it into another text file.  The
 information is always EINECS number, CAS, chemical name, and formula
 in tables.  I need to organize them into lines with | in between.  So
 it goes from:

 200-763-1                     71-73-8
 nátrium-tiopentál           C11H18N2O2S.Na           to:

 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na

 but if I have a chemical like: kyselina močová

 I get:
 200-720-7|69-93-2|kyselina|močová
 |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál

 and then it is all off.

Pyparsing might be overkill for this example, but it is a good sample
for a demo.  If you end up doing lots of data extraction like this,
pyparsing is a useful tool.  In pyparsing, you define expressions
using pyparsing classes and built-in strings, then use the constructed
pyparsing expression to parse the data (using parseString, scanString,
searchString, or transformString).  In this example, searchString is
the easiest to use.  After the parsing is done, the parsed fields are
returned in a ParseResults object, which has some list and some dict
style behavior.  I've given each field a name based on your post, so
that you can read the tokens right out of the results as if they were
attributes of an object.  This example emits your '|' delimited data,
but the commented lines show how you could access the individually
parsed fields, too.

Learn more about pyparsing at http://pyparsing.wikispaces.com/ .

-- Paul


# -*- coding: iso-8859-15 -*-

data = """200-720-7    69-93-2
kyselina mocová  C5H4N4O3


200-001-8   50-00-0
formaldehyd  CH2O


200-002-3
50-01-1
guanidínium-chlorid  CH5N3.ClH
"""


from pyparsing import Word, nums,OneOrMore,alphas,alphas8bit

# define expressions for each part in the input data

# a numeric id starts with a number, and is followed by
# any number of numbers or '-'s
numericId = Word(nums, nums+"-")

# a chemical name is one or more words, each made up of
# alphas (including 8-bit alphas) or '-'s
chemName = OneOrMore(Word(alphas.lower()+alphas8bit.lower()+"-"))

# when returning the chemical name, rejoin the separate
# words into a single string, with spaces
chemName.setParseAction(lambda t: " ".join(t))

# a chemical formula is a 'word' starting with an uppercase
# alpha, followed by uppercase alphas or numbers
chemFormula = Word(alphas.upper(), alphas.upper()+nums)

# put all expressions into overall form, and attach field names
entry = numericId("EINECS") + \
        numericId("CAS") + \
        chemName("name") + \
        chemFormula("formula")

# search through input data, and print out retrieved data
for chemData in entry.searchString(data):
    print "%(EINECS)s|%(CAS)s|%(name)s|%(formula)s" % chemData
    # or print each field by itself
    # print chemData.EINECS
    # print chemData.CAS
    # print chemData.name
    # print chemData.formula
    # print


prints:
200-720-7|69-93-2|kyselina mocová|C5H4N4O3
200-001-8|50-00-0|formaldehyd|CH2O
200-002-3|50-01-1|guanidínium-chlorid|CH5N3

-- 
http://mail.python.org/mailman/listinfo/python-list

Simple Text Processing Help

2007-10-14 Thread patrick . waldo
Hi all,

I started Python just a little while ago and I am stuck on something
that is really simple, but I just can't figure out.

Essentially I need to take a text document with some chemical
information in Czech and organize it into another text file.  The
information is always EINECS number, CAS, chemical name, and formula
in tables.  I need to organize them into lines with | in between.  So
it goes from:

200-763-1 71-73-8
nátrium-tiopentál   C11H18N2O2S.Na   to:

200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na

but if I have a chemical like: kyselina močová

I get:
200-720-7|69-93-2|kyselina|močová
|C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál

and then it is all off.

How can I get Python to realize that a chemical name may have a space
in it?

Thank you,
Patrick

So far I have:

#take tables in one text file and organize them into lines in another

import codecs

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

#read and enter into a list
chem_file = []
chem_file.append(input.read())

#split words and store them in a list
for word in chem_file:
    words = word.split()

#starting values in list
e=0   #EINECS
c=1   #CAS
ch=2  #chemical name
f=3   #formula

n=0
loop=1
x=len(words)  #counts how many words there are in the file

print '-'*100
while loop==1:
    if n<x and f<=x:
        print words[e], '|', words[c], '|', words[ch], '|', words[f], '\n'
        output.write(words[e])
        output.write('|')
        output.write(words[c])
        output.write('|')
        output.write(words[ch])
        output.write('|')
        output.write(words[f])
        output.write('\r\n')
        #increase variables by 4 to get next set
        e = e + 4
        c = c + 4
        ch = ch + 4
        f = f + 4
        # increase by 1 to repeat
        n=n+1
    else:
        loop=0

input.close()
output.close()

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Simple Text Processing Help

2007-10-14 Thread Marc 'BlackJack' Rintsch
On Sun, 14 Oct 2007 13:48:51 +, patrick.waldo wrote:

 Essentially I need to take a text document with some chemical
 information in Czech and organize it into another text file.  The
 information is always EINECS number, CAS, chemical name, and formula
 in tables.  I need to organize them into lines with | in between.  So
 it goes from:
 
 200-763-1 71-73-8
 nátrium-tiopentál   C11H18N2O2S.Na   to:

Is that in *one* line in the input file or two lines like shown here?

 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
 
 but if I have a chemical like: kyselina močová
 
 I get:
 200-720-7|69-93-2|kyselina|močová
 |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
 
 and then it is all off.
 
 How can I get Python to realize that a chemical name may have a space
 in it?

If the two elements before and the one element after the name can't
contain spaces it is easy:  take the first two and the last as it is and
for the name take from the third to the next to last element = the name
and join them with a space.

In [202]: parts = '123 456 a name with spaces 789'.split()

In [203]: parts[0]
Out[203]: '123'

In [204]: parts[1]
Out[204]: '456'

In [205]: ' '.join(parts[2:-1])
Out[205]: 'a name with spaces'

In [206]: parts[-1]
Out[206]: '789'

This works too if the name doesn't have a space in it:

In [207]: parts = '123 456 name 789'.split()

In [208]: parts[0]
Out[208]: '123'

In [209]: parts[1]
Out[209]: '456'

In [210]: ' '.join(parts[2:-1])
Out[210]: 'name'

In [211]: parts[-1]
Out[211]: '789'

 #read and enter into a list
 chem_file = []
 chem_file.append(input.read())

This reads the whole file and puts it into a list.  This list will
*always* just contain *one* element.  So why a list at all!?

 #split words and store them in a list
 for word in chem_file:
     words = word.split()

*If* the list would contain more than one element all would be processed
but only the last is bound to `words`.  You could leave out `chem_file` and
the loop and simply do:

words = input.read().split()

Same effect but less chatty.  ;-)

The rest of the source seems to indicate that you don't really want to read
in the whole input file at once but process it line by line, i.e. chemical
element by chemical element.

Ciao,
Marc 'BlackJack' Rintsch
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Simple Text Processing Help

2007-10-14 Thread Paul Hankin
On Oct 14, 2:48 pm, [EMAIL PROTECTED] wrote:
 Hi all,

 I started Python just a little while ago and I am stuck on something
 that is really simple, but I just can't figure out.

 Essentially I need to take a text document with some chemical
 information in Czech and organize it into another text file.  The
 information is always EINECS number, CAS, chemical name, and formula
 in tables.  I need to organize them into lines with | in between.  So
 it goes from:

 200-763-1 71-73-8
 nátrium-tiopentál   C11H18N2O2S.Na   to:

 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na

 but if I have a chemical like: kyselina močová

 I get:
 200-720-7|69-93-2|kyselina|močová
 |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál

 and then it is all off.

 How can I get Python to realize that a chemical name may have a space
 in it?

In the original file, is every chemical on a line of its own? I assume
it is here.

You might use a regexp (look at the re module), or I think here you
can use the fact that only chemicals have spaces in them. Then, you
can split each line on whitespace (like you're doing), and join back
together all the words between the 3rd (ie index 2) and the last (ie
index -1) using tokens[2:-1] = [u' '.join(tokens[2:-1])]. This uses
the somewhat unusual python syntax for replacing a section of a list
with another list.
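For example, a quick demonstration of that slice assignment (my own snippet,
not from the original post):

tokens = u'200-720-7 69-93-2 kyselina mocova C5H4N4O3'.split()
tokens[2:-1] = [u' '.join(tokens[2:-1])]   # collapse the name words into one element
print tokens
# [u'200-720-7', u'69-93-2', u'kyselina mocova', u'C5H4N4O3']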

The approach you took involves reading the whole file, and building a
list of all the chemicals which you don't seem to use: I've changed it
to a per-line version and removed the big lists.

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

for line in input:
    tokens = line.strip().split()
    tokens[2:-1] = [u' '.join(tokens[2:-1])]
    chemical = u'|'.join(tokens)
    print chemical + u'\n'
    output.write(chemical + u'\r\n')

input.close()
output.close()

Obviously, this isn't tested because I don't have your chem_1_utf8.txt
file.

--
Paul Hankin

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Simple Text Processing Help

2007-10-14 Thread patrick . waldo
Thank you both for helping me out.  I am still rather new to Python
and so I'm probably trying to reinvent the wheel here.

When I try to do Paul's response, I get
tokens = line.strip().split()
[]

So I am not quite sure how to read line by line.

tokens = input.read().split() gets me all the information from the
file.  tokens[2:-1] = [u' '.join(tokens[2:-1])] works just fine, like
in the example; however, how can I loop this for the entire document?
Also, when I try output.write(tokens), I get TypeError: coercing to
Unicode: need string or buffer, list found.

Any ideas?

On Oct 14, 4:25 pm, Paul Hankin [EMAIL PROTECTED] wrote:
 On Oct 14, 2:48 pm, [EMAIL PROTECTED] wrote:



  Hi all,

  I started Python just a little while ago and I am stuck on something
  that is really simple, but I just can't figure out.

  Essentially I need to take a text document with some chemical
  information in Czech and organize it into another text file.  The
  information is always EINECS number, CAS, chemical name, and formula
  in tables.  I need to organize them into lines with | in between.  So
  it goes from:

  200-763-1 71-73-8
  nátrium-tiopentál   C11H18N2O2S.Na   to:

  200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na

  but if I have a chemical like: kyselina močová

  I get:
  200-720-7|69-93-2|kyselina|močová
  |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál

  and then it is all off.

  How can I get Python to realize that a chemical name may have a space
  in it?

 In the original file, is every chemical on a line of its own? I assume
 it is here.

 You might use a regexp (look at the re module), or I think here you
 can use the fact that only chemicals have spaces in them. Then, you
 can split each line on whitespace (like you're doing), and join back
 together all the words between the 3rd (ie index 2) and the last (ie
 index -1) using tokens[2:-1] = [u' '.join(tokens[2:-1])]. This uses
 the somewhat unusual python syntax for replacing a section of a list
 with another list.

 The approach you took involves reading the whole file, and building a
 list of all the chemicals which you don't seem to use: I've changed it
 to a per-line version and removed the big lists.

 path = "c:\\text_samples\\chem_1_utf8.txt"
 path2 = "c:\\text_samples\\chem_2.txt"
 input = codecs.open(path, 'r','utf8')
 output = codecs.open(path2, 'w', 'utf8')

 for line in input:
     tokens = line.strip().split()
     tokens[2:-1] = [u' '.join(tokens[2:-1])]
     chemical = u'|'.join(tokens)
     print chemical + u'\n'
     output.write(chemical + u'\r\n')

 input.close()
 output.close()

 Obviously, this isn't tested because I don't have your chem_1_utf8.txt
 file.

 --
 Paul Hankin


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Simple Text Processing Help

2007-10-14 Thread Marc 'BlackJack' Rintsch
On Sun, 14 Oct 2007 16:57:06 +, patrick.waldo wrote:

 Thank you both for helping me out.  I am still rather new to Python
 and so I'm probably trying to reinvent the wheel here.
 
 When I try to do Paul's response, I get
tokens = line.strip().split()
 []

What is in `line`?  Paul wrote this in the body of the ``for`` loop over
all the lines in the file.

 So I am not quite sure how to read line by line.

That's what the ``for`` loop over a file or file-like object is doing. 
Maybe you should develop your script in smaller steps and do some printing
to see what you get at each step.  For example after opening the input
file:

for line in input:
    print line     # prints the whole line.
    tokens = line.split()
    print tokens   # prints a list with the split line.

 tokens = input.read().split() gets me all the information from the
 file.

Right it reads *all* of the file, not just one line.

  tokens[2:-1] = [u' '.join(tokens[2:-1])] works just fine, like
 in the example; however, how can I loop this for the entire document?

Don't read the whole file but line by line, just like Paul showed you.

 Also, when I try output.write(tokens), I get TypeError: coercing to
 Unicode: need string or buffer, list found.

`tokens` is a list but you need to write a unicode string.  So you have to
reassemble the parts with '|' characters in between.  Also shown by Paul.
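That is, something like this (a sketch, with tokens as in Paul's code):

output.write(u'|'.join(tokens) + u'\r\n')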

Ciao,
Marc 'BlackJack' Rintsch
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Simple Text Processing Help

2007-10-14 Thread John Machin
On Oct 14, 11:48 pm, [EMAIL PROTECTED] wrote:
 Hi all,

 I started Python just a little while ago and I am stuck on something
 that is really simple, but I just can't figure out.

 Essentially I need to take a text document with some chemical
 information in Czech and organize it into another text file.  The
 information is always EINECS number, CAS, chemical name, and formula
 in tables.  I need to organize them into lines with | in between.  So
 it goes from:

 200-763-1 71-73-8
 nátrium-tiopentál   C11H18N2O2S.Na   to:

 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na

 but if I have a chemical like: kyselina močová

 I get:
 200-720-7|69-93-2|kyselina|močová
 |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál

 and then it is all off.

 How can I get Python to realize that a chemical name may have a space
 in it?


Your input file could be in one of THREE formats:
(1) fields are separated by TAB characters (represented in Python by
the escape sequence '\t', and equivalent to '\x09')
(2) fields are fixed width and padded with spaces
(3) fields are separated by a random number of whitespace characters
(and can contain spaces).

What makes you sure that you have format 3? You might like to try
something like
lines = open('your_file.txt').readlines()[:4]
print lines
print map(len, lines)
This will print a *precise* representation of what is in the first
four lines, plus their lengths. Please show us the output.

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Text processing and file creation

2007-09-07 Thread Paddy
On Sep 7, 3:50 am, George Sakkis [EMAIL PROTECTED] wrote:
 On Sep 5, 5:17 pm, [EMAIL PROTECTED] [EMAIL PROTECTED]
 wrote:
 If this was a code golf challenge,

I'd choose the Unix split solution and be both maintainable as well as
concise :-)

- Paddy.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text processing and file creation

2007-09-06 Thread Alberto Griggio
 Thanks for making me aware of the (UNIX) split command (split -l 5
 inFile.txt), it's short, it's fast, it's beautiful.
 
 I am still wondering how to do this efficiently in Python (being kind
 of new to it... and it's not for homework).

Something like this should do the job:

def nlines(num, fileobj):
    done = [False]              # mutable flag so the inner generator can signal EOF
    def doit():
        for i in xrange(num):
            l = fileobj.readline()
            if not l:           # end of file reached
                done[0] = True
                return
            yield l
    while not done[0]:
        yield doit()            # each item is itself a generator of up to num lines

for i, group in enumerate(nlines(5, open('bigfile.txt'))):
    out = open('chunk_%d.txt' % i, 'w')   # open chunk file for writing
    for line in group:
        out.write(line)


 I am still wondering how to do this in Python (being new to Python)

This is just one way of doing it, but not as concise as using split...

Alberto


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text processing and file creation

2007-09-06 Thread Arnau Sanchez
[EMAIL PROTECTED] escribió:

 I am still wondering how to do this efficiently in Python (being kind
 of new to it... and it's not for homework).

You should post some code anyway; it would be easier to give useful advice (it
would also demonstrate that you put some effort into it).

Anyway, here is an option. Text-file objects are line-iterable, so you could 
use 
itertools (perhaps a bit difficult module for a newbie...):

from itertools import islice, takewhile, repeat

def take(it, n):
    return list(islice(it, n))

def readnlines(fd, n):
    return takewhile(bool, (take(fd, n) for _ in repeat(None)))

def splitfile(path, prefix, nlines, suffix_digits):
    sformat = "%%0%dd" % suffix_digits
    for index, lines in enumerate(readnlines(file(path), nlines)):
        open("%s_%s" % (prefix, sformat % index), "w").writelines(lines)

splitfile("/etc/services", "out", 5, 4)

arnau
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text processing and file creation

2007-09-06 Thread Shawn Milochik
Here's my solution, for what it's worth:

#!/usr/bin/env python

import os

input = open("test.txt", "r")

counter = 0
fileNum = 0
fileName = ""

def newFileName():

    global fileNum, fileName

    while os.path.exists(fileName) or fileName == "":
        fileNum += 1
        x = "%0.5d" % fileNum
        fileName = "%s.tmp" % x

    return fileName


for line in input:

    if (fileName == "") or (counter == 5):
        if fileName:
            output.close()
        fileName = newFileName()
        counter = 0
        output = open(fileName, "w")

    output.write(line)
    counter += 1

output.close()
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text processing and file creation

2007-09-06 Thread George Sakkis
On Sep 5, 5:17 pm, [EMAIL PROTECTED] [EMAIL PROTECTED]
wrote:
 On Sep 5, 1:28 pm, Paddy [EMAIL PROTECTED] wrote:

  On Sep 5, 5:13 pm, [EMAIL PROTECTED] [EMAIL PROTECTED]
  wrote:

   I have a text source file of about 20.000 lines. From this file, I like
   to write the first 5 lines to a new file. Close

   that file, grab the next 5 lines write these to a new file... grabbing
   5 lines and creating new files until processing of all 20.000 lines is
   done.
   Is there an efficient way to do this in Python?
   In advance, thanks for your help.

  If its on unix: use split.
  If its your homework: show us what you have so far...

  - Paddy.

 Paddy,

 Thanks for making me aware of the (UNIX) split command (split -l 5
 inFile.txt), it's short, it's fast, it's beautiful.

 I am still wondering how to do this efficiently in Python (being kind
 of new to it... and it's not for homework).

 -- Martin.

 I am still wondering how to do this in Python (being new to Python)

If this was a code golf challenge, a decent entry (146 chars) could
be:

import itertools as it
for i,g in it.groupby(enumerate(open('input.txt')),lambda(i,_):i/5):open("output.%d.txt"%i,'w').writelines(s for _,s in g)

or a bit less cryptically:

import itertools as it
for chunk, enum_lines in it.groupby(enumerate(open('input.txt')),
                                    lambda (i, line): i // 5):
    open("output.%d.txt" % chunk, 'w').writelines(line for _, line in enum_lines)


George

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text processing and file creation

2007-09-06 Thread Ricardo Aráoz
Shawn Milochik wrote:
 On 9/5/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 I have a text source file of about 20.000 lines.
 From this file, I like to write the first 5 lines to a new file. Close
 that file, grab the next 5 lines write these to a new file... grabbing
 5 lines and creating new files until processing of all 20.000 lines is
 done.
 Is there an efficient way to do this in Python?
 In advance, thanks for your help.


Maybe (untested):

def read5Lines(f):
    L = f.readline()
    while L:
        yield (L, f.readline(), f.readline(), f.readline(), f.readline())
        L = f.readline()

inf = open('C:\\YourFile', 'rb')    # 'in' is a reserved word in Python, so use another name
for fileNo, fiveLines in enumerate(read5Lines(inf)):
    out = open('c:\\OutFile' + str(fileNo), 'wb')
    out.writelines(fiveLines)
    out.close()

or something similar? (notice that in the last output file you may have
a few (4 at most) blank lines)




-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text processing and file creation

2007-09-05 Thread kyosohma
On Sep 5, 11:13 am, [EMAIL PROTECTED] [EMAIL PROTECTED]
wrote:
 I have a text source file of about 20.000 lines. From this file, I like to
 write the first 5 lines to a new file. Close

 that file, grab the next 5 lines write these to a new file... grabbing
 5 lines and creating new files until processing of all 20.000 lines is
 done.
 Is there an efficient way to do this in Python?
 In advance, thanks for your help.

I would use a counter in a for loop using the readline method to
iterate over the 20,000 line file. Reset the counter every 5 lines/
iterations and close the file. To name files with unique names, use
the time module. Something like this:

x = 'filename-%s.txt' % time.time()
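A rough sketch of that counter approach (my own illustration; note that
time.time() could in principle produce the same string twice on a very fast
loop, so a counter in the name is safer):

import time

counter = 0
outfile = None
for line in open("inFile.txt"):
    if counter % 5 == 0:                 # start a new file every 5 lines
        if outfile:
            outfile.close()
        outfile = open('filename-%s.txt' % time.time(), 'w')
    outfile.write(line)
    counter += 1
if outfile:
    outfile.close()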

Have fun!

Mike

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text processing and file creation

2007-09-05 Thread Arnau Sanchez
[EMAIL PROTECTED] escribió:

 I have a text source file of about 20.000 lines.
From this file, I like to write the first 5 lines to a new file. Close
 that file, grab the next 5 lines write these to a new file... grabbing
 5 lines and creating new files until processing of all 20.000 lines is
 done.
 Is there an efficient way to do this in Python?

Perhaps you could provide some code to see how you approached it?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text processing and file creation

2007-09-05 Thread Bjoern Schliessmann
 [EMAIL PROTECTED] wrote:

 I would use a counter in a for loop using the readline method to
 iterate over the 20,000 line file. 

file objects are iterables themselves, so there's no need to do that
by using a method.

 Reset the counter every 5 lines/ iterations and close the file. 

I'd use a generator that fetches five lines of the file per
iteration and iterate over it instead of the file directly.
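For instance, a sketch along those lines using itertools.islice (the function
and file names are my own):

from itertools import islice

def groups_of(fileobj, n):
    # yield lists of up to n lines until the file is exhausted
    while True:
        chunk = list(islice(fileobj, n))
        if not chunk:
            break
        yield chunk

for i, chunk in enumerate(groups_of(open('inFile.txt'), 5)):
    out = open('out_%04d.txt' % i, 'w')
    out.writelines(chunk)
    out.close()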

 Have fun!

Definitely -- and also do your homework yourself :)

Regards,


Björn

-- 
BOFH excuse #339:

manager in the cable duct

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text processing and file creation

2007-09-05 Thread Shawn Milochik
On 9/5/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 I have a text source file of about 20.000 lines.
 From this file, I like to write the first 5 lines to a new file. Close
 that file, grab the next 5 lines write these to a new file... grabbing
 5 lines and creating new files until processing of all 20.000 lines is
 done.
 Is there an efficient way to do this in Python?
 In advance, thanks for your help.



I have written a working test of this. Here's the basic setup:




open the input file

function newFileName:
generate a filename (starting with 00001.tmp).
If filename exists, increment and test again (00002.tmp and so on).
return fileName

read a line until input file is empty:

test to see whether I have written five lines. If so, get a new
file name, close file, and open new file

write line to file

close output file final time


Once you get some code running, feel free to post it and we'll help.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text processing and file creation

2007-09-05 Thread kyosohma
On Sep 5, 11:57 am, Bjoern Schliessmann usenet-
[EMAIL PROTECTED] wrote:
  [EMAIL PROTECTED] wrote:
  I would use a counter in a for loop using the readline method to
  iterate over the 20,000 line file.

 file objects are iterables themselves, so there's no need to do that
 by using a method.

Very true! Darn it!


  Reset the counter every 5 lines/ iterations and close the file.

 I'd use a generator that fetches five lines of the file per
 iteration and iterate over it instead of the file directly.


I still haven't figured out how to use generators, so this didn't even
come to mind. I usually see something like this example for reading a
file:

f = open("somefile")
for line in f:
    # do something


http://docs.python.org/tut/node9.html

Okay, so they didn't use readline. I wonder where I saw that.

  Have fun!

 Definitely -- and also do your homework yourself :)

 Regards,

 Björn

 --
 BOFH excuse #339:

 manager in the cable duct

Mike

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Text processing and file creation

2007-09-05 Thread James Stroud
[EMAIL PROTECTED] wrote:
 I have a text source file of about 20.000 lines.
From this file, I like to write the first 5 lines to a new file. Close
 that file, grab the next 5 lines write these to a new file... grabbing
 5 lines and creating new files until processing of all 20.000 lines is
 done.
 Is there an efficient way to do this in Python?

You should use a nested loop.
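Presumably something like this (my reading of the hint, not code from the
thread; the file names are made up):

infile = open('inFile.txt')
fileNum = 0
while True:                          # outer loop: one output file per pass
    lines = []
    for i in range(5):               # inner loop: grab up to 5 lines
        line = infile.readline()
        if not line:
            break
        lines.append(line)
    if not lines:
        break
    out = open('out_%04d.txt' % fileNum, 'w')
    out.writelines(lines)
    out.close()
    fileNum += 1
infile.close()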

 In advance, thanks for your help.
 

You're welcome.

-- 
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text processing and file creation

2007-09-05 Thread Paddy
On Sep 5, 5:13 pm, [EMAIL PROTECTED] [EMAIL PROTECTED]
wrote:
 I have a text source file of about 20.000 lines. From this file, I like to
 write the first 5 lines to a new file. Close

 that file, grab the next 5 lines write these to a new file... grabbing
 5 lines and creating new files until processing of all 20.000 lines is
 done.
 Is there an efficient way to do this in Python?
 In advance, thanks for your help.

If its on unix: use split.
If its your homework: show us what you have so far...

- Paddy.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text processing and file creation

2007-09-05 Thread [EMAIL PROTECTED]
On Sep 5, 1:28 pm, Paddy [EMAIL PROTECTED] wrote:
 On Sep 5, 5:13 pm, [EMAIL PROTECTED] [EMAIL PROTECTED]
 wrote:

  I have a text source file of about 20.000 lines. From this file, I like to
  write the first 5 lines to a new file. Close

  that file, grab the next 5 lines write these to a new file... grabbing
  5 lines and creating new files until processing of all 20.000 lines is
  done.
  Is there an efficient way to do this in Python?
  In advance, thanks for your help.

 If its on unix: use split.
 If its your homework: show us what you have so far...

 - Paddy.

Paddy,

Thanks for making me aware of the (UNIX) split command (split -l 5
inFile.txt), it's short, it's fast, it's beautiful.

I am still wondering how to do this efficiently in Python (being kind
of new to it... and it's not for homework).

-- Martin.


I am still wondering how to do this in Python (being new to Python)

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text processing and file creation

2007-09-05 Thread Arnaud Delobelle
On Sep 5, 5:13 pm, [EMAIL PROTECTED] [EMAIL PROTECTED]
wrote:
 I have a text source file of about 20.000 lines. From this file, I like to
 write the first 5 lines to a new file. Close

 that file, grab the next 5 lines write these to a new file... grabbing
 5 lines and creating new files until processing of all 20.000 lines is
 done.
 Is there an efficient way to do this in Python?

Sure!

 In advance, thanks for your help.

from my_useful_functions import new_file, write_first_5_lines, \
    done_processing_file, grab_next_5_lines, another_new_file, write_these

in_f = open('myfile')
out_f = new_file()
write_first_5_lines(in_f, out_f) # write first 5 lines
close(out_f)
while not done_processing_file(in_f): # until done processing
    lines = grab_next_5_lines(in_f) # grab next 5 lines
    out_f = another_new_file()
    write_these(lines, out_f) # write these
    close(out_f)
print "all done!" # All done
print "Now there are 4000 files in this directory..."

Python 3.0 - ready (I've used open() instead of file())

HTH

--
Arnaud


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text processing and file creation

2007-09-05 Thread Steve Holden
Arnaud Delobelle wrote:
[...]
 from my_useful_functions import new_file, write_first_5_lines, \
     done_processing_file, grab_next_5_lines, another_new_file, write_these

 in_f = open('myfile')
 out_f = new_file()
 write_first_5_lines(in_f, out_f) # write first 5 lines
 close(out_f)
 while not done_processing_file(in_f): # until done processing
     lines = grab_next_5_lines(in_f) # grab next 5 lines
     out_f = another_new_file()
     write_these(lines, out_f) # write these
     close(out_f)
 print "all done!" # All done
 print "Now there are 4000 files in this directory..."
 
 Python 3.0 - ready (I've used open() instead of file())
 
bzzt!

Python 3.0a1 (py3k:57844, Aug 31 2007, 16:54:27) ...
Type "help", "copyright", "credits" or "license" for more information.
>>> print "all done!" # All done
  File "<stdin>", line 1
    print "all done!" # All done
                    ^
SyntaxError: invalid syntax
>>>

Close, but no cigar ;-)

regards
  Steve
-- 
Steve Holden+1 571 484 6266   +1 800 494 3119
Holden Web LLC/Ltd   http://www.holdenweb.com
Skype: holdenweb  http://del.icio.us/steve.holden
--- Asciimercial --
Get on the web: Blog, lens and tag the Internet
Many services currently offer free registration
--- Thank You for Reading -

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text processing and file creation

2007-09-05 Thread Ginger
File-reading latency is mainly caused by a high read frequency, so reducing
how often you read from the file may solve your problem.
You can pass a byte count to a Python file object's read() method; a large
value (like 65536) can be chosen to fit your memory budget, and you can then
parse lines out of the read buffer yourself.
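A minimal sketch of that idea (my own illustration; the file name is made up):
read big blocks and split complete lines out of a buffer.

f = open('bigfile.txt', 'rb')
buf = ''
while True:
    chunk = f.read(65536)        # one large read instead of many small ones
    if not chunk:
        break
    buf += chunk
    lines = buf.split('\n')
    buf = lines.pop()            # keep the trailing partial line for the next round
    for line in lines:
        pass                     # process each complete line here
f.close()
# buf still holds the last line if the file did not end with a newline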

have fun!

- Original Message - 
From: Shawn Milochik [EMAIL PROTECTED]
To: python-list@python.org
Sent: Thursday, September 06, 2007 1:03 AM
Subject: Re: Text processing and file creation


 On 9/5/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 I have a text source file of about 20.000 lines.
 From this file, I like to write the first 5 lines to a new file. Close
 that file, grab the next 5 lines write these to a new file... grabbing
 5 lines and creating new files until processing of all 20.000 lines is
 done.
 Is there an efficient way to do this in Python?
 In advance, thanks for your help.

 
 
 I have written a working test of this. Here's the basic setup:
 
 
 
 
 open the input file
 
 function newFileName:
generate a filename (starting with 00001.tmp).
If filename exists, increment and test again (00002.tmp and so on).
return fileName
 
 read a line until input file is empty:
 
test to see whether I have written five lines. If so, get a new
 file name, close file, and open new file
 
write line to file
 
 close output file final time
 
 
 Once you get some code running, feel free to post it and we'll help.
 

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Text processing and file creation

2007-09-05 Thread Arnaud Delobelle
On Sep 6, 12:46 am, Steve Holden [EMAIL PROTECTED] wrote:
 Arnaud Delobelle wrote:
[...]
  print "all done!" # All done
  print "Now there are 4000 files in this directory..."

  Python 3.0 - ready (I've used open() instead of file())

 bzzt!

 Python 3.0a1 (py3k:57844, Aug 31 2007, 16:54:27) ...
 Type "help", "copyright", "credits" or "license" for more information.
 >>> print "all done!" # All done
   File "<stdin>", line 1
     print "all done!" # All done
                     ^
 SyntaxError: invalid syntax
  

Damn!  That'll teach me to make such bold claims.
At least I'm unlikely to forget again now...

--
Arnaud

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: On text processing

2007-03-24 Thread Daniel Nogradi
  I'm in the process of rewriting a bash/awk/sed script -- that grew too
  big -- in python. I can rewrite it in a simple line-by-line way but
  that results in ugly python code and I'm sure there is a simple
  pythonic way.
 
  The bash script processed text files of the form:
 
  ###
  key1    value1
  key2    value2
  key3    value3

  key4    value4
  spec11  spec12   spec13   spec14
  spec21  spec22   spec23   spec24
  spec31  spec32   spec33   spec34

  key5    value5
  key6    value6

  key7    value7
  more11   more12   more13
  more21   more22   more23

  key8    value8
  ###
 
  I guess you get the point. If a line has two entries it is a key/value
  pair which should end up in a dictionary. If a key/value pair is
  followed by consecutive lines with more than two entries, it is a
  matrix that should end up in a list of lists (matrix) that can be
  identified by the key preceding it. The empty line after the last
  line of a matrix signifies that the matrix is finished and we are back
  to a key/value situation. Note that a matrix is always preceded by a
  key/value pair so that it can really be identified by the key.
 
  Any elegant solution for this?


 My solution expects correctly formatted input and parses it into
 separate key/value and matrix holding dicts:


 from StringIO import StringIO

 fileText = '''\
 key1    value1
 key2    value2
 key3    value3

 key4    value4
 spec11  spec12   spec13   spec14
 spec21  spec22   spec23   spec24
 spec31  spec32   spec33   spec34

 key5    value5
 key6    value6

 key7    value7
 more11   more12   more13
 more21   more22   more23

 key8    value8
 '''
 infile = StringIO(fileText)

 keyvalues = {}
 matrices  = {}
 for line in infile:
     fields = line.strip().split()
     if len(fields) == 2:
         keyvalues[fields[0]] = fields[1]
         lastkey = fields[0]
     elif fields:
         matrices.setdefault(lastkey, []).append(fields)

 ==
 Here is the sample output:

 >>> from pprint import pprint as pp
 >>> pp(keyvalues)
 {'key1': 'value1',
  'key2': 'value2',
  'key3': 'value3',
  'key4': 'value4',
  'key5': 'value5',
  'key6': 'value6',
  'key7': 'value7',
  'key8': 'value8'}
 >>> pp(matrices)
 {'key4': [['spec11', 'spec12', 'spec13', 'spec14'],
   ['spec21', 'spec22', 'spec23', 'spec24'],
   ['spec31', 'spec32', 'spec33', 'spec34']],
  'key7': [['more11', 'more12', 'more13'], ['more21', 'more22',
 'more23']]}
 

Paddy, thanks, this looks even better.
Paul, pyparsing looks like overkill, even the config parser module
is something that is too complex for me for such a simple task. The
text files are actually input files to a program and will never be
longer than 20-30 lines so Paddy's solution is perfectly fine. In any
case it's good to know that there exists a module called pyparsing :)
-- 
http://mail.python.org/mailman/listinfo/python-list


On text processing

2007-03-23 Thread Daniel Nogradi
Hi list,

I'm in the process of rewriting a bash/awk/sed script -- that grew too
big -- in python. I can rewrite it in a simple line-by-line way but
that results in ugly python code and I'm sure there is a simple
pythonic way.

The bash script processed text files of the form:

###
key1    value1
key2    value2
key3    value3

key4    value4
spec11  spec12   spec13   spec14
spec21  spec22   spec23   spec24
spec31  spec32   spec33   spec34

key5    value5
key6    value6

key7    value7
more11   more12   more13
more21   more22   more23

key8    value8
###

I guess you get the point. If a line has two entries it is a key/value
pair which should end up in a dictionary. If a key/value pair is
followed by consecutive lines with more than two entries, it is a
matrix that should end up in a list of lists (matrix) that can be
identified by the key preceding it. The empty line after the last
line of a matrix signifies that the matrix is finished and we are back
to a key/value situation. Note that a matrix is always preceded by a
key/value pair so that it can really be identified by the key.

Any elegant solution for this?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: On text processing

2007-03-23 Thread bearophileHUGS
Daniel Nogradi:
 Any elegant solution for this?

This is my first try:

ddata = {}

inside_matrix = False
for row in file("data.txt"):
    if row.strip():
        fields = row.split()
        if len(fields) == 2:
            inside_matrix = False
            ddata[fields[0]] = [fields[1]]
            lastkey = fields[0]
        else:
            if inside_matrix:
                ddata[lastkey][1].append(fields)
            else:
                ddata[lastkey].append([fields])
                inside_matrix = True

# This gives some output for testing only:
for k in sorted(ddata):
    print k, ddata[k]


Input file data.txt:

key1    value1
key2    value2
key3    value3

key4    value4
spec11  spec12   spec13   spec14
spec21  spec22   spec23   spec24
spec31  spec32   spec33   spec34

key5    value5
key6    value6

key7    value7
more11   more12   more13
more21   more22   more23

key8    value8


The output:

key1 ['value1']
key2 ['value2']
key3 ['value3']
key4 ['value4', [['spec11', 'spec12', 'spec13', 'spec14'], ['spec21',
'spec22', 'spec23', 'spec24'], ['spec31', 'spec32', 'spec33',
'spec34']]]
key5 ['value5']
key6 ['value6']
key7 ['value7', [['more11', 'more12', 'more13'], ['more21', 'more22',
'more23']]]
key8 ['value8']


If there are many simple keys, then you can avoid creating a single
element list for them, but then you have to tell apart the two cases
on the base of the key (while now the presence of the second element
is able  to tell apart the two situations). You can also use two
different dicts to keep the two different kinds of data.

Bye,
bearophile

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: On text processing

2007-03-23 Thread Daniel Nogradi
 This is my first try:

 ddata = {}

 inside_matrix = False
 for row in file("data.txt"):
     if row.strip():
         fields = row.split()
         if len(fields) == 2:
             inside_matrix = False
             ddata[fields[0]] = [fields[1]]
             lastkey = fields[0]
         else:
             if inside_matrix:
                 ddata[lastkey][1].append(fields)
             else:
                 ddata[lastkey].append([fields])
                 inside_matrix = True

 # This gives some output for testing only:
 for k in sorted(ddata):
     print k, ddata[k]


 Input file data.txt:

 key1    value1
 key2    value2
 key3    value3

 key4    value4
 spec11  spec12   spec13   spec14
 spec21  spec22   spec23   spec24
 spec31  spec32   spec33   spec34

 key5    value5
 key6    value6

 key7    value7
 more11   more12   more13
 more21   more22   more23

 key8    value8


 The output:

 key1 ['value1']
 key2 ['value2']
 key3 ['value3']
 key4 ['value4', [['spec11', 'spec12', 'spec13', 'spec14'], ['spec21',
 'spec22', 'spec23', 'spec24'], ['spec31', 'spec32', 'spec33',
 'spec34']]]
 key5 ['value5']
 key6 ['value6']
 key7 ['value7', [['more11', 'more12', 'more13'], ['more21', 'more22',
 'more23']]]
 key8 ['value8']


 If there are many simple keys, then you can avoid creating a single
 element list for them, but then you have to tell apart the two cases
 on the base of the key (while now the presence of the second element
 is able  to tell apart the two situations). You can also use two
 different dicts to keep the two different kinds of data.

 Bye,
 bearophile

Thanks very much, it's indeed quite simple. I was lost in the
itertools documentation :)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: On text processing

2007-03-23 Thread Paddy
On Mar 23, 10:30 pm, Daniel Nogradi [EMAIL PROTECTED] wrote:
 Hi list,

 I'm in the process of rewriting a bash/awk/sed script -- that grew too
 big -- in python. I can rewrite it in a simple line-by-line way but
 that results in ugly python code and I'm sure there is a simple
 pythonic way.

 The bash script processed text files of the form:

 ###
 key1    value1
 key2    value2
 key3    value3

 key4    value4
 spec11  spec12   spec13   spec14
 spec21  spec22   spec23   spec24
 spec31  spec32   spec33   spec34

 key5    value5
 key6    value6

 key7    value7
 more11   more12   more13
 more21   more22   more23

 key8    value8
 ###

 I guess you get the point. If a line has two entries it is a key/value
 pair which should end up in a dictionary. If a key/value pair is
 followed by consecutive lines with more than two entries, it is a
 matrix that should end up in a list of lists (matrix) that can be
 identified by the key preceding it. The empty line after the last
 line of a matrix signifies that the matrix is finished and we are back
 to a key/value situation. Note that a matrix is always preceded by a
 key/value pair so that it can really be identified by the key.

 Any elegant solution for this?


My solution expects correctly formatted input and parses it into
separate key/value and matrix holding dicts:


from StringIO import StringIO

fileText = '''\
key1    value1
key2    value2
key3    value3

key4    value4
spec11  spec12   spec13   spec14
spec21  spec22   spec23   spec24
spec31  spec32   spec33   spec34

key5    value5
key6    value6

key7    value7
more11   more12   more13
more21   more22   more23

key8    value8
'''
infile = StringIO(fileText)

keyvalues = {}
matrices  = {}
for line in infile:
    fields = line.strip().split()
    if len(fields) == 2:
        keyvalues[fields[0]] = fields[1]
        lastkey = fields[0]
    elif fields:
        matrices.setdefault(lastkey, []).append(fields)

==
Here is the sample output:

>>> from pprint import pprint as pp
>>> pp(keyvalues)
{'key1': 'value1',
 'key2': 'value2',
 'key3': 'value3',
 'key4': 'value4',
 'key5': 'value5',
 'key6': 'value6',
 'key7': 'value7',
 'key8': 'value8'}
>>> pp(matrices)
{'key4': [['spec11', 'spec12', 'spec13', 'spec14'],
  ['spec21', 'spec22', 'spec23', 'spec24'],
  ['spec31', 'spec32', 'spec33', 'spec34']],
 'key7': [['more11', 'more12', 'more13'], ['more21', 'more22',
'more23']]}


- Paddy.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: On text processing

2007-03-23 Thread Paul McGuire
On Mar 23, 5:30 pm, Daniel Nogradi [EMAIL PROTECTED] wrote:
 Hi list,

 I'm in a process of rewriting a bash/awk/sed script -- that grew to
 big -- in python. I can rewrite it in a simple line-by-line way but
 that results in ugly python code and I'm sure there is a simple
 pythonic way.

 The bash script processed text files of the form...

 Any elegant solution for this?

Is a parser overkill?  Here's how you might use pyparsing for this
problem.

I just wanted to show that pyparsing's returned results can be
structured as more than just lists of tokens.  Using pyparsing's Dict
class (or the dictOf helper that simplifies using Dict), you can
return results that can be accessed like a nested list, like a dict,
or like an instance with named attributes (see the last line of the
example).

You can adjust the syntax definition of keys and values to fit your
actual data, for instance, if the matrices are actually integers, then
define the matrixRow as:

matrixRow = Group( OneOrMore( Word(nums) ) ) + eol


-- Paul


from pyparsing import ParserElement, LineEnd, Word, alphas, alphanums, \
    Group, ZeroOrMore, OneOrMore, Optional, dictOf

data = """key1    value1
key2    value2
key3    value3


key4    value4
spec11  spec12   spec13   spec14
spec21  spec22   spec23   spec24
spec31  spec32   spec33   spec34


key5    value5
key6    value6


key7    value7
more11   more12   more13
more21   more22   more23


key8    value8
"""

# retain significant newlines (pyparsing reads over whitespace by default)
ParserElement.setDefaultWhitespaceChars(" \t")

eol = LineEnd().suppress()
elem = Word(alphas,alphanums)
key = elem
matrixRow = Group( elem + elem + OneOrMore(elem) ) + eol
matrix = Group( OneOrMore( matrixRow ) ) + eol
value = elem + eol + Optional( matrix ) + ZeroOrMore(eol)
parser = dictOf(key, value)

# parse the data
results = parser.parseString(data)

# access the results
# - like a dict
# - like a list
# - like an instance with keys for attributes
print results.keys()
print

for k in sorted(results.keys()):
    print k,
    if isinstance( results[k], basestring ):
        print results[k]
    else:
        print results[k][0]
        for row in results[k][1]:
            print " ".join(row)
print

print results.key3


Prints out:
['key8', 'key3', 'key2', 'key1', 'key7', 'key6', 'key5', 'key4']

key1 value1
key2 value2
key3 value3
key4 value4
spec11 spec12 spec13 spec14
spec21 spec22 spec23 spec24
spec31 spec32 spec33 spec34
key5 value5
key6 value6
key7 value7
more11 more12 more13
more21 more22 more23
key8 value8

value3



-- 
http://mail.python.org/mailman/listinfo/python-list


Suitability for long-running text processing?

2007-01-08 Thread tsuraan

I have a pair of python programs that parse and index files on my computer
to make them searchable.  The problem that I have is that they continually
grow until my system is out of memory, and then things get ugly.  I
remember, when I was first learning python, reading that the python
interpreter doesn't gc small strings, but I assumed that was outdated and
sort of forgot about it.  Unfortunately, it seems this is still the case.  A
sample program (to type/copy and paste into the python REPL):

a=[]
for i in xrange(33,127):
    for j in xrange(33,127):
        for k in xrange(33,127):
            for l in xrange(33, 127):
                a.append(chr(i)+chr(j)+chr(k)+chr(l))

del(a)
import gc
gc.collect()

The loop is deep enough that I always interrupt it once python's size is
around 250 MB.  Once the gc.collect() call is finished, python's size has
not changed a bit.  Even though there are no locals, no references at all to
all the strings that were created, python will not reduce its size.  This
example is obviously artificial, but I am getting the exact same behaviour
in my real programs.  Is there some way to convince python to get rid of all
the data that is no longer referenced, or do I need to use a different
language?

This has been tried under python 2.4.3 in gentoo linux and python 2.3 under
OS X.3.  Any suggestions/work arounds would be much appreciated.
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Suitability for long-running text processing?

2007-01-08 Thread tsuraan

After reading
http://www.python.org/doc/faq/general/#how-does-python-manage-memory, I
tried modifying this program as below:

a=[]

for i in xrange(33,127):
    for j in xrange(33,127):
        for k in xrange(33,127):
            for l in xrange(33, 127):
                a.append(chr(i)+chr(j)+chr(k)+chr(l))

import sys
sys.exc_clear()
sys.exc_traceback = sys.last_traceback = None

del(a)

import gc
gc.collect()



And it still never frees up its memory.
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Suitability for long-running text processing?

2007-01-08 Thread Felipe Almeida Lessa
On 1/8/07, tsuraan [EMAIL PROTECTED] wrote:
[snip]
 The loop is deep enough that I always interrupt it once python's size is
 around 250 MB.  Once the gc.collect() call is finished, python's size has
 not changed a bit.
[snip]
 This has been tried under python 2.4.3 in gentoo linux and python 2.3 under
 OS X.3.  Any suggestions/work arounds would be much appreciated.

I just tried on my system

(Python is using 2.9 MiB)
>>> a = ['a' * (1 << 20) for i in xrange(300)]
(Python is using 304.1 MiB)
>>> del a
(Python is using 2.9 MiB -- as before)

And I didn't even need to tell the garbage collector to do its job. Some info:

$ cat /etc/issue
Ubuntu 6.10 \n \l

$ uname -r
2.6.19-ck2

$ python -V
Python 2.4.4c1

-- 
Felipe.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Suitability for long-running text processing?

2007-01-08 Thread tsuraan

I just tried on my system

(Python is using 2.9 MiB)
>>> a = ['a' * (1 << 20) for i in xrange(300)]
(Python is using 304.1 MiB)
>>> del a
(Python is using 2.9 MiB -- as before)

And I didn't even need to tell the garbage collector to do its job. Some
info:



It looks like the big difference between our two programs is that you have
one huge string repeated 300 times, whereas I have thousands of
four-character strings.  Are small strings ever collected by python?
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Suitability for long-running text processing?

2007-01-08 Thread Felipe Almeida Lessa
On 1/8/07, tsuraan [EMAIL PROTECTED] wrote:


  I just tried on my system
 
  (Python is using 2.9 MiB)
  >>> a = ['a' * (1 << 20) for i in xrange(300)]
  (Python is using 304.1 MiB)
  >>> del a
  (Python is using 2.9 MiB -- as before)
 
  And I didn't even need to tell the garbage collector to do its job. Some
 info:

 It looks like the big difference between our two programs is that you have
 one huge string repeated 300 times, whereas I have thousands of
 four-character strings.  Are small strings ever collected by python?

In my test there are 300 strings of 1 MiB, not a huge string repeated. However:

$ python
Python 2.4.4c1 (#2, Oct 11 2006, 21:51:02)
[GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> # Python is using 2.7 MiB
... a = ['1234' for i in xrange(10 << 20)]
>>> # Python is using 42.9 MiB
... del a
>>> # Python is using 2.9 MiB

With 10,485,760 strings of 4 chars, it still works as expected.

-- 
Felipe.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Suitability for long-running text processing?

2007-01-08 Thread Chris Mellon
On 1/8/07, Felipe Almeida Lessa [EMAIL PROTECTED] wrote:
 On 1/8/07, tsuraan [EMAIL PROTECTED] wrote:
 
 
   I just tried on my system
  
   (Python is using 2.9 MiB)
   >>> a = ['a' * (1 << 20) for i in xrange(300)]
   (Python is using 304.1 MiB)
   >>> del a
   (Python is using 2.9 MiB -- as before)
  
   And I didn't even need to tell the garbage collector to do its job. Some
  info:
 
  It looks like the big difference between our two programs is that you have
  one huge string repeated 300 times, whereas I have thousands of
  four-character strings.  Are small strings ever collected by python?

 In my test there are 300 strings of 1 MiB, not a huge string repeated. 
 However:

 $ python
 Python 2.4.4c1 (#2, Oct 11 2006, 21:51:02)
 [GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)] on linux2
 Type "help", "copyright", "credits" or "license" for more information.
 >>> # Python is using 2.7 MiB
 ... a = ['1234' for i in xrange(10 << 20)]
 >>> # Python is using 42.9 MiB
 ... del a
 >>> # Python is using 2.9 MiB

 With 10,485,760 strings of 4 chars, it still works as expected.

 --
 Felipe.
 --

Have you actually run the OP's code? It has clearly different behavior
than what you are posting, and the OP's code, to me at least, seems
much more representative of real-world code. In your second case, you
have the *same* string 10,485,760 times, in the OPs case each string
is different.

My first thought was that interned strings were causing the growth,
but that doesn't seem to be the case. Regardless, what he's posting is
clearly different, and has different behavior, than what you are
posting. If you don't see the memory leak when you run the code he
posted (the *same* code) that'd be important information.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Suitability for long-running text processing?

2007-01-08 Thread tsuraan

$ python
Python 2.4.4c1 (#2, Oct 11 2006, 21:51:02)
[GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> # Python is using 2.7 MiB
... a = ['1234' for i in xrange(10 << 20)]
>>> # Python is using 42.9 MiB
... del a
>>> # Python is using 2.9 MiB

With 10,485,760 strings of 4 chars, it still works as expected.



Have you tried running the code I posted?  Is there any explanation as to
why the code I posted fails to ever be cleaned up?
In your specific example, you have a huge array of pointers to a single
string.  Try doing a[0] is a[1].  You'll get True.  Try a[0] is
'1'+'2'+'3'+'4'.  You'll get False.  Every element of a is a pointer to the
exact same string.  When you delete a, you're getting rid of a huge array of
pointers, but probably not actually losing the four-byte (plus gc overhead)
string '1234'.

So, does anybody know how to get python to free up _all_ of its allocated
strings?
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Suitability for long-running text processing?

2007-01-08 Thread tsuraan

My first thought was that interned strings were causing the growth,
but that doesn't seem to be the case.



Interned strings, as of 2.3, are no longer immortal, right?  The intern doc
says you have to keep a reference around to the string now, anyhow.  I
really wish I could find that thing I read a year and a half ago about
python never collecting small strings, but I just can't find it anymore.
Maybe it's time for me to go source diving...
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Suitability for long-running text processing?

2007-01-08 Thread Chris Mellon
On 1/8/07, tsuraan [EMAIL PROTECTED] wrote:


  My first thought was that interned strings were causing the growth,
  but that doesn't seem to be the case.

 Interned strings, as of 2.3, are no longer immortal, right?  The intern doc
 says you have to keep a reference around to the string now, anyhow.  I
 really wish I could find that thing I read a year and a half ago about
 python never collecting small strings, but I just can't find it anymore.
 Maybe it's time for me to go source diving...



I remember something about it coming up in some of the discussions of
free lists and better behavior in this regard in 2.5, but I don't
remember the details.

Interned strings aren't supposed to be immortal, these strings
shouldn't be automatically interned anyway (and my brief testing
seemed to bear that out) and calling _Py_ReleaseInternedStrings didn't
recover any memory, so I'm pretty sure interning is not the culprit.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Suitability for long-running text processing?

2007-01-08 Thread tsuraan

I remember something about it coming up in some of the discussions of
free lists and better behavior in this regard in 2.5, but I don't
remember the details.



Under Python 2.5, my original code posting no longer exhibits the bug - upon
calling del(a), python's size shrinks back to ~4 MB, which is its starting
size.  I guess I'll see how painful it is to migrate a gentoo system to
2.5... Thanks for the hint :)
-- 
http://mail.python.org/mailman/listinfo/python-list

Beginner question on text processing

2006-12-29 Thread Doran, Harold
I am beginning to use python primarily to organize data into formats
needed for input into some statistical packages. I do not have much
programming experience outside of LaTeX and R, so some of this is a bit
new. I am attempting to write a program that reads in a text file that
contains some values and it would then output a new file that has
manipulated this original text file in some manner.

To illustrate, assume I have a text file, call it test.txt, with the
following information:

X11 .32
X22 .45

My goal in the python program is to manipulate this file such that a new
file would be created that looks like:

X11 IPB = .32
X22 IPB = .45

Here is what I have accomplished so far.

# Python code below for sample program called 'test.py'

# Read in a file with the item parameters
filename = raw_input("Please enter the file you want to open: ")

params = open(filename, 'r')

for i in params:
    print 'IPB = ', i
# end code

This obviously results in the following:

IPB =  x11  .32
IPB =  x22  .45

So, my questions may be trivial, but:

1) How do I print the 'IPB = ' before the numbers? 
2) Is there a better way to prompt the user to open the desired file
rather than the way I have it above? For example, is there a built-in
function that would open a windows dialogue box such that a user who
does not know about path names can use windows to look for the file and
click on it. 
3) Last, what is the best way to have the output saved as a new file
called 'test2.txt'. The only way I know how to do this now is to do
something like:

python test.py > test2.txt

Thank you for any help
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Beginner question on text processing

2006-12-29 Thread skip

Harold To illustrate, assume I have a text file, call it test.txt, with
Harold the following information:

Harold X11 .32
Harold X22 .45

Harold My goal in the python program is to manipulate this file such
Harold that a new file would be created that looks like:

Harold X11 IPB = .32
Harold X22 IPB = .45

...

This is a problem with a number of different solutions.  Here's one way to
do it:

for line in open(filename, "r"):
    fields = line.split()
    print fields[0], "IPB =", fields[1]
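
For the other two questions, a minimal sketch (assuming Python 2.x with
Tkinter available; the output name test2.txt is just the example from the
original post):

import tkFileDialog

# let the user pick the input file from a standard dialog (question 2)
filename = tkFileDialog.askopenfilename()

# write the results to a new file instead of redirecting stdout (question 3)
out = open("test2.txt", "w")
for line in open(filename, "r"):
    fields = line.split()
    out.write("%s IPB = %s\n" % (fields[0], fields[1]))
out.close()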

Skip
-- 
http://mail.python.org/mailman/listinfo/python-list


fast text processing

2006-02-21 Thread Alexis Gallagher
(I tried to post this yesterday but I think my ISP ate it. Apologies if 
this is a double-post.)

Is it possible to do very fast string processing in python? My 
bioinformatics application needs to scan very large ASCII files (80GB+), 
compare adjacent lines, and conditionally do some further processing. I 
believe the disk i/o is the main bottleneck so for now that's what I'm 
optimizing. What I have now is roughly as follows (on python 2.3.5).

filehandle = open("data",'r',buffering=1000)

lastLine = filehandle.readline()

for currentLine in filehandle.readlines():

 lastTokens = lastLine.strip().split(delimiter)
 currentTokens = currentLine.strip().split(delimiter)

 lastGeno = extract(lastTokens[0])
 currentGeno = extract(currentTokens[0])

 # prepare for next iteration
 lastLine = currentLine

 if lastGeno == currentGeno:
     table.markEquivalent(int(lastTokens[1]),int(currentTokens[1]))

So on every iteration I'm processing mutable strings -- this seems 
wrong. What's the best way to speed this up? Can I switch to some fast 
byte-oriented immutable string library? Are there optimizing compilers? 
Are there better ways to prep the file handle?

Perhaps this is a job for C, but I am of that soft generation which 
fears memory management. I'd need to learn how to do buffered reading in 
C, how to wrap the C in python, and how to let the C call back into 
python to call markEquivalent(). It sounds painful. I _have_ done some 
benchmark comparisons of only the underlying line-based file reading 
against a Common Lisp version, but I doubt I'm using the optimal 
construct in either language so I hesitate to trust my results, and 
anyway the interlanguage bridge will be even more obscure in that case.

Much obliged for any help,
Alexis
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: fast text processing

2006-02-21 Thread Steve Holden
Alexis Gallagher wrote:
 (I tried to post this yesterday but I think my ISP ate it. Apologies if 
 this is a double-post.)
 
 Is it possible to do very fast string processing in python? My 
 bioinformatics application needs to scan very large ASCII files (80GB+), 
 compare adjacent lines, and conditionally do some further processing. I 
 believe the disk i/o is the main bottleneck so for now that's what I'm 
 optimizing. What I have now is roughly as follows (on python 2.3.5).
 
  filehandle = open("data",'r',buffering=1000)

This buffer size seems, shall we say, unadventurous? It's likely to slow 
things down considerably, since the filesystem is probably going to 
naturally want to use a rather larger value. I'd suggest a 64k minimum.
 
 lastLine = filehandle.readline()
 
I'd suggest

lastTokens = filehandle.readline().strip().split(delimiter)

here. You have no need of the line other than to split it into tokens.

 for currentLine in filehandle.readlines():
 
Note that this is going to read the whole file in to (virtual) memory 
before entering the loop. I somehow suspect you'd rather avoid this if 
you could. I further suspect your testing has been with smaller files 
than 80GB ;-). You might want to consider

for currentLine in filehandle:

as an alternative. This uses the file's generator properties to produce 
the next line each time round the loop.

  lastTokens = lastLine.strip().split(delimiter)

The line above goes away if you adopt the loop initialization suggestion 
above. Otherwise you are repeating the splitting of each line twice, 
once as the current line then again as the last line.

  currentTokens = currentLine.strip().split(delimiter)
 
  lastGeno = extract(lastTokens[0])
  currentGeno = extract(currentTokens[0])
 
If the extract() operation is stateless (in other words if it always 
produces the same output for a given input) then again you are 
unecessarily repeating yourself here by running extract() on the same 
data as the current first token and the last first token (if you see 
what I mean).

I might also observe that you seem to expect only two tokens per line. 
If this is invariable the case then you might want to consider writing 
an unpacking assignment instead, such as

cToken0, cToken1 = currentLine.strip().split(delimiter)

to save the indexing. Not a big deal, though, and it *will* break if you 
have more than one delimiter in a line, as the interpreter won't then 
know what to do with the third and subsequent elements.

  # prepare for next iteration
  lastLine = currentLine
 
Of course now you are going to try and strip the delimiter off the line 
and split it again when you loop around again. You should now just be 
able to say

 lastTokens = currentTokens

instead.

  if lastGeno == currentGeno:
      table.markEquivalent(int(lastTokens[1]),int(currentTokens[1]))
 
 So on every iteration I'm processing mutable strings -- this seems 
 wrong. What's the best way to speed this up? Can I switch to some fast 
 byte-oriented immutable string library? Are there optimizing compilers? 
 Are there better ways to prep the file handle?
 
I'm sorry but I am not sure where the mutable strings come in. Python 
strings are immutable anyway. Well-known for it. It might be a slight 
problem that you are creating a second terminator-less copy of each 
line, but that's probably inevitable.

Of course you leave us in the dark about the nature of 
table.markEquivalent as well. Depending on the efficiency of the 
extract() operation you might want to consider short-circuiting the loop 
if the two tokens have already been marked as equivalent. That might be 
a big win or not depending on relative efficiency.

 Perhaps this is a job for C, but I am of that soft generation which 
 fears memory management. I'd need to learn how to do buffered reading in 
 C, how to wrap the C in python, and how to let the C call back into 
 python to call markEquivalent(). It sounds painful. I _have_ done some 
 benchmark comparisons of only the underlying line-based file reading 
 against a Common Lisp version, but I doubt I'm using the optimal 
 construct in either language so I hesitate to trust my results, and 
 anyway the interlanguage bridge will be even more obscure in that case.
 
Probably the biggest gain will be in simply not reading the whole file 
into memory by calling its .readlines() method.

Summarising. consider something more like:

filehandle = open("data",'r',buffering=64*1024)
# You could also try just leaving the buffering spec out

lastTokens = filehandle.readline().strip().split(delim)
lastGeno = extract(lastTokens[0])

for currentLine in filehandle:

 currentTokens = currentLine.strip().split(delim)
 currentGeno = extract(currentTokens[0])
 if lastGeno == currentGeno:
 table.markEquivalent(int(lastTokens[1]), int(currentTokens[1]))

 lastGeno = currentGeno
 lastTokens = currentTokens

 Much obliged for any help,
 

Re: fast text processing

2006-02-21 Thread Ben Sizer
Maybe this code will be faster? (If it even does the same thing:
largely untested)


filehandle = open("data",'r',buffering=1000)
fileIter = iter(filehandle)

lastLine = fileIter.next()
lastTokens = lastLine.strip().split(delimiter)
lastGeno = extract(lastTokens[0])

for currentLine in fileIter:
    currentTokens = currentLine.strip().split(delimiter)
    currentGeno = extract(currentTokens[0])

    if lastGeno == currentGeno:
        table.markEquivalent(int(lastTokens[1]), int(currentTokens[1]))

    # prepare for next iteration
    lastLine = currentLine
    lastTokens = currentTokens
    lastGeno = currentGeno


I'd be tempted to try a bigger file buffer too, personally.

--
Ben Sizer

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: fast text processing

2006-02-21 Thread Alexis Gallagher
Steve,

First, many thanks!

Steve Holden wrote:
 Alexis Gallagher wrote:

 filehandle = open("data",'r',buffering=1000)
 
 This buffer size seems, shall we say, unadventurous? It's likely to slow 
 things down considerably, since the filesystem is probably going to 
 naturally want to use a rather larger value. I'd suggest a 64k minimum.

Good to know. I should have dug into the docs deeper. Somehow I thought 
it listed lines not bytes.

 for currentLine in filehandle.readlines():

 Note that this is going to read the whole file in to (virtual) memory 
 before entering the loop. I somehow suspect you'd rather avoid this if 
 you could. I further suspect your testing has been with smaller files 
 than 80GB ;-). You might want to consider
 

Oops! Thanks again. I thought that readlines() was the generator form, 
based on the docstring comments about the deprecation of xreadlines().

 So on every iteration I'm processing mutable strings -- this seems 
 wrong. What's the best way to speed this up? Can I switch to some fast 
 byte-oriented immutable string library? Are there optimizing 
 compilers? Are there better ways to prep the file handle?

 I'm sorry but I am not sure where the mutable strings come in. Python 
 strings are immutable anyway. Well-known for it.

I misspoke. I think I was mixing this up with the issue of object-creation 
overhead for all of the string handling in general. Is this a bottleneck 
to string processing in python, or is this a hangover from my Java days? 
I would have thought that dumping the standard string processing 
libraries in favor of byte manipulation would have been one of the 
biggest wins.

 Of course you leave us in the dark about the nature of 
 table.markEquivalent as well.

markEquivalent() implements union-join (aka, uptrees) to generate 
equivalence classes. Optimising that was going to be my next task

I feel a bit silly for missing the double-processing of everything. 
Thanks for pointing that out. And I will check out the biopython package.

I'm still curious if optimizing compilers are worth examining. For 
instance, I saw Pyrex and Psyco mentioned on earlier threads. I'm 
guessing that both this tokenizing and the uptree implementations sound 
like good candidates for one of those tools, once I shake out these 
algorithmic problems.


alexis
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: fast text processing

2006-02-21 Thread Larry Bates


Alexis Gallagher wrote:
 Steve,
 
 First, many thanks!
 
 Steve Holden wrote:
 Alexis Gallagher wrote:

 filehandle = open("data",'r',buffering=1000)

 This buffer size seems, shall we say, unadventurous? It's likely to
 slow things down considerably, since the filesystem is probably going
 to naturally want to use a rather larger value. I'd suggest a 64k minimum.
 
 Good to know. I should have dug into the docs deeper. Somehow I thought
 it listed lines not bytes.
 
 for currentLine in filehandle.readlines():

 Note that this is going to read the whole file in to (virtual) memory
 before entering the loop. I somehow suspect you'd rather avoid this if
 you could. I further suspect your testing has been with smaller files
 than 80GB ;-). You might want to consider

 
 Oops! Thanks again. I thought that readlines() was the generator form,
 based on the docstring comments about the deprecation of xreadlines().
 
 So on every iteration I'm processing mutable strings -- this seems
 wrong. What's the best way to speed this up? Can I switch to some
 fast byte-oriented immutable string library? Are there optimizing
 compilers? Are there better ways to prep the file handle?

 I'm sorry but I am not sure where the mutable strings come in. Python
 strings are immutable anyway. Well-known for it.
 
 I misspoke. I think I was mixing this up with the issue of object-creation
 overhead for all of the string handling in general. Is this a bottleneck
 to string processing in python, or is this a hangover from my Java days?
 I would have thought that dumping the standard string processing
 libraries in favor of byte manipulation would have been one of the
 biggest wins.
 
 Of course you leave us in the dark about the nature of
 table.markEquivalent as well.
 
 markEquivalent() implements union-join (aka, uptrees) to generate
 equivalence classes. Optimising that was going to be my next task
 
 I feel a bit silly for missing the double-processing of everything.
 Thanks for pointing that out. And I will check out the biopython package.
 
 I'm still curious if optimizing compilers are worth examining. For
 instance, I saw Pyrex and Psyco mentioned on earlier threads. I'm
 guessing that both this tokenizing and the uptree implementations sound
 like good candidates for one of those tools, once I shake out these
 algorithmic problems.
 
 
 alexis

When your problem is I/O bound there is almost nothing that can be
done to speed it up without some sort of refactoring of the input
data itself.  Python reads bytes off a hard drive just as fast as
any compiled language.  A good test is to copy the file and measure
the time.  You can't make your program run any faster than a copy
of the file itself without making hardware changes (e.g. RAID
arrays, etc.).
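
A quick way to get that baseline (a sketch; "data" is the file name from
this thread, the copy name is hypothetical):

import time, shutil

start = time.time()
shutil.copyfile("data", "data.copy")   # pure I/O, no parsing at all
print "copy took %.1f seconds" % (time.time() - start)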

You might also want to take a look at the csv module.  Reading lines
and splitting on delimiters is almost always handled well by csv.
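
For example (a sketch, assuming a tab delimiter; adjust to the real one;
extract() and table are the names used earlier in this thread):

import csv

reader = csv.reader(open("data", "rb"), delimiter="\t")
lastRow = reader.next()
lastGeno = extract(lastRow[0])
for row in reader:
    currentGeno = extract(row[0])
    if lastGeno == currentGeno:
        table.markEquivalent(int(lastRow[1]), int(row[1]))
    lastRow, lastGeno = row, currentGeno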

-Larry Bates
-- 
http://mail.python.org/mailman/listinfo/python-list


Newbie Text Processing Question

2005-10-04 Thread gshepherd281
Hi,

I'm a total newbie to Python so any and all advice is greatly
appreciated.

I'm trying to use regular expressions to process text in an SGML file
but only in one section.

So the input would look like this:

<ch-part no="I"><title>RESEARCH GUIDE
<sec-main no="1.01"><title>content
<para>content

<sec-main no="2.01"><title>content
<para>content


<ch-part no="II"><title>FORMS
<sec-main no="3.01"><title>content

<sec-sub1 no="1"><title>content
<para>content

<sec-sub2 no="1"><title>content
<para>content


and the output like this:

<ch-part no="I"><title>RESEARCH GUIDE
<sec-main no="1.01"><title>content
<biblio>
<para>content
</biblio>

<sec-main no="2.01"><title>content
<biblio>
<para>content
</biblio>

<ch-part no="II"><title>FORMS
<sec-main no="3.01"><title>content

<sec-sub1 no="1"><title>content
<para>content

<sec-sub2 no="1"><title>content
<para>content


But no matter what I try I end up changing the entire file rather than
just one part.

Here's what I've come up with so far but I can't think of anything
else.

***

import os, re
setpath = raw_input("Enter the path where the program should run: ")
print

for root, dirs, files in os.walk(setpath):
 fname = files
 for fname in files:
  inputFile = file(os.path.join(root,fname), 'r')
  line = inputFile.read()
  inputFile.close()


  chpart_pattern = re.compile(r'<ch-part no=\"[A-Z]{1,4}\"><title>(RESEARCH)',
                              re.IGNORECASE)

  while 1:
   if chpart_pattern.search(line):
    line = re.sub(r"<sec-main no=(\"[0-9]*.[0-9]*\")><title>(.*)",
                  r"<sec-main no=\1><title>\2\n<biblio>", line)
    outputFile = file(os.path.join(root,fname), 'w')
    outputFile.write(line)
    outputFile.close()
    break

   if chpart_pattern.search(line) is None:
    print 'none'
    break

Thanks,

Greg

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Newbie Text Processing Question

2005-10-04 Thread Gregory Piñero
That's how Python works. You read in the whole file, edit it, and
write it back out. As far as I know there's no way to edit a file
in place, which I'm assuming is what you're asking?

And now, cue the responses telling you to use a fancy parser (XML?) for your project ;-)

-Greg
On 4 Oct 2005 20:04:39 -0700, [EMAIL PROTECTED]
[EMAIL PROTECTED] wrote:
 [snip - full quote of the original message above]
-- 
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Newbie Text Processing Question

2005-10-04 Thread James Stroud
You can edit a file in place, but it is not applicable to what you are doing. 
As soon as you insert the first biblio, you've shifted everything 
downstream by those 8 bytes. Since they map to physically located blocks on 
a physical drive, you will have to rewrite those blocks. If it is a big file 
you can do something conceptually similar to piping, where the original file 
is read in line by line and a new file is created:

afile = open("somefile.xml")
newfile = open("somenewfile.xml", "w")
for aline in afile:
  if tests_positive(aline):
    newfile.write(make_the_prelude(aline))
    newfile.write(aline)
    newfile.write(make_the_afterlude(aline))
  else:
    newfile.write(aline)
afile.close()
newfile.close()

James

On Tuesday 04 October 2005 20:13, Gregory Piñero wrote:
 That's how Python works. You read in the whole file, edit it, and write it
 back out. As far as I know there's no way to edit a file in place which
 I'm assuming is what you're asking?

-- 
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Newbie Text Processing Question

2005-10-04 Thread Mike Meyer
[EMAIL PROTECTED] writes:
 I'm a total newbie to Python so any and all advice is greatly
 appreciated.

Well, I've got some for you.

 I'm trying to use regular expressions to process text in an SGML file
 but only in one section.

This is generally a bad idea. SGML family languages aren't easy to
parse - even the ones that were designed to be easy to parse - and
generally require very complex regular expressions to get right. It may
be that your SGML data can be parsed by the re you use, but there
are almost certainly valid SGML documents that your parser will not
properly parse.

In general, it's better to use a parser for the language in question.

 So the input would look like this:

 <ch-part no="I"><title>RESEARCH GUIDE
 <sec-main no="1.01"><title>content
 <para>content

 <sec-main no="2.01"><title>content
 <para>content


 <ch-part no="II"><title>FORMS
 <sec-main no="3.01"><title>content

 <sec-sub1 no="1"><title>content
 <para>content

 <sec-sub2 no="1"><title>content
 <para>content


This is funny-looking SGML. Are the end tags really optional for
all the tag types?

 But no matter what I try I end up changing the entire file rather than
 just one part.

Other have explained why you can't do that, so I'll skip it.

 Here's what I've come up with so far but I can't think of anything
 else.

 ***

 import os, re
 setpath = raw_input("Enter the path where the program should run: ")
 print

 for root, dirs, files in os.walk(setpath):
  fname = files
  for fname in files:
   inputFile = file(os.path.join(root,fname), 'r')
   line = inputFile.read()
   inputFile.close()


   chpart_pattern = re.compile(r'<ch-part no=\"[A-Z]{1,4}\"><title>(RESEARCH)', re.IGNORECASE)

This makes a number of assumptions that are invalid about SGML in
general, but may be valid for your sample text - how attributes are
quoted, the lack of line breaks, which can be added without changing
the content, and the format of the no attribute.

   while 1:
    if chpart_pattern.search(line):
     line = re.sub(r"<sec-main no=(\"[0-9]*.[0-9]*\")><title>(.*)",
                   r"<sec-main no=\1><title>\2\n<biblio>", line)

Ditto.

Here's an sgmllib solution that does what you do above, except
it writes to standard out:

#!/usr/bin/env python

import sys
from sgmllib import SGMLParser

datain = """
<ch-part no="I"><title>RESEARCH GUIDE
<sec-main no="1.01"><title>content
<para>content

<sec-main no="2.01"><title>content
<para>content


<ch-part no="II"><title>FORMS
<sec-main no="3.01"><title>content

<sec-sub1 no="1"><title>content
<para>content

<sec-sub2 no="1"><title>content
<para>content
"""


class Parser(SGMLParser):

    def __init__(self):
        # install the handlers with funny names
        setattr(self, "start_ch-part", self.handle_ch_part)

        # And start with chapter 0
        self.ch_num = 0

        SGMLParser.__init__(self)

    def format_attributes(self, attributes):
        return ['%s="%s"' % pair for pair in attributes]

    def unknown_starttag(self, tag, attributes):
        taglist = self.format_attributes(attributes)
        taglist.insert(0, tag)
        sys.stdout.write('<%s>' % ' '.join(taglist))

    def handle_data(self, data):
        sys.stdout.write(data)

    def handle_ch_part(self, attributes):
        """This should be called start_ch-part, but, well, you know."""

        self.unknown_starttag('ch-part', attributes)
        for name, value in attributes:
            if name == 'no':
                self.ch_num = value

    def start_para(self, attributes):
        if self.ch_num == 'I':
            sys.stdout.write('<biblio>\n')
        self.unknown_starttag('para', attributes)


parser  = Parser()
parser.feed(datain)
parser.close()


sgmllib isn't a very good SGML parser - it was written to support
htmllib, and really only handles that subset of sgml well. In
particular, it doesn't really understand DTDs, so can't handle the
missing end tags in your example. You may be able to work around that.

If you can coerce this to XML, then the xml tools in the standard
library will work well. For HTML, I like BeautifulSoup, but that's
mostly because it deals with all the crud on the net that is passed
off as HTML. For SGML - well, I don't have a good answer. Last time I
had to deal with real SGML, I used a C parser that spat out a parse
tree that could be parsed properly.
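
If the data does get coerced to XML (end tags added, one root element
wrapped around the chapters), a sketch along these lines would do the
biblio insertion; this assumes ElementTree (xml.etree in Python 2.5, a
separate package before that) and hypothetical file names:

from xml.etree import ElementTree

tree = ElementTree.parse("research.xml")
for ch_part in tree.getroot().findall("ch-part"):
    if ch_part.get("no") != "I":
        continue  # only the RESEARCH GUIDE chapter gets biblio wrappers
    for sec in ch_part.findall("sec-main"):
        for i, child in enumerate(list(sec)):
            if child.tag == "para":
                # replace each para with a biblio element wrapping it
                biblio = ElementTree.Element("biblio")
                sec.remove(child)
                biblio.append(child)
                sec.insert(i, biblio)
tree.write("research-out.xml")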

 mike
-- 
Mike Meyer [EMAIL PROTECTED]  http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Newbie Text Processing Question

2005-10-04 Thread Fredrik Lundh
Gregory Piñero wrote:

That's how Python works. You read in the whole file, edit it, and write it
 back out.

that's how file systems work.  if file systems generally supported insert
operations, Python would of course support that feature.

/F



-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Improving my text processing script

2005-09-01 Thread Paul McGuire
Even though you are using re's to try to look for specific substrings
(which you sort of fake in by splitting on "Identifier", and then
prepending "Identifier" to every list element, so that the re will
match...), this program has quite a few holes.

What if the word "Identifier" is inside one of the quoted strings?
What if the actual value is "tablename10"?  This will match your
"tablename1" string search, but it is certainly not what you want.
Did you know there are trailing blanks on your table names, which could
prevent any program name from matching?

So here is an alternative approach using, as many have probably
predicted by now if they've spent any time on this list, the pyparsing
module.  You may ask, "isn't a parser overkill for this problem?" and
the answer will likely be "probably", but in the case of pyparsing, I'd
answer "probably, but it is so easy, and takes care of so much junk
like dealing with quoted strings and intermixed data, so, who cares if
it's overkill?"

So here is the 20-line pyparsing solution; insert it into your program
after you have read in tlst, and read in the input data using something
like data = file('plst').read().  (The first line strips the whitespace
from the ends of your table names.)

tlst = map(str.rstrip, tlst)

from pyparsing import quotedString, LineStart, LineEnd, removeQuotes
quotedString.setParseAction( removeQuotes )

identLine = (LineStart() + "Identifier" + quotedString +
             LineEnd()).setResultsName("identifier")
tableLine = (LineStart() + "Value" + quotedString +
             LineEnd()).setResultsName("tableref")

interestingLines = ( identLine | tableLine )
thisprog = ""
for toks,start,end in interestingLines.scanString( data ):
    toktype = toks.getName()
    if toktype == 'identifier':
        thisprog = toks[1]
    elif toktype == 'tableref':
        thistable = toks[1]
        if thistable in tlst:
            print '%s,%s' % (thisprog, thistable)
        else:
            print "Not", thisprog, "contains wrong table (" + thistable + ")"

This program will print out:
Program1,tablename2
Program 2,tablename2


Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Improving my text processing script

2005-09-01 Thread Miki Tebeka
Hello pruebauno,

 import re
 f=file('tlst')
 tlst=f.read().split('\n')
 f.close()
tlst = open('tlst').readlines()

 f=file('plst')
 sep=re.compile('Identifier "(.*?)"')
 plst=[]
 for elem in f.read().split('Identifier'):
   content='Identifier'+elem
   match=sep.search(content)
   if match:
       plst.append((match.group(1),content))
 f.close()
Look at re.findall, I think it'll be easier.

 flst=[]
 for table in tlst:
   for prog,content in plst:
    if content.find(table)>0:
 if table in content:
   flst.append('%s,%s'%(prog,table))

 flst.sort()
 for elem in flst:
   print elem
print "\n".join(sorted(flst))

HTH.
--

Miki Tebeka [EMAIL PROTECTED]
http://tebeka.bizhat.com
The only difference between children and adults is the price of the toys


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Improving my text processing script

2005-09-01 Thread pruebauno
Paul McGuire wrote:
 match...), this program has quite a few holes.

 What if the word Identifier is inside one of the quoted strings?
 What if the actual value is tablename10?  This will match your
 tablename1 string search, but it is certainly not what you want.
 Did you know there are trailing blanks on your table names, which could
 prevent any program name from matching?

Good point. I did not think about that. I got lucky: none of the table
names had trailing blanks (Google Groups seems to add those), the word
"Identifier" is not used inside quoted strings anywhere, and I do not
have "tablename10", but I do have "dba.tablename1", and that one has to
match "tablename1" (and magically did).


 So here is an alternative approach using, as many have probably
 predicted by now if they've spent any time on this list, the pyparsing
 module.  You may ask, isn't a parser overkill for this problem? and

You had to plug pyparsing! :-) Thanks for the info; I did not know
something like pyparsing existed. Thanks for the code too, because
looking at the module it was not totally obvious to me how to use it. I
tried to run it, though, and it is not working for me. The following
code runs but prints nothing at all:

import pyparsing as prs

f=file('tlst'); tlst=[ln.strip() for ln in f if ln]; f.close()
f=file('plst'); plst=f.read()  ; f.close()

prs.quotedString.setParseAction(prs.removeQuotes)

identLine=(prs.LineStart()
  + 'Identifier'
  + prs.quotedString
  + prs.LineEnd()
  ).setResultsName('prog')

tableLine=(prs.LineStart()
  + 'Value'
  + prs.quotedString
  + prs.LineEnd()
  ).setResultsName('table')

interestingLines=(identLine | tableLine)

for toks,start,end in interestingLines.scanString(plst):
    print toks,start,end

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Improving my text processing script

2005-09-01 Thread pruebauno
Miki Tebeka wrote:

 Look at re.findall, I think it'll be easier.

Minor changes aside, the interesting thing, as you pointed out, would be
using re.findall. I could not figure out how to use it.
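
One possible shape for it (a sketch, assuming the quoted-value plst
format shown elsewhere in the thread) is a single findall over the file:

import re

data = open('plst').read()
# each match is a (program, body) pair; the body group runs up to the
# next Identifier line or the end of the file
pairs = re.findall(r'Identifier "(.*?)"(.*?)(?=Identifier "|\Z)', data, re.S)
for prog, content in pairs:
    print prog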

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Improving my text processing script

2005-09-01 Thread pruebauno
[EMAIL PROTECTED] wrote:
 Paul McGuire wrote:
  match...), this program has quite a few holes.

 tried run it though and it is not working for me. The following code
 runs but prints nothing at all:

 import pyparsing as prs

And this is the point where I have to post the real stuff, because your
code works with the example I posted and not with the real thing. The
identifier I am interested in is (if I understood the requirements
correctly) the one after the title with the stars.

So here is the real data for plst, some info replaced with z's to
protect privacy:

*

   Identifier "zzz0main"

*

   Identifier "zz501"

 Value "zzz_CLCL_,zz_ID"

 Name "z"

 Name "zz"

*

   Identifier "3main"

*

   Identifier "zzz505"

 Value "dba.zzz_CKPY__SUM"

 Name "xxx_xxx_xxx_DT"

--

 Value "zzz__zzz_zzz"

 Name "zzz_zz_zzz"

--

 Value "zzz_zzz_zzz_HIST"

 Name "zzz_zzz"

--


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Improving my text processing script

2005-09-01 Thread Paul McGuire
Yes indeed, the real data often has surprising differences from the
simulations! :)

It turns out that pyparsing LineStart()'s are pretty fussy.  Usually,
pyparsing is very forgiving about whitespace between expressions, but
it turns out that LineStart *must* be followed by the next expression,
with no leading whitespace.

Fortunately, your syntax is really quite forgiving, in that your
key-value pairs appear to always be an unquoted word (for the key) and
a quoted string (for the value).  So you should be able to get this
working just by dropping the LineStart()'s from your expressions, that
is:

identLine=('Identifier'
  + prs.quotedString
  + prs.LineEnd()
  ).setResultsName('prog')


tableLine=('Value'
  + prs.quotedString
  + prs.LineEnd()
  ).setResultsName('table')

See if that works any better for you.

-- Paul

-- 
http://mail.python.org/mailman/listinfo/python-list


Improving my text processing script

2005-08-31 Thread pruebauno
I am sure there is a better way of writing this, but how?

import re
f=file('tlst')
tlst=f.read().split('\n')
f.close()
f=file('plst')
sep=re.compile('Identifier "(.*?)"')
plst=[]
for elem in f.read().split('Identifier'):
    content='Identifier'+elem
    match=sep.search(content)
    if match:
        plst.append((match.group(1),content))
f.close()
flst=[]
for table in tlst:
    for prog,content in plst:
        if content.find(table)>0:
            flst.append('%s,%s'%(prog,table))
flst.sort()
for elem in flst:
    print elem



What would be the best way of writing this program? BTW find>0 to check
in case table=='' (empty line) so I do not include everything.

tlst is of the form:

tablename1
tablename2

...

plst is of the form:

Identifier "Program1"
Name "Random Stuff"
Value "tablename2"
...other random properties
Name "More Random Stuff"
Identifier "Program 2"
Name "Yet more stuff"
Value "tablename2"
...


I want to know in what programs are the tables in tlst (and only those)
used.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: text processing problem

2005-04-08 Thread Matt

Maurice LING wrote:
 Matt wrote:
  I'd HIGHLY suggest purchasing the excellent Mastering Regular
  Expressions (http://www.oreilly.com/catalog/regex2/index.html) by Jeff
  Friedl.  Although it's mostly geared towards Perl, it will answer all
  your questions about regular expressions.  If you're going to work
  with regexs, this is a must-have.
 
  That being said, here's what the new regular expression should be with
  a bit of instruction (in the spirit of teaching someone to fish after
  giving them a fish ;-)   )
 
  my_expr = re.compile(r'(\w+)\s*(\(\1\))')
 
  Note the "\s*", in place of the single space " ".  The "\s" means any
  whitespace character (equivalent to [ \t\n\r\f\v]).  The "*" following
  it means 0 or more occurrences.  So this will now match:
 
  there  (there)
  there (there)
  there(there)
  there  (there)
  there\t(there) (tab)
  there\t\t\t\t\t\t\t\t\t\t\t\t(there)
  etc.
 
  Hope that's helpful.  Pick up the book!
 
  M@
 

 Thanks again. I've read a number of tutorials on regular expressions but
 it's something that I hardly used in the past, so gone far too rusty.

 Before my post, I've tried
 my_expr = re.compile(r'(\w+) \s* (\(\1\))') instead but it doesn't work,
 so I'm a bit stumped..

 Thanks again,
 Maurice

Maurice,
The reason your regex failed is because you have spaces around the
\s*.  This translates to one space, followed by zero or more
whitespace elements, followed by one space.  So your regex would only
match the two text elements separated by at least 2 spaces.

This kind of demonstrates why regular expressions can drive you nuts.

I still suggest picking up the book; not because Jeff Friedl drove a
dump truck full of money up to my door, but because it specifically has
a use case like yours.  So you get to learn & solve your problem at the
same time!

HTH,
M@

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: text processing problem

2005-04-08 Thread Leif K-Brooks
Maurice LING wrote:
I'm looking for a way to do this: I need to scan a text (paragraph or 
so) and look for occurrences of text-x (text-x). That is, if the 
text just before the open bracket is the same as the text in the 
brackets, then I have to delete the brackets, with the text in it.

How's this?
import re
bracket_re = re.compile(r'(.*?)\s*\(\1\)')
def remove_brackets(text):
    return bracket_re.sub('\\1', text)
--
http://mail.python.org/mailman/listinfo/python-list


Re: text processing problem

2005-04-07 Thread Matt

Maurice LING wrote:
 Hi,

 I'm looking for a way to do this: I need to scan a text (paragraph or
 so) and look for occurrences of "text-x (text-x)". That is, if the
 text just before the open bracket is the same as the text in the
 brackets, then I have to delete the brackets, with the text in it.

 Does anyone knows any way to achieve this?

 The closest I've seen is
 (http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305306) by
 Raymond Hettinger

  >>> s = 'People of [planet], take us to your leader.'
  >>> d = dict(planet='Earth')
  >>> print convert_template(s) % d
  People of Earth, take us to your leader.

  >>> s = 'People of <planet>, take us to your leader.'
  >>> print convert_template(s, '<', '>') % d
  People of Earth, take us to your leader.

 

 import re

 def convert_template(template, opener='[', closer=']'):
  opener = re.escape(opener)
  closer = re.escape(closer)
  pattern = re.compile(opener + '([_A-Za-z][_A-Za-z0-9]*)' + closer)
  return re.sub(pattern, r'%(\1)s', template.replace('%','%%'))

 Cheers
 Maurice


Try this:
import re
my_expr = re.compile(r'(\w+) (\(\1\))')
s = "this is (is) a test"
print my_expr.sub(r'\1', s)
#prints 'this is a test'

M@

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: text processing problem

2005-04-07 Thread Maurice LING
Matt wrote:

Try this:
import re
my_expr = re.compile(r'(\w+) (\(\1\))')
s = "this is (is) a test"
print my_expr.sub(r'\1', s)
#prints 'this is a test'
M@
Thank you Matt. It works out well. The only thing that gives it a problem 
is in cases such as "there  (there)", where between the word and the same 
bracketed word there is more than one whitespace...

Cheers
Maurice
--
http://mail.python.org/mailman/listinfo/python-list

