Re: [Tutor] man pages parsing (still)

2006-09-12 Thread Tiago Saboga
On Monday, 11 September 2006 19:45, Kent Johnson wrote:
 Tiago Saboga wrote:
  Ok, the guilty line (279) has a &copy; that was probably defined in the
  dtd, but as it doesn't know what is the right dtd... But wait... How does
  python read the dtd? It fetches it from the net? I tried it
  (disconnected) and the answer is yes, it fetches it from the net. So
  that's the problem!
 
  But how do I avoid it? I'll search. But if you can spare me some time,
  you'll make me a little happier.
 
  [1] - The line is as follows:
  <!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
     "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">

 I'm just guessing, but I think if you find the right combination of
 handlers and feature settings you can at least make it just pass through
 the external entities without looking up the DTDs.

I got it! I just set the SAX feature feature_external_ges to false on the
parser, and it doesn't fetch the DTD any more. Thanks!!! ;-)
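
For anyone reading this in the archive, here is a minimal sketch of that fix, written for current Python 3 rather than the Python 2.3 used in this thread. The sample document and the names TextCollector, RecordingResolver and parse_without_dtd are made up for illustration; only make_parser(), setFeature() and feature_external_ges come from xml.sax. The resolver is there only to show that the parser never asks for the DTD. Recent Python 3 releases already leave this feature off by default, so setting it explicitly mainly documents the intent.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges

# Hypothetical sample standing in for doclifter's DocBook output.
SAMPLE = b"""<?xml version="1.0"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<refentry><refnamediv><refname>cp</refname></refnamediv></refentry>
"""


class TextCollector(ContentHandler):
    """Collect every chunk of character data the parser reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


class RecordingResolver(EntityResolver):
    """Remember which external entities the parser asked for."""

    def __init__(self):
        self.requested = []

    def resolveEntity(self, publicId, systemId):
        self.requested.append(systemId)
        return systemId


def parse_without_dtd(data):
    """Parse data with external general entities switched off."""
    parser = xml.sax.make_parser()
    # The setting from this thread: do not load external entities,
    # which includes the DTD named in the DOCTYPE line.
    parser.setFeature(feature_external_ges, False)
    handler = TextCollector()
    resolver = RecordingResolver()
    parser.setContentHandler(handler)
    parser.setEntityResolver(resolver)
    parser.parse(io.BytesIO(data))
    return "".join(handler.chunks), resolver.requested
```

Calling parse_without_dtd(SAMPLE) should return the collected text together with an empty list of requested entities, and it works with the network unplugged.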


 Take a look at these pages for some hints:
 http://www.cafeconleche.org/books/xmljava/chapters/ch07s02.html#d0e10350
 http://www.cafeconleche.org/books/xmljava/chapters/ch06s11.html

It looks very interesting, and it was exactly what I needed. But I couldn't
grasp it all at first; I need some more time to understand it.

Thanks again!!!

Tiago.
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


[Tutor] man pages parsing (still)

2006-09-11 Thread Tiago Saboga
I'm still there, trying to parse man pages (I want to gather a list of all 
options with their help strings). I've tried to use regex on both the 
formatted output of man and the source troff files and I discovered what is 
already said in the doclifter man page: you have to use a number of
heuristics, and it's really not simple. So I'm now using doclifter, and it's
working, but it is terribly slow. Doclifter itself takes around a second to
parse the troff file, but my few lines of code take 25 seconds to parse the
resulting XML. I've
pasted the code at http://pastebin.ca/166941
and I'd like to hear from you how I could possibly optimize it.

Thanks,

Tiago.


Re: [Tutor] man pages parsing (still)

2006-09-11 Thread Kent Johnson
Tiago Saboga wrote:
 I'm still there, trying to parse man pages (I want to gather a list of all 
 options with their help strings). I've tried to use regex on both the 
 formatted output of man and the source troff files and I discovered what is 
 already said in the doclifter man page: you have to use a number of
 heuristics, and it's really not simple. So I'm now using doclifter, and it's
 working, but it is terribly slow. Doclifter itself takes around a second to
 parse the troff file, but my few lines of code take 25 seconds to parse the
 resulting XML. I've
 pasted the code at http://pastebin.ca/166941
 and I'd like to hear from you how I could possibly optimize it.

How big is the XML? 25 seconds is a long time...I would look at 
cElementTree (implementation of ElementTree in C), it is pretty fast.
http://effbot.org/zone/celementtree.htm

In particular iterparse() might be helpful:
http://effbot.org/zone/element-iterparse.htm
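
As a rough idea of what that could look like: in current Python 3 the C implementation is used automatically by xml.etree.ElementTree, so the sketch below imports that module instead of cElementTree. The element names (varlistentry, term, listitem) are an assumption about what doclifter's DocBook output contains, and SAMPLE, flat_text and iter_options are made-up names for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical DocBook-style fragment; real doclifter output may differ.
SAMPLE = b"""<refentry>
  <variablelist>
    <varlistentry>
      <term><option>-f</option>, <option>--force</option></term>
      <listitem><para>remove existing destination files</para></listitem>
    </varlistentry>
    <varlistentry>
      <term><option>-v</option></term>
      <listitem><para>explain what
        is being done</para></listitem>
    </varlistentry>
  </variablelist>
</refentry>"""


def flat_text(elem):
    """Join all text below elem and collapse runs of whitespace."""
    if elem is None:
        return ""
    return " ".join("".join(elem.itertext()).split())


def iter_options(source):
    """Yield (option, help) pairs as each varlistentry is completed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "varlistentry":
            yield flat_text(elem.find("term")), flat_text(elem.find("listitem"))
            elem.clear()  # drop the subtree that is no longer needed
```

Usage would be something like `for option, help_text in iter_options(io.BytesIO(SAMPLE)): print(option, help_text)`; a file name or an open file works in place of the BytesIO object.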

I would also try specifying a buffer size in the call to os.popen2(); if
the I/O is unbuffered or the buffer is small, that might be the bottleneck.

Kent



Re: [Tutor] man pages parsing (still)

2006-09-11 Thread Tiago Saboga
On Monday, 11 September 2006 11:15, Kent Johnson wrote:
 Tiago Saboga wrote:
  I'm still there, trying to parse man pages (I want to gather a list of
  all options with their help strings). I've tried to use regex on both the
  formatted output of man and the source troff files and I discovered what
  is already said in the doclifter man page: you have to use a number of
  heuristics, and it's really not simple. So I'm now using doclifter, and it's
  working, but it is terribly slow. Doclifter itself takes around a second to
  parse the troff file, but my few lines of code take 25 seconds to parse
  the resultant xml. I've pasted the code at http://pastebin.ca/166941
  and I'd like to hear from you how I could possibly optimize it.

 How big is the XML? 25 seconds is a long time...I would look at
 cElementTree (implementation of ElementTree in C), it is pretty fast.
 http://effbot.org/zone/celementtree.htm

It's about 10k. Hey, it seems easy, but I'd rather not start over again. Of
course, if it's the only solution... 25 seconds (28, in fact, for the cp man
page) isn't really acceptable.

 In particular iterparse() might be helpful:
 http://effbot.org/zone/element-iterparse.htm

OK, I'll look at that.

 I would also try specifying a buffer size in the call to os.popen2(), if
 the I/O is unbuffered or the buffer is small that might be the bottleneck.

What's appropriate in that case? I really don't understand how I should 
determine a buffer size. Any pointers?

Thanks,

Tiago.


Re: [Tutor] man pages parsing (still)

2006-09-11 Thread Kent Johnson
Tiago Saboga wrote:
 On Monday, 11 September 2006 11:15, Kent Johnson wrote:
 Tiago Saboga wrote:
 How big is the XML? 25 seconds is a long time...I would look at
 cElementTree (implementation of ElementTree in C), it is pretty fast.
 http://effbot.org/zone/celementtree.htm
 
 It's about 10k. Hey, it seems easy, but I'd like not to start over again. Of 
 course, if it's the only solution... 25 (28, in fact, for the cp man page) 
 isn't really acceptable.

That's tiny! No way it should take 25 seconds to parse a 10k file.

Have you tried saving the file separately and parsing from disk? That 
would help determine if the interprocess pipe is the problem.
 
 I would also try specifying a buffer size in the call to os.popen2(), if
 the I/O is unbuffered or the buffer is small that might be the bottleneck.
 
 What's appropriate in that case? I really don't understand how I should 
 determine a buffer size. Any pointers?

To tell the truth I don't use popen myself so if anyone else wants to 
chime in that would be fine...but I would try maybe 1024 or 10240 (10k).
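
A sketch of what passing a buffer size looks like, with the caveat that os.popen2() no longer exists in Python 3; subprocess.Popen is the replacement and takes a bufsize argument. The helper name read_command_output and the 10240 default are only illustrative, and whether buffering matters at all for this problem is an open question.

```python
import subprocess
import sys


def read_command_output(argv, bufsize=10240):
    """Run argv and return everything it writes to stdout as bytes."""
    # bufsize is handed to the file object wrapping the pipe; a positive
    # value asks for a buffer of roughly that many bytes.
    with subprocess.Popen(argv, stdout=subprocess.PIPE, bufsize=bufsize) as proc:
        data = proc.stdout.read()
    if proc.returncode != 0:
        raise RuntimeError("command failed with exit code %d" % proc.returncode)
    return data


if __name__ == "__main__":
    # Stand-in command; in this thread it would be the doclifter invocation.
    print(read_command_output([sys.executable, "-c", "print('hello')"]))
```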

Kent



Re: [Tutor] man pages parsing (still)

2006-09-11 Thread Tiago Saboga
On Monday, 11 September 2006 12:24, Kent Johnson wrote:
 Tiago Saboga wrote:
  On Monday, 11 September 2006 11:15, Kent Johnson wrote:
  Tiago Saboga wrote:
  How big is the XML? 25 seconds is a long time...I would look at
  cElementTree (implementation of ElementTree in C), it is pretty fast.
  http://effbot.org/zone/celementtree.htm
 
  It's about 10k. Hey, it seems easy, but I'd like not to start over again.
  Of course, if it's the only solution... 25 (28, in fact, for the cp man
  page) isn't really acceptable.

 That's tiny! No way it should take 25 seconds to parse a 10k file.

 Have you tried saving the file separately and parsing from disk? That
 would help determine if the interprocess pipe is the problem.

Just tried, and - incredible - it took even longer: 46s. But on the second run
it came back to 25s. I really don't understand what's going on. I did some
other tests, and I found that all the code before parser.parse(stout) runs
almost instantly; nearly all the running time is then spent somewhere between
this call and the first event, and the rest is almost instant again. Any ideas?

By the way, I've read the pages you indicated at effbot, but I don't see where 
to begin. Do you know of a gentler introduction to this module 
(cElementTree)? 

Thanks,

Tiago.


Re: [Tutor] man pages parsing (still)

2006-09-11 Thread Tiago Saboga
On Monday, 11 September 2006 12:59, Kent Johnson wrote:
 Tiago Saboga wrote:
  On Monday, 11 September 2006 12:24, Kent Johnson wrote:
  Tiago Saboga wrote:
  On Monday, 11 September 2006 11:15, Kent Johnson wrote:
  Tiago Saboga wrote:
  How big is the XML? 25 seconds is a long time...I would look at
  cElementTree (implementation of ElementTree in C), it is pretty fast.
  http://effbot.org/zone/celementtree.htm
 
  It's about 10k. Hey, it seems easy, but I'd like not to start over
  again. Of course, if it's the only solution... 25 (28, in fact, for the
  cp man page) isn't really acceptable.
 
  That's tiny! No way it should take 25 seconds to parse a 10k file.
 
  Have you tried saving the file separately and parsing from disk? That
  would help determine if the interprocess pipe is the problem.
 
  Just tried, and - incredible - it took even longer: 46s. But on the
  second run it came back to 25s. I really don't understand what's going
  on. I did some other tests, and I found that all the code before
  parser.parse(stout) runs almost instantly; nearly all the running time
  is then spent somewhere between this call and the first event, and the
  rest is almost instant again. Any ideas?

 What did you try, buffering or reading from a file? If parsing from a
 file takes 25 secs, I am amazed...

I read from a file, and before you ask, no, I'm not working on a 286 and
compiling my kernel at the same time... ;-)

In fact, I decided to strip down both my code and the XML file. I stripped
the code to almost nothing and still got a time of 23s. The same happened with
the XML file... until I cut out the second line, with the DTD [1]. And
surprise: I get a good time. So I put it all together again, with one caveat:
there's an error that was not raised previously:

Traceback (most recent call last):
  File "./liftopy.py", line 130, in ?
    parser.parse(stout)
  File "/usr/lib/python2.3/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.3/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.3/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.3/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: /home/tiago/Computador/python/opy/manraw/doclift/cp.1.xml.stripped:279:16: undefined entity

Ok, the guilty line (279) has a &copy; that was probably defined in the dtd,
but as it doesn't know what is the right dtd... But wait... How does python 
read the dtd? It fetches it from the net? I tried it (disconnected) and the 
answer is yes, it fetches it from the net. So that's the problem!

But how do I avoid it? I'll search. But if you can spare me some time, you'll 
make me a little happier. 

[1] - The line is as follows:
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
   "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">

Thanks!

Tiago.


Re: [Tutor] man pages parsing (still)

2006-09-11 Thread Kent Johnson
Tiago Saboga wrote:
 Ok, the guilty line (279) has a &copy; that was probably defined in the dtd,
 but as it doesn't know what is the right dtd... But wait... How does python
 read the dtd? It fetches it from the net? I tried it (disconnected) and the 
 answer is yes, it fetches it from the net. So that's the problem!
 
 But how do I avoid it? I'll search. But if you can spare me some time, you'll 
 make me a little happier. 
 
 [1] - The line is as follows:
 <!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
    "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">

I'm just guessing, but I think if you find the right combination of 
handlers and feature settings you can at least make it just pass through 
the external entities without looking up the DTDs.

Take a look at these pages for some hints:
http://www.cafeconleche.org/books/xmljava/chapters/ch07s02.html#d0e10350
http://www.cafeconleche.org/books/xmljava/chapters/ch06s11.html

They are talking about Java but the SAX interface is a cross-language 
standard so the names and semantics should be the same.

Kent



Re: [Tutor] man pages parsing (still)

2006-09-11 Thread Danny Yoo
 terribly slow. Doclifter itself takes around a second to parse the troff
 file, but my few lines of code take 25 seconds to parse the resultant 
 xml. I've pasted the code at http://pastebin.ca/166941 and I'd like to 
 hear from you how I could possibly optimize it.

Hi Tiago,

Before we go any further: have you run your program through the Python 
profiler yet?

Take a look at:

 http://docs.python.org/lib/profile.html

and see if that can help isolate the slow sections in your program.
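
A minimal sketch of such a profiling run, using cProfile, the C implementation of the profile module linked above. The build_text function is a made-up stand-in for the real parsing code, and profile_call is just an illustrative helper name.

```python
import cProfile
import io
import pstats


def build_text(n):
    """Hypothetical workload standing in for the slow parsing code."""
    parts = []
    for i in range(n):
        parts.append(str(i))
    return "".join(parts)


def profile_call(func, *args):
    """Run func(*args) under the profiler; return its result and a report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


if __name__ == "__main__":
    _result, report = profile_call(build_text, 100000)
    print(report)
```

The report lists call counts and times per function, which is usually enough to see whether the time goes into your own handler methods or into the parser itself.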



If I really had to guess, without profiling information, I'd take a very
close look at the characters() method: it's doing some string
concatenation there that may have very bad performance, depending on the
input.  See:

 http://mail.python.org/pipermail/tutor/2004-August/031568.html

and the thread around that time for details on why string concatenation
should be treated carefully.
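
For illustration only, and without claiming that this is the bottleneck here: the usual way to avoid repeated concatenation in a SAX handler is to append the chunks to a list and join them once when the element ends. The ParaCollector class, the collect_paras helper and the "para" element name are assumptions, not taken from the pasted code.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ParaCollector(ContentHandler):
    """Gather the text of each <para> element (element name is assumed)."""

    def __init__(self):
        super().__init__()
        self.paras = []
        self._chunks = None  # a list while inside a <para>, else None

    def startElement(self, name, attrs):
        if name == "para":
            self._chunks = []

    def characters(self, content):
        # The parser may deliver one text node in several pieces,
        # so collect them instead of doing text = text + content.
        if self._chunks is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "para" and self._chunks is not None:
            self.paras.append("".join(self._chunks))
            self._chunks = None


def collect_paras(data):
    """Parse XML bytes and return the text of every <para>."""
    handler = ParaCollector()
    xml.sax.parseString(data, handler)
    return handler.paras
```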


Re: [Tutor] man pages parsing (still)

2006-09-11 Thread Kent Johnson
Danny Yoo wrote:

 If I really had to guess, without profiling information, I'd take a very
 close look at the characters() method: it's doing some string
 concatenation there that may have very bad performance, depending on the
 input.  See:
 
  http://mail.python.org/pipermail/tutor/2004-August/031568.html
 
 and the thread around that time for details on why string concatenation
 should be treated carefully.

Gee, Danny, it's hard to disagree with you when you quote me in support 
of your argument, but...the characters() method is probably called only 
once or twice per tag, and the string is reinitialized for each tag. So 
this seems unlikely to be the culprit.

Of course, it helps that I have read to the end of the thread - the problem
seems to be accessing the external DTD ;-)

By the way that article of mine is obsoleted by Python 2.4, which 
optimizes string concatenation in a loop...
http://www.python.org/doc/2.4.3/whatsnew/node12.html#SECTION000121

Kent



Re: [Tutor] man pages parsing (still)

2006-09-11 Thread Danny Yoo


 Gee, Danny, it's hard to disagree with you when you quote me in support 
 of your argument, but...the characters() method is probably called only 
 once or twice per tag, and the string is reinitialized for each tag. So 
 this seems unlikely to be the culprit.

Ah, didn't see those; that'll teach me not to guess.