Re: [Tutor] man pages parsing (still)
On Monday 11 September 2006 19:45, Kent Johnson wrote:
> Tiago Saboga wrote:
> > Ok, the guilty line (279) has a &copy; entity, which was probably defined in the DTD, but as it doesn't know what the right DTD is... But wait... How does Python read the DTD? Does it fetch it from the net? I tried it (disconnected) and the answer is yes, it fetches it from the net. So that's the problem! But how do I avoid it? I'll search. But if you can spare me some time, you'll make me a little happier.
> >
> > [1] - The line is as follows:
> > <!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
>
> I'm just guessing, but I think if you find the right combination of handlers and feature settings you can at least make it just pass through the external entities without looking up the DTDs.

I got it! I just set feature_external_ges to false and it doesn't fetch the DTD any more. Thanks!!! ;-)

> Take a look at these pages for some hints:
> http://www.cafeconleche.org/books/xmljava/chapters/ch07s02.html#d0e10350
> http://www.cafeconleche.org/books/xmljava/chapters/ch06s11.html

It looks very interesting, and it was exactly what I needed. But I couldn't grasp it at first; I need some more time to understand it all.

Thanks again!!!

Tiago.
___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
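For readers following along, here is a minimal sketch of the fix Tiago describes: turning off feature_external_ges on a SAX parser so the external DTD is never requested. It is written for Python 3 (the thread used Python 2.3), and the handler class and function name are hypothetical, made up for illustration. Recent Python 3 releases already default this feature to off, so the explicit call mostly documents the intent.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges


class ElementNames(ContentHandler):
    """Hypothetical handler: records the name of every element it sees."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_without_external_entities(xml_bytes):
    """Parse XML bytes without resolving external general entities."""
    parser = xml.sax.make_parser()
    # The setting named in the thread: do not load external entities,
    # which includes the external DTD named in the DOCTYPE line.
    parser.setFeature(feature_external_ges, False)
    handler = ElementNames()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.names
```

The feature has to be set before parse() is called; the parser refuses to change it while a parse is in progress.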
[Tutor] man pages parsing (still)
I'm still there, trying to parse man pages (I want to gather a list of all options with their help strings). I've tried to use regex on both the formatted output of man and the source troff files and I discovered what is already said in the doclifter man page: you have to rely on a number of hints, and it's really not simple. So I'm now using doclifter, and it's working, but it is terribly slow. Doclifter itself takes around a second to parse the troff file, but my few lines of code take 25 seconds to parse the resulting XML.

I've pasted the code at http://pastebin.ca/166941 and I'd like to hear from you how I could possibly optimize it.

Thanks,

Tiago.
Re: [Tutor] man pages parsing (still)
Tiago Saboga wrote:
> I'm still there, trying to parse man pages (I want to gather a list of all options with their help strings). I've tried to use regex on both the formatted output of man and the source troff files and I discovered what is already said in the doclifter man page: you have to rely on a number of hints, and it's really not simple. So I'm now using doclifter, and it's working, but it is terribly slow. Doclifter itself takes around a second to parse the troff file, but my few lines of code take 25 seconds to parse the resulting XML.
>
> I've pasted the code at http://pastebin.ca/166941 and I'd like to hear from you how I could possibly optimize it.

How big is the XML? 25 seconds is a long time... I would look at cElementTree (an implementation of ElementTree in C); it is pretty fast.
http://effbot.org/zone/celementtree.htm

In particular iterparse() might be helpful:
http://effbot.org/zone/element-iterparse.htm

I would also try specifying a buffer size in the call to os.popen2(); if the I/O is unbuffered or the buffer is small, that might be the bottleneck.

Kent
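A rough sketch of what the iterparse() suggestion could look like. It uses xml.etree.ElementTree, the standard-library descendant of cElementTree in Python 3. The element names (varlistentry, term, listitem) are an assumption about how doclifter marks up option lists in DocBook; the thread does not show the actual XML, so adjust them to the real output.

```python
import io
import xml.etree.ElementTree as ET


def iter_options(source):
    """Yield (option, help_text) pairs from a file name or file object.

    Assumes (hypothetically) that each option is a DocBook
    <varlistentry> holding a <term> and a <listitem>.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "varlistentry":
            continue
        term = elem.find("term")
        item = elem.find("listitem")
        if term is not None and item is not None:
            option = "".join(term.itertext()).strip()
            # Collapse runs of whitespace in the help text.
            help_text = " ".join("".join(item.itertext()).split())
            yield option, help_text
        # Free the finished subtree to keep memory use flat.
        elem.clear()
```

Because only "end" events are requested, each varlistentry is complete by the time it is inspected, so its text can be read in one go.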
Re: [Tutor] man pages parsing (still)
On Monday 11 September 2006 11:15, Kent Johnson wrote:
> Tiago Saboga wrote:
> > I'm still there, trying to parse man pages (I want to gather a list of all options with their help strings). I've tried to use regex on both the formatted output of man and the source troff files and I discovered what is already said in the doclifter man page: you have to rely on a number of hints, and it's really not simple. So I'm now using doclifter, and it's working, but it is terribly slow. Doclifter itself takes around a second to parse the troff file, but my few lines of code take 25 seconds to parse the resulting XML.
> >
> > I've pasted the code at http://pastebin.ca/166941 and I'd like to hear from you how I could possibly optimize it.
>
> How big is the XML? 25 seconds is a long time... I would look at cElementTree (an implementation of ElementTree in C); it is pretty fast.
> http://effbot.org/zone/celementtree.htm

It's about 10k. Hey, it seems easy, but I'd like not to start over again. Of course, if it's the only solution... 25 (28, in fact, for the cp man page) isn't really acceptable.

> In particular iterparse() might be helpful:
> http://effbot.org/zone/element-iterparse.htm

Ok, I'll look at that.

> I would also try specifying a buffer size in the call to os.popen2(); if the I/O is unbuffered or the buffer is small, that might be the bottleneck.

What's appropriate in that case? I really don't understand how I should determine a buffer size. Any pointers?

Thanks,

Tiago.
Re: [Tutor] man pages parsing (still)
Tiago Saboga wrote:
> On Monday 11 September 2006 11:15, Kent Johnson wrote:
> > How big is the XML? 25 seconds is a long time... I would look at cElementTree (an implementation of ElementTree in C); it is pretty fast.
> > http://effbot.org/zone/celementtree.htm
>
> It's about 10k. Hey, it seems easy, but I'd like not to start over again. Of course, if it's the only solution... 25 (28, in fact, for the cp man page) isn't really acceptable.

That's tiny! No way it should take 25 seconds to parse a 10k file. Have you tried saving the file separately and parsing from disk? That would help determine if the interprocess pipe is the problem.

> > I would also try specifying a buffer size in the call to os.popen2(); if the I/O is unbuffered or the buffer is small, that might be the bottleneck.
>
> What's appropriate in that case? I really don't understand how I should determine a buffer size. Any pointers?

To tell the truth I don't use popen myself, so if anyone else wants to chime in that would be fine... but I would try maybe 1024 or 10240 (10k).

Kent
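One way to separate the two suspects Kent names (the pipe and the parse) is to capture the child process's output first and time the parse on its own. The sketch below is for Python 3, where os.popen2() no longer exists and subprocess.Popen takes the same kind of bufsize argument. DEMO_CMD is only a stand-in for the real doclifter invocation, which the thread does not show, and the function names are hypothetical.

```python
import subprocess
import sys
import time
import xml.sax
from xml.sax.handler import ContentHandler

# Stand-in command that prints a tiny XML document; replace it with the
# real doclifter command line (not shown in the thread).
DEMO_CMD = [sys.executable, "-c", "print('<a><b/></a>')"]


def run_and_capture(cmd, bufsize=10240):
    """Run cmd and return everything it writes to stdout, as bytes.

    bufsize=10240 is the 10k value suggested in the thread.
    """
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, bufsize=bufsize) as proc:
        return proc.stdout.read()


def timed_parse(xml_bytes):
    """Return the seconds spent parsing xml_bytes with a do-nothing handler."""
    start = time.perf_counter()
    xml.sax.parseString(xml_bytes, ContentHandler())
    return time.perf_counter() - start
```

If timed_parse() alone is slow on data that is already in memory, the pipe and its buffer size are not the bottleneck.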
Re: [Tutor] man pages parsing (still)
On Monday 11 September 2006 12:24, Kent Johnson wrote:
> Tiago Saboga wrote:
> > On Monday 11 September 2006 11:15, Kent Johnson wrote:
> > > How big is the XML? 25 seconds is a long time... I would look at cElementTree (an implementation of ElementTree in C); it is pretty fast.
> > > http://effbot.org/zone/celementtree.htm
> >
> > It's about 10k. Hey, it seems easy, but I'd like not to start over again. Of course, if it's the only solution... 25 (28, in fact, for the cp man page) isn't really acceptable.
>
> That's tiny! No way it should take 25 seconds to parse a 10k file. Have you tried saving the file separately and parsing from disk? That would help determine if the interprocess pipe is the problem.

Just tried, and - incredible - it took even longer: 46s. But on the second run it came back to 25s. I really don't understand what's going on. I did some other tests, and I found that all the code before parser.parse(stout) runs almost instantly; all of the running time is then spent somewhere between this call and the first event; and the rest is almost instant again. Any ideas?

By the way, I've read the pages you pointed to at effbot, but I don't see where to begin. Do you know of a gentler introduction to this module (cElementTree)?

Thanks,

Tiago.
Re: [Tutor] man pages parsing (still)
On Monday 11 September 2006 12:59, Kent Johnson wrote:
> Tiago Saboga wrote:
> > On Monday 11 September 2006 12:24, Kent Johnson wrote:
> > > Tiago Saboga wrote:
> > > > On Monday 11 September 2006 11:15, Kent Johnson wrote:
> > > > > How big is the XML? 25 seconds is a long time... I would look at cElementTree (an implementation of ElementTree in C); it is pretty fast.
> > > > > http://effbot.org/zone/celementtree.htm
> > > >
> > > > It's about 10k. Hey, it seems easy, but I'd like not to start over again. Of course, if it's the only solution... 25 (28, in fact, for the cp man page) isn't really acceptable.
> > >
> > > That's tiny! No way it should take 25 seconds to parse a 10k file. Have you tried saving the file separately and parsing from disk? That would help determine if the interprocess pipe is the problem.
> >
> > Just tried, and - incredible - it took even longer: 46s. But on the second run it came back to 25s. I really don't understand what's going on. I did some other tests, and I found that all the code before parser.parse(stout) runs almost instantly; all of the running time is then spent somewhere between this call and the first event; and the rest is almost instant again. Any ideas?
>
> What did you try, buffering or reading from a file? If parsing from a file takes 25 secs, I am amazed...

I read from a file, and before you ask, no, I'm not working on a 286 and compiling my kernel at the same time... ;-)

In fact, I decided to strip down both my code and the XML file. I stripped the code to almost nothing and still got a time of 23s. And the same with the XML file... until I cut out the second line, with the DTD [1]. And surprise: I get a nice time. So I put it all together again, but with the following caveat: there's an error that was not raised previously:

Traceback (most recent call last):
  File "./liftopy.py", line 130, in ?
    parser.parse(stout)
  File "/usr/lib/python2.3/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.3/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.3/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.3/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: /home/tiago/Computador/python/opy/manraw/doclift/cp.1.xml.stripped:279:16: undefined entity

Ok, the guilty line (279) has a &copy; entity, which was probably defined in the DTD, but as it doesn't know what the right DTD is... But wait... How does Python read the DTD? Does it fetch it from the net? I tried it (disconnected) and the answer is yes, it fetches it from the net. So that's the problem! But how do I avoid it? I'll search. But if you can spare me some time, you'll make me a little happier.

[1] - The line is as follows:
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">

Thanks!

Tiago.
Re: [Tutor] man pages parsing (still)
Tiago Saboga wrote:
> Ok, the guilty line (279) has a &copy; entity, which was probably defined in the DTD, but as it doesn't know what the right DTD is... But wait... How does Python read the DTD? Does it fetch it from the net? I tried it (disconnected) and the answer is yes, it fetches it from the net. So that's the problem! But how do I avoid it? I'll search. But if you can spare me some time, you'll make me a little happier.
>
> [1] - The line is as follows:
> <!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">

I'm just guessing, but I think if you find the right combination of handlers and feature settings you can at least make it just pass through the external entities without looking up the DTDs.

Take a look at these pages for some hints:
http://www.cafeconleche.org/books/xmljava/chapters/ch07s02.html#d0e10350
http://www.cafeconleche.org/books/xmljava/chapters/ch06s11.html

They are talking about Java, but the SAX interface is a cross-language standard, so the names and semantics should be the same.

Kent
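The "handlers" half of Kent's guess can be sketched with a SAX EntityResolver that answers every external lookup with an empty stream, so the parser never goes to the network. This is an illustration only, written for Python 3; the class and function names are hypothetical, and it is an alternative to switching off feature_external_ges rather than what was actually done in the thread.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource


class EmptyDTDResolver(EntityResolver):
    """Hypothetical resolver: hands back an empty document for any system ID."""

    def __init__(self):
        self.requested = []

    def resolveEntity(self, publicId, systemId):
        self.requested.append(systemId)
        source = InputSource(systemId)
        # An empty byte stream stands in for the DTD; nothing is downloaded.
        source.setByteStream(io.BytesIO(b""))
        return source


class ElementNames(ContentHandler):
    """Hypothetical handler: records element names in document order."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_with_resolver(xml_bytes):
    """Parse with external entities enabled but resolved locally to nothing."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, True)
    resolver = EmptyDTDResolver()
    parser.setEntityResolver(resolver)
    handler = ElementNames()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.names, resolver.requested
```

A resolver like this could also return a local copy of the DocBook DTD instead of an empty stream, which would keep entity definitions available without any network access.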
Re: [Tutor] man pages parsing (still)
> terribly slow. Doclifter itself takes around a second to parse the troff file, but my few lines of code take 25 seconds to parse the resulting XML.
>
> I've pasted the code at http://pastebin.ca/166941 and I'd like to hear from you how I could possibly optimize it.

Hi Tiago,

Before we go any further: have you run your program through the Python profiler yet? Take a look at:
http://docs.python.org/lib/profile.html
and see if that can help isolate the slow sections in your program.

If I really had to guess, without profiling information, I'd take a very close look at the characters() method: it's doing some string concatenation there that may have very bad performance, depending on the input. See:
http://mail.python.org/pipermail/tutor/2004-August/031568.html
and the thread around that time for details on why string concatenation should be treated carefully.
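A small sketch of the profiling step, for anyone who wants to try it. It uses cProfile and pstats from the standard library (cProfile is the C-accelerated sibling of the profile module linked above); profile_call is a hypothetical helper name, not part of either module.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under the profiler; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the slowest call chains come first,
    # and limit the report to the top 10 entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, out.getvalue()
```

Wrapping the single parser.parse(...) call this way would show whether the time goes to the handler methods or to something underneath them, such as a network fetch.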
Re: [Tutor] man pages parsing (still)
Danny Yoo wrote:
> If I really had to guess, without profiling information, I'd take a very close look at the characters() method: it's doing some string concatenation there that may have very bad performance, depending on the input. See:
> http://mail.python.org/pipermail/tutor/2004-August/031568.html
> and the thread around that time for details on why string concatenation should be treated carefully.

Gee, Danny, it's hard to disagree with you when you quote me in support of your argument, but... the characters() method is probably called only once or twice per tag, and the string is reinitialized for each tag. So this seems unlikely to be the culprit.

Of course it helps that I have read to the end of the thread - the problem seems to be accessing the external DTD ;-)

By the way, that article of mine is obsoleted by Python 2.4, which optimizes string concatenation in a loop...
http://www.python.org/doc/2.4.3/whatsnew/node12.html#SECTION000121

Kent
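For reference, the pattern Danny's concern usually leads to is collecting the pieces passed to characters() in a list and joining them once per element. The handler below is a hypothetical Python 3 sketch, not Tiago's pastebin code (which is not reproduced in the thread), and as Kent notes it is unlikely to matter for a 10k file.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextHandler(ContentHandler):
    """Hypothetical handler: gathers the text directly inside each element."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self.texts = []  # (element name, text) pairs in closing order

    def startElement(self, name, attrs):
        # Start a fresh buffer for every tag, as Kent describes.
        self._chunks = []

    def characters(self, content):
        # Appending to a list avoids building a new string on each call.
        self._chunks.append(content)

    def endElement(self, name):
        self.texts.append((name, "".join(self._chunks)))
        self._chunks = []


def collect_texts(xml_bytes):
    """Parse xml_bytes and return the (name, text) pairs."""
    handler = TextHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.texts
```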
Re: [Tutor] man pages parsing (still)
> Gee, Danny, it's hard to disagree with you when you quote me in support of your argument, but... the characters() method is probably called only once or twice per tag, and the string is reinitialized for each tag. So this seems unlikely to be the culprit.

Ah, didn't see those; that'll teach me not to guess.