On 07/19/2011 04:29 PM, Andres Riancho wrote:
> Javier,
>
> On Tue, Jul 19, 2011 at 12:18 PM, Javier Andalia
> <javier_anda...@rapid7.com> wrote:
>> On 07/19/2011 02:54 PM, Andres Riancho wrote:
>>> Javier,
>>>
>>> On Tue, Jul 19, 2011 at 10:21 AM, Javier Andalia
>>> <javier_anda...@rapid7.com> wrote:
>>>> List,
>>>>
>>>> This is my attempt to improve the performance of the xpath evaluation
>>>> given a DOM Element.
>>>> The original (and current) version is in httpResponse.py. Examples of how
>>>> this is used can be found at:
>>>> ajax.py, fileUpload.py, formAutocomplete.py, etc
>>>>
>>>> def getDOM2(self):
>>>>     '''
>>>>     TODO: Put docstring here
>>>>     '''
>>>>     class DOM(object):
>>>>         def xpath(self, tag, xpathpredicate='.'):
>>>>             xpath = etree.XPath(xpathpredicate)
>>>>             root = etree.fromstring(self.body,
>>>>                                     etree.HTMLParser(recover=True))
>>>>             context = etree.iterwalk(root, events=('start',), tag=tag)
>>>>             try:
>>>>                 for evt, elem in context:
>>>>                     if xpath(elem):
>>>>                         yield elem
>>>>                     while elem.getprevious() is not None:
>>>>                         del elem.getparent()[0]
>>>>             except etree.XPathSyntaxError:
>>>>                 om.out.debug('Invalid XPath expression: "%s"' %
>>>>                              xpathpredicate)
>>>>                 raise
>>>>             del context
>>> Are you sure that this is equivalent to the old implementation?
>> What do you mean? It is certainly a little more complex but still
>> equivalent.
> Sorry for not being clear enough! My question was: is your
> implementation going to return the same result as the old
> implementation for ALL inputs?
>
Pretty sure! Note that there is a slight variation in the way the 'xpath'
method is called in the experimental implementation, though. Typical lines
such as:

dom.xpath("//input[translate(@type,'PASWORD','pasword')='password']")

were converted to:

dom.xpath(tag='input',
          xpathpredicate="translate(@type,'PASWORD','pasword')='password'")

>>> I'm guessing that the old implementation is faster because it's C
>>> with a Python wrapper and this is "python calling many times different
>>> C functions" ? Have you tested [0] to see WHERE the CPU is consumed?
>>>
>>> [0] http://code.google.com/p/jrfonseca/wiki/Gprof2Dot
>> That makes sense. Additionally, I think it is slower because the xpath
>> evaluation occurs *only once* in the original implementation. I definitely
>> misunderstood what was explained in section "Finding elements quickly" of
>> [1], where they focus on the use of 'find' and 'findall' vs more efficient
>> alternatives. Our code uses simple, direct xpath evaluation. It seems
>> that nothing can be faster than that.
>>
>> Javier
>>
>> [1] http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
>>
>>>>     dom = DOM()
>>>>     dom.body = self.body
>>>>     return dom
>>>>
>>>> Unfortunately this didn't work out as expected. It is slower.
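To make the difference between the two call styles concrete, here is a
minimal sketch. It is not the w3af code itself. It assumes lxml is installed,
and the sample HTML and the helper names one_shot and per_element are made up
for illustration. The first helper evaluates the whole expression in a single
call. The second compiles the predicate once and then calls it once per
element found by iterwalk, as in the quoted getDOM2, but without the
sibling-deletion step.

```python
from lxml import etree

# Hypothetical sample document, not the attachment from the thread.
HTML = """
<html><body>
  <form>
    <input type="text" name="user">
    <input type="PASSWORD" name="pass1">
    <input type="password" name="pass2">
  </form>
</body></html>
"""

PREDICATE = "translate(@type,'PASWORD','pasword')='password'"


def parse(body):
    # Same lenient HTML parsing as in the quoted getDOM2 code.
    return etree.fromstring(body, etree.HTMLParser(recover=True))


def one_shot(body, tag, predicate):
    # Original call style: one XPath evaluation over the whole tree.
    root = parse(body)
    return root.xpath("//%s[%s]" % (tag, predicate))


def per_element(body, tag, predicate):
    # Experimental call style: compile the predicate once, then evaluate
    # it separately for every element with the requested tag.
    check = etree.XPath(predicate)
    root = parse(body)
    matches = []
    for _event, elem in etree.iterwalk(root, events=("start",), tag=tag):
        if check(elem):
            matches.append(elem)
    return matches


old = [e.get("name") for e in one_shot(HTML, "input", PREDICATE)]
new = [e.get("name") for e in per_element(HTML, "input", PREDICATE)]
print(old, new, old == new)
```

On this sample both helpers select the same elements. The per-element
version makes one Python-level call into the compiled XPath for every
candidate element, which is a plausible reason for the slowdown measured
below, though only profiling would confirm it.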
>>>>
>>>> >>> code = '''
>>>> f = open("index-form-two-fields.html")
>>>> html = f.read()
>>>> f.close()
>>>> u = url_object('http://w3af.com')
>>>> res = core.data.url.httpResponse.httpResponse(200, html, {'content-type':
>>>>     'text/html'}, u, u)
>>>> for i in res.getDOM2().xpath('input',
>>>>         "translate(@type,'PASWORD','pasword')='password'"):
>>>>     pass
>>>> '''
>>>> >>> setup = '''import sys
>>>> sys.path.append('/home/jandalia/workspace/w3af.unicode');
>>>> from core.data.parsers.urlParser import url_object;
>>>> import core.data.url.httpResponse
>>>> '''
>>>> >>> t = timeit.Timer(code, setup)
>>>> >>> min(t.repeat(repeat=3, number=10000))
>>>> 27.584304094314575
>>>>
>>>> Using the original version:
>>>>
>>>> >>> code = '''
>>>> f = open("/home/jandalia/Desktop/index-form-two-fields.html")
>>>> html = f.read()
>>>> f.close()
>>>> u = url_object('http://w3af.com')
>>>> res = core.data.url.httpResponse.httpResponse(200, html, {'content-type':
>>>>     'text/html'}, u, u)
>>>> dom = res.getDOM()
>>>> for i in dom.xpath(
>>>>         "//input[translate(@type,'PASWORD','pasword')='password']"):
>>>>     pass
>>>> '''
>>>> >>> t = timeit.Timer(code, setup)
>>>> >>> min(t.repeat(repeat=3, number=10000))
>>>> 3.8396580219268799
>>>>
>>>> In other words, it is about 7 times slower.
>>>> If anyone has an idea on how to improve this code it would be very
>>>> appreciated. The html doc used for the tests is attached.
>>>>
>>>> Thanks!
>>>>
>>>> Javier
>>>>
>>>> Note: Some useful info can be found here:
>>>> http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Magic Quadrant for Content-Aware Data Loss Prevention
>>>> Research study explores the data loss prevention market.
>>>> Includes in-depth
>>>> analysis on the changes within the DLP market, and the criteria used to
>>>> evaluate the strengths and weaknesses of these DLP solutions.
>>>> http://www.accelacomm.com/jaw/sfnl/114/51385063/
>>>> _______________________________________________
>>>> W3af-develop mailing list
>>>> W3af-develop@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/w3af-develop