Javier,

On Tue, Jul 19, 2011 at 12:18 PM, Javier Andalia
<javier_anda...@rapid7.com> wrote:
> On 07/19/2011 02:54 PM, Andres Riancho wrote:
>>
>> Javier,
>>
>> On Tue, Jul 19, 2011 at 10:21 AM, Javier Andalia
>> <javier_anda...@rapid7.com> wrote:
>>>
>>> List,
>>>
>>> This is my attempt to improve the performance of the xpath evaluation
>>> given a DOM Element.
>>> The original (and current) version is in httpResponse.py. Examples of how
>>> this is used can be found at:
>>> ajax.py, fileUpload.py, formAutocomplete.py, etc
>>>
>>> def getDOM2(self):
>>>     '''
>>>     TODO: Put docstring here
>>>     '''
>>>     class DOM(object):
>>>         def xpath(self, tag, xpathpredicate='.'):
>>>             xpath = etree.XPath(xpathpredicate)
>>>             root = etree.fromstring(self.body,
>>>                                     etree.HTMLParser(recover=True))
>>>             context = etree.iterwalk(root, events=('start',), tag=tag)
>>>             try:
>>>                 for evt, elem in context:
>>>                     if xpath(elem):
>>>                         yield elem
>>>                     while elem.getprevious() is not None:
>>>                         del elem.getparent()[0]
>>>             except etree.XPathSyntaxError:
>>>                 om.out.debug('Invalid XPath expression: "%s"' %
>>>                              xpathpredicate)
>>>                 raise
>>>             del context
>>
>> Are you sure that this is equivalent to the old implementation?
>
> What do you mean? It is certainly a little more complex but still
> equivalent.

Sorry for not being clear enough! My question was: is your
implementation going to return the same result as the old
implementation for ALL inputs?

>> I'm guessing that the old implementation is faster because it's C
>> with a Python wrapper and this is "python calling many times different
>
> That makes sense. Additionally, I think it is slower because the xpath
> evaluation occurs *only once* in the original implementation. I definitely
> misunderstood what was explained in section "Finding elements quickly" of
> [1], where they focus on the use of 'find' and 'findall' vs more efficient
> alternatives. In our code we use simple and direct xpath evaluation. It
> seems that nothing can be faster than that.
>
> Javier
>
>
> [1] http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
>
>> C functions" ? Have you tested [0] to see WHERE the CPU is consumed?
>>
>> [0] http://code.google.com/p/jrfonseca/wiki/Gprof2Dot
>>
>>>     dom = DOM()
>>>     dom.body = self.body
>>>     return dom
>>>
>>> Unfortunately this didn't work out as expected. It is slower.
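
On the question of where the CPU is consumed, here is a minimal sketch that assumes Python 3 and only the standard library. It runs a function under cProfile and, if asked, dumps the raw stats in the pstats format that Gprof2Dot [0] can read. `workload`, `profile` and the dump file name are hypothetical; `workload` only stands in for the loop over `res.getDOM2().xpath(...)`, which is not reproduced here.

```python
import cProfile
import io
import pstats


def workload():
    # Hypothetical stand-in for iterating over res.getDOM2().xpath(...).
    total = 0
    for i in range(10000):
        total += len(str(i))
    return total


def profile(func, top=10, dump_path=None):
    """Run func under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func()
    profiler.disable()
    if dump_path is not None:
        # Raw stats file, e.g. for: gprof2dot -f pstats <dump_path> | dot -Tpng -o out.png
        profiler.dump_stats(dump_path)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


result, report = profile(workload)
print(report)
```

Passing something like `dump_path="getdom2.pstats"` (a made-up name) writes the file that the graph tool would consume; the text report alone is often enough to see whether the time goes to parsing, to the tree walk, or to the per-element XPath calls.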
>>>
>>> >>> code = '''
>>> f = open("index-form-two-fields.html")
>>> html = f.read()
>>> f.close()
>>> u = url_object('http://w3af.com')
>>> res = core.data.url.httpResponse.httpResponse(200, html,
>>>         {'content-type': 'text/html'}, u, u)
>>> for i in res.getDOM2().xpath('input',
>>>         "translate(@type,'PASWORD','pasword')='password'"):
>>>     pass
>>> '''
>>> >>> setup = '''import sys
>>> sys.path.append('/home/jandalia/workspace/w3af.unicode');
>>> from core.data.parsers.urlParser import url_object;
>>> import core.data.url.httpResponse
>>> '''
>>> >>> t = timeit.Timer(code, setup)
>>> >>> min(t.repeat(repeat=3, number=10000))
>>> 27.584304094314575
>>>
>>> Using the original version:
>>>
>>> >>> code = '''
>>> f = open("/home/jandalia/Desktop/index-form-two-fields.html")
>>> html = f.read()
>>> f.close()
>>> u = url_object('http://w3af.com')
>>> res = core.data.url.httpResponse.httpResponse(200, html,
>>>         {'content-type': 'text/html'}, u, u)
>>> dom = res.getDOM()
>>> for i in dom.xpath("//input[translate(@type,'PASWORD','pasword')='password']"):
>>>     pass
>>> '''
>>> >>> t = timeit.Timer(code, setup)
>>> >>> min(t.repeat(repeat=3, number=10000))
>>> 3.8396580219268799
>>>
>>> In other words, it is about 7 times slower.
>>> If anyone has an idea on how to improve this code it would be very
>>> appreciated. The html doc used for the tests is attached.
>>>
>>> Thanks!
>>>
>>> Javier
>>>
>>> Note: Some useful info can be found here:
>>> http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
>>>
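
The timings above compare speed only; they do not show that both versions return the same elements. Below is a minimal sketch of checking that first and timing afterwards. It assumes Python 3 and uses the standard library's xml.etree.ElementTree as a stand-in for lxml, so there is no translate() and the numbers will not be comparable to the ones quoted. `DOC`, `one_shot`, `per_element` and `best_time` are hypothetical names, and `DOC` is not the attached test page.

```python
import timeit
import xml.etree.ElementTree as ET

# Hypothetical sample document (the attached HTML file is not shown in the thread).
DOC = (
    "<html><body><form>"
    "<input type='text' name='user'/>"
    "<input type='password' name='pass'/>"
    "</form></body></html>"
)


def one_shot(root):
    # One path query, evaluated by the library in a single call.
    return root.findall(".//input[@type='password']")


def per_element(root):
    # Visit every <input> and test each one from Python.
    return [e for e in root.iter("input") if e.get("type") == "password"]


def best_time(fn, root, number=1000):
    # The tree is parsed once outside the timed call, so only the query is measured.
    return min(timeit.repeat(lambda: fn(root), repeat=3, number=number))


root = ET.fromstring(DOC)
# Both lists hold the same Element objects in document order if the versions agree.
same = one_shot(root) == per_element(root)
print("same result:", same)
for fn in (one_shot, per_element):
    print(fn.__name__, best_time(fn, root))
```

A single sample document cannot answer the "ALL inputs" question; running the same comparison over a larger set of real responses would give more confidence.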
>>> _______________________________________________
>>> W3af-develop mailing list
>>> W3af-develop@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/w3af-develop
>>>
>>
>

--
Andrés Riancho
Director of Web Security at Rapid7 LLC
Founder at Bonsai Information Security
Project Leader at w3af