On 07/19/2011 04:29 PM, Andres Riancho wrote:
> Javier,
>
> On Tue, Jul 19, 2011 at 12:18 PM, Javier Andalia
> <javier_anda...@rapid7.com>  wrote:
>> On 07/19/2011 02:54 PM, Andres Riancho wrote:
>>> Javier,
>>>
>>> On Tue, Jul 19, 2011 at 10:21 AM, Javier Andalia
>>> <javier_anda...@rapid7.com>    wrote:
>>>> List,
>>>>
>>>> This is my attempt to improve the performance of the xpath evaluation
>>>> given a DOM Element.
>>>> The original (and current) version is in httpResponse.py. Examples of how
>>>> this is used can be found in ajax.py, fileUpload.py,
>>>> formAutocomplete.py, etc.
>>>>
>>>>
>>>>     def getDOM2(self):
>>>>         '''
>>>>         TODO: Put docstring here
>>>>         '''
>>>>         class DOM(object):
>>>>             def xpath(self, tag, xpathpredicate='.'):
>>>>                 xpath = etree.XPath(xpathpredicate)
>>>>                 root = etree.fromstring(self.body,
>>>>                                         etree.HTMLParser(recover=True))
>>>>
>>>>                 context = etree.iterwalk(root, events=('start',), tag=tag)
>>>>                 try:
>>>>                     for evt, elem in context:
>>>>                         if xpath(elem):
>>>>                             yield elem
>>>>                         while elem.getprevious() is not None:
>>>>                             del elem.getparent()[0]
>>>>                 except etree.XPathSyntaxError:
>>>>                     om.out.debug('Invalid XPath expression: "%s"' %
>>>>                                  xpathpredicate)
>>>>                     raise
>>>>                 del context
>>>      Are you sure that this is equivalent to the old implementation?
>> What do you mean? It is certainly a little more complex but still
>> equivalent.
> Sorry for not being clear enough! My question was: is your
> implementation going to return the same result as the old
> implementation for ALL inputs?
>

Pretty sure! Note, though, that the 'xpath' method is called slightly
differently in the experimental implementation.

Typical lines such as:

dom.xpath("//input[translate(@type,'PASWORD','pasword')='password']")


were converted to:

dom.xpath(tag='input',
          xpathpredicate="translate(@type,'PASWORD','pasword')='password'")
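
One way to check the "same result for ALL inputs" question on a given document is to run both call styles side by side and compare what they return. The following is only a minimal sketch of that idea: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml (so it drops the translate() call, which ElementTree does not support), and the sample markup and the function names single_query and per_element are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (well-formed, because ElementTree is strict).
HTML = """<html><body><form>
<input type="text" name="user"/>
<input type="password" name="pass"/>
<input type="PASSWORD" name="pass2"/>
</form></body></html>"""


def single_query(root):
    # Old call style: one path expression evaluated over the whole tree.
    return root.findall(".//input[@type='password']")


def per_element(root, tag, attr, value):
    # New call style: visit every element with the given tag, test each one.
    return [el for el in root.iter(tag) if el.get(attr) == value]


root = ET.fromstring(HTML)
old = single_query(root)
new = per_element(root, "input", "type", "password")

# Both lists should hold the very same element objects, in document order.
print(old == new)
```

Running a comparison like this over a set of real responses would give more confidence than a single test page.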


>>>      I'm guessing that the old implementation is faster because it's C
>>> with a Python wrapper and this is "python calling many times different
>>> C functions"? Have you tested [0] to see WHERE the CPU is consumed?
>>>
>>> [0] http://code.google.com/p/jrfonseca/wiki/Gprof2Dot
>>
>> That makes sense. Additionally, I think it is slower because the xpath
>> evaluation occurs *only once* in the original implementation. I definitely
>> misunderstood what was explained in section "Finding elements quickly" of
>> [1] where they focus on the use of 'find' and 'findall' vs more efficient
>> alternatives. In our code we use simple and direct xpath evaluation. It
>> seems that nothing can be faster than that.
>>
>> Javier
>>
>>
>> [1] http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
>>
>>>
>>>>         dom = DOM()
>>>>         dom.body = self.body
>>>>         return dom
>>>>
>>>>
>>>>
>>>> Unfortunately this didn't work out as expected. It is slower.
>>>>
>>>> >>> import timeit
>>>> >>> code = '''
>>>> f = open("index-form-two-fields.html")
>>>> html = f.read()
>>>> f.close()
>>>> u = url_object('http://w3af.com')
>>>> res = core.data.url.httpResponse.httpResponse(200, html,
>>>>     {'content-type': 'text/html'}, u, u)
>>>> for i in res.getDOM2().xpath('input',
>>>>     "translate(@type,'PASWORD','pasword')='password'"):
>>>>     pass
>>>> '''
>>>> >>> setup = '''import sys
>>>> sys.path.append('/home/jandalia/workspace/w3af.unicode');
>>>> from core.data.parsers.urlParser import url_object;
>>>> import core.data.url.httpResponse
>>>> '''
>>>> >>> t = timeit.Timer(code, setup)
>>>> >>> min(t.repeat(repeat=3, number=10000))
>>>> 27.584304094314575
>>>>
>>>>
>>>> Using the original version:
>>>>
>>>> >>> code = '''
>>>> f = open("/home/jandalia/Desktop/index-form-two-fields.html")
>>>> html = f.read()
>>>> f.close()
>>>> u = url_object('http://w3af.com')
>>>> res = core.data.url.httpResponse.httpResponse(200, html,
>>>>     {'content-type': 'text/html'}, u, u)
>>>> dom = res.getDOM()
>>>> for i in dom.xpath("//input[translate(@type,'PASWORD','pasword')='password']"):
>>>>     pass
>>>> '''
>>>> >>> t = timeit.Timer(code, setup)
>>>> >>> min(t.repeat(repeat=3, number=10000))
>>>> 3.8396580219268799
>>>>
>>>>
>>>> In other words, it is about 7 times slower.
>>>> If anyone has an idea on how to improve this code it would be much
>>>> appreciated. The html doc used for the tests is attached.
>>>>
>>>> Thanks!
>>>>
>>>> Javier
>>>>
>>>> Note: Some useful info can be found here:
>>>> http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
>>>>
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Magic Quadrant for Content-Aware Data Loss Prevention
>>>> Research study explores the data loss prevention market. Includes
>>>> in-depth
>>>> analysis on the changes within the DLP market, and the criteria used to
>>>> evaluate the strengths and weaknesses of these DLP solutions.
>>>> http://www.accelacomm.com/jaw/sfnl/114/51385063/
>>>> _______________________________________________
>>>> W3af-develop mailing list
>>>> W3af-develop@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/w3af-develop
>>>>
>>>>
>>>
>>
>
>
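
On the question of WHERE the CPU is consumed: a minimal sketch of profiling both approaches with cProfile from the standard library, which could be a first step before rendering a call graph. It is an assumption-laden illustration, not the w3af code: xml.etree.ElementTree stands in for lxml, and the generated document and the names single_query, is_password, per_element and profile are all hypothetical.

```python
import cProfile
import io
import pstats
import xml.etree.ElementTree as ET

# Hypothetical document: 2000 inputs, every tenth one a password field.
DOC = "<form>" + "".join(
    '<input type="%s" name="f%d"/>' % ("password" if i % 10 == 0 else "text", i)
    for i in range(2000)
) + "</form>"
ROOT = ET.fromstring(DOC)


def single_query():
    # One path expression evaluated over the whole tree.
    return ROOT.findall(".//input[@type='password']")


def is_password(elem):
    # Stand-in for the per-element predicate evaluation.
    return elem.get("type") == "password"


def per_element():
    # One Python-level predicate call per matching tag.
    return [el for el in ROOT.iter("input") if is_password(el)]


def profile(func, repeat=50):
    # Run func repeatedly under the profiler and return the report as text,
    # sorted by cumulative time so the expensive call chains come first.
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(repeat):
        func()
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
    return out.getvalue()


report_single = profile(single_query)
report_loop = profile(per_element)
print(report_loop)
```

The call counts in the report for the per-element version show how many times the predicate is invoked from Python, which is the kind of overhead suspected above.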

