Javier,

On Tue, Jul 19, 2011 at 12:18 PM, Javier Andalia
<javier_anda...@rapid7.com> wrote:
> On 07/19/2011 02:54 PM, Andres Riancho wrote:
>>
>> Javier,
>>
>> On Tue, Jul 19, 2011 at 10:21 AM, Javier Andalia
>> <javier_anda...@rapid7.com>  wrote:
>>>
>>> List,
>>>
>>> This is my attempt to improve the performance of the xpath evaluation
>>> given
>>> a DOM Element.
>>> The original (and current) version is in httpResponse.py. Examples of how
>>> this is used can be found at:
>>> ajax.py, fileUpload.py, formAutocomplete.py, etc
>>>
>>>
>>>    def getDOM2(self):
>>>        '''
>>>        TODO: Put docstring here
>>>        '''
>>>        class DOM(object):
>>>            def xpath(self, tag, xpathpredicate='.'):
>>>                xpath = etree.XPath(xpathpredicate)
>>>                root = etree.fromstring(self.body,
>>>                                        etree.HTMLParser(recover=True))
>>>                context = etree.iterwalk(root, events=('start',), tag=tag)
>>>                try:
>>>                    for evt, elem in context:
>>>                        if xpath(elem):
>>>                            yield elem
>>>                        while elem.getprevious() is not None:
>>>                            del elem.getparent()[0]
>>>                except etree.XPathSyntaxError:
>>>                    om.out.debug('Invalid XPath expression: "%s"' %
>>>                                 xpathpredicate)
>>>                    raise
>>>                del context
>>
>>     Are you sure that this is equivalent to the old implementation?
>
> What do you mean? It is certainly a little more complex but still
> equivalent.

Sorry for not being clear enough! My question was: is your
implementation going to return the same result as the old
implementation for ALL inputs?
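
For what it's worth, one way to answer that question for a given input is to run both strategies over the same document and compare the result sets. A minimal sketch, using the stdlib ElementTree as a stand-in for lxml (the sample document and the lower-casing predicate here are made up for illustration, not the w3af code):

```python
import xml.etree.ElementTree as ET

HTML = """<html><body>
<input type="password" name="pwd"/>
<input type="text" name="user"/>
<input type="PASSWORD" name="pwd2"/>
</body></html>"""

root = ET.fromstring(HTML)

# "Old style": one query over the whole tree, filtering in a single pass.
old_result = [e.get('name') for e in root.iter('input')
              if (e.get('type') or '').lower() == 'password']

# "New style": walk element by element and apply the predicate per node,
# the way the generator in getDOM2 does.
new_result = []
for elem in root.iter():
    if elem.tag == 'input' and (elem.get('type') or '').lower() == 'password':
        new_result.append(elem.get('name'))

# Both strategies must agree for the two implementations to be equivalent.
assert old_result == new_result == ['pwd', 'pwd2']
```

A handful of spot checks like this (ideally over many stored responses) is cheap insurance before swapping implementations.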

>>     I'm guessing that the old implementation is faster because it's C
>> with a Python wrapper and this is "python calling many times different
>
> That makes sense. Additionally, I think it is slower because the XPath
> evaluation occurs *only once* in the original implementation. I definitely
> misunderstood what was explained in the "Finding elements quickly" section
> of [1], where they focus on the use of 'find' and 'findall' vs. more
> efficient alternatives. Our code uses simple and direct XPath evaluation;
> it seems nothing can be faster than that.
>
> Javier
>
>
> [1] http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
>
>> C functions" ? Have you tested [0] to see WHERE the CPU is consumed?
>>
>> [0] http://code.google.com/p/jrfonseca/wiki/Gprof2Dot
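
In case it helps: Gprof2Dot consumes the output of the stdlib cProfile module, so finding where the CPU goes starts with something like the sketch below (stdlib only; `parse_loop` is a made-up stand-in for the real hot path):

```python
import cProfile
import io
import pstats
import xml.etree.ElementTree as ET

DOC = '<root>' + '<input type="password"/>' * 200 + '</root>'

def parse_loop():
    # Stand-in for the hot path: re-parse and walk the tree many times.
    for _ in range(50):
        root = ET.fromstring(DOC)
        hits = [e for e in root.iter('input') if e.get('type') == 'password']
    return len(hits)

profiler = cProfile.Profile()
profiler.enable()
parse_loop()
profiler.disable()

# To feed Gprof2Dot, dump the stats to disk and then run:
#   gprof2dot -f pstats profile.out | dot -Tpng -o profile.png
# profiler.dump_stats('profile.out')

# Or inspect the top offenders directly, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
report = stream.getvalue()
```

The cumulative-time column usually makes it obvious whether the time goes into parsing, the per-element predicate calls, or somewhere unexpected.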
>>
>>>        dom = DOM()
>>>        dom.body = self.body
>>>        return dom
>>>
>>>
>>>
>>> Unfortunately this didn't work out as expected. It is slower.
>>>
>>>>>>  code = '''
>>> f = open("index-form-two-fields.html")
>>> html = f.read()
>>> f.close()
>>> u = url_object('http://w3af.com')
>>> res = core.data.url.httpResponse.httpResponse(200, html,
>>>     {'content-type': 'text/html'}, u, u)
>>> for i in res.getDOM2().xpath('input',
>>>         "translate(@type,'PASWORD','pasword')='password'"):
>>>     pass
>>> '''
>>>>>>  setup = '''import sys
>>> sys.path.append('/home/jandalia/workspace/w3af.unicode');
>>> from core.data.parsers.urlParser import url_object;
>>> import core.data.url.httpResponse
>>> '''
>>>>>>  t = timeit.Timer(code, setup)
>>>>>>  min(t.repeat(repeat=3, number=10000))
>>> 27.584304094314575
>>>
>>>
>>> Using the original version:
>>>
>>>>>>  code = '''
>>> f = open("/home/jandalia/Desktop/index-form-two-fields.html")
>>> html = f.read()
>>> f.close()
>>> u = url_object('http://w3af.com')
>>> res = core.data.url.httpResponse.httpResponse(200, html,
>>>     {'content-type': 'text/html'}, u, u)
>>> dom = res.getDOM()
>>> for i in dom.xpath("//input[translate(@type,'PASWORD','pasword')='password']"):
>>>     pass
>>> '''
>>>>>>  t = timeit.Timer(code, setup)
>>>>>>  min(t.repeat(repeat=3, number=10000))
>>> 3.8396580219268799
>>>
>>>
>>> In other words, it is about 7 times slower.
>>> If anyone has an idea on how to improve this code it would be very much
>>> appreciated. The HTML doc used for the tests is attached.
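
The 7x figure is consistent with per-element call overhead: the original does one evaluation inside libxml2, while the generator crosses the Python/C boundary once per element. A rough stdlib illustration of the same effect using the re module (an analogy only, not the w3af code; absolute numbers are machine-dependent):

```python
import timeit

setup = """
import re
data = ['password'] * 1000
pattern = re.compile('password')
"""

# One call into C that scans everything at once.
one_call = timeit.timeit("pattern.findall(','.join(data))",
                         setup=setup, number=200)

# One C call per item, driven by a Python-level loop.
per_item = timeit.timeit("[pattern.match(x) for x in data]",
                         setup=setup, number=200)

print('one call: %.4fs, per item: %.4fs' % (one_call, per_item))
```

The per-item variant typically loses because each iteration pays Python bytecode dispatch plus a C call setup, which is the same tax the iterwalk-plus-predicate loop pays per element.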
>>>
>>> Thanks!
>>>
>>> Javier
>>>
>>> Note: Some useful info can be found here:
>>> http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
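
For reference, the core trick in that article is the stream-and-free pattern that getDOM2 imitates: handle each element as it is parsed, then clear it so memory stays flat. A stdlib sketch of the same idea (lxml's iterparse/iterwalk additionally lets you delete preceding siblings, as in the snippet above):

```python
import io
import xml.etree.ElementTree as ET

DOC = b'<root>' + b'<input type="password"/>' * 100 + b'</root>'

matches = 0
# iterparse streams events instead of building the whole tree up front.
for event, elem in ET.iterparse(io.BytesIO(DOC), events=('end',)):
    if elem.tag == 'input' and elem.get('type') == 'password':
        matches += 1
    # Drop the element's children so memory use stays bounded.
    elem.clear()

print(matches)  # prints 100
```

The pattern helps with memory on large documents, but as the timings above show, it does not by itself make the per-element predicate evaluation cheaper.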
>>>
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Magic Quadrant for Content-Aware Data Loss Prevention
>>> Research study explores the data loss prevention market. Includes
>>> in-depth
>>> analysis on the changes within the DLP market, and the criteria used to
>>> evaluate the strengths and weaknesses of these DLP solutions.
>>> http://www.accelacomm.com/jaw/sfnl/114/51385063/
>>> _______________________________________________
>>> W3af-develop mailing list
>>> W3af-develop@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/w3af-develop



-- 
Andrés Riancho
Director of Web Security at Rapid7 LLC
Founder at Bonsai Information Security
Project Leader at w3af

