On 07/19/2011 02:54 PM, Andres Riancho wrote:
> Javier,
>
> On Tue, Jul 19, 2011 at 10:21 AM, Javier Andalia
> <javier_anda...@rapid7.com>  wrote:
>> List,
>>
>> This is my attempt to improve the performance of the xpath evaluation given
>> a DOM Element.
>> The original (and current) version is in httpResponse.py. Examples of how
>> this is used can be found at:
>> ajax.py, fileUpload.py, formAutocomplete.py, etc
>>
>>
>>     def getDOM2(self):
>>         '''
>>         TODO: Put docstring here
>>         '''
>>         class DOM(object):
>>
>>             def xpath(self, tag, xpathpredicate='.'):
>>                 # etree.XPath compiles the expression eagerly, so a bad
>>                 # predicate fails here, not during the tree walk below.
>>                 try:
>>                     xpath = etree.XPath(xpathpredicate)
>>                 except etree.XPathSyntaxError:
>>                     om.out.debug('Invalid XPath expression: "%s"' %
>>                                  xpathpredicate)
>>                     raise
>>
>>                 root = etree.fromstring(self.body,
>>                                         etree.HTMLParser(recover=True))
>>                 # Visit every <tag> element and test the predicate on it.
>>                 context = etree.iterwalk(root, events=('start',), tag=tag)
>>                 for evt, elem in context:
>>                     if xpath(elem):
>>                         yield elem
>>                     # Drop already-visited siblings to keep memory usage low.
>>                     while elem.getprevious() is not None:
>>                         del elem.getparent()[0]
>>                 del context
>      Are you sure that this is equivalent to the old implementation?

What do you mean? It is certainly a little more complex but still 
equivalent.

>      I'm guessing that the old implementation is faster because it's C
> with a Python wrapper and this is "python calling many times different

That makes sense. Additionally, I think it is slower because in the original 
implementation the xpath evaluation occurs *only once*. I definitely 
misunderstood what is explained in the "Finding elements quickly" section 
of [1], which focuses on the use of 'find' and 'findall' vs. more efficient 
alternatives. In our code we already use simple, direct xpath evaluation; 
it seems that nothing can be faster than that.
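
Just to illustrate the kind of direct evaluation I mean (a rough sketch, not 
the actual w3af code; the helper name is made up): compiling the expression 
once with etree.XPath and running it against the parsed tree keeps the whole 
evaluation inside libxml2's C code, instead of crossing the Python/C boundary 
for every element the way iterwalk does:

    from lxml import etree

    # Compile the expression once; it can be re-used for every response.
    PASSWORD_INPUTS = etree.XPath(
        "//input[translate(@type,'PASWORD','pasword')='password']")

    def find_password_inputs(html):
        # One parse and a single C-level xpath evaluation over the whole tree.
        root = etree.fromstring(html, etree.HTMLParser(recover=True))
        return PASSWORD_INPUTS(root)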

Javier


[1] http://www.ibm.com/developerworks/xml/library/x-hiperfparse/

> C functions" ? Have you tested [0] to see WHERE the CPU is consumed?
>
> [0] http://code.google.com/p/jrfonseca/wiki/Gprof2Dot
>
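Regarding [0]: a profiling run along these lines should show where the CPU 
time goes (just a sketch; the file names are placeholders, and the .pstats 
output is what gprof2dot's "-f pstats" mode expects):

    import cProfile
    import pstats
    from lxml import etree

    def run():
        html = open('index-form-two-fields.html').read()
        for _ in xrange(1000):
            root = etree.fromstring(html, etree.HTMLParser(recover=True))
            root.xpath("//input[translate(@type,'PASWORD','pasword')='password']")

    cProfile.run('run()', 'xpath_test.pstats')
    pstats.Stats('xpath_test.pstats').sort_stats('cumulative').print_stats(20)
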
>>         dom = DOM()
>>         dom.body = self.body
>>         return dom
>>
>>
>>
>> Unfortunately this didn't work out as expected. It is slower.
>>
>>>>>   code = '''
>> f = open("index-form-two-fields.html")
>> html = f.read()
>> f.close()
>> u = url_object('http://w3af.com')
>> res = core.data.url.httpResponse.httpResponse(200, html, {'content-type':
>> 'text/html'}, u, u)
>> for i in res.getDOM2().xpath('input',
>> "translate(@type,'PASWORD','pasword')='password'"):
>>     pass
>> '''
>>>>>   setup = '''import sys
>> sys.path.append('/home/jandalia/workspace/w3af.unicode');
>> from core.data.parsers.urlParser import url_object;
>> import core.data.url.httpResponse
>> '''
>>>>>   t = timeit.Timer(code, setup)
>>>>>   min(t.repeat(repeat=3, number=10000))
>> 27.584304094314575
>>
>>
>> Using the original version:
>>
>>>>>   code = '''
>> f = open("/home/jandalia/Desktop/index-form-two-fields.html")
>> html = f.read()
>> f.close()
>> u = url_object('http://w3af.com')
>> res = core.data.url.httpResponse.httpResponse(200, html, {'content-type':
>> 'text/html'}, u, u)
>> dom = res.getDOM()
>> for i in dom.xpath("//input[translate(@type,'PASWORD','pasword')='password']"):
>>     pass
>> '''
>>>>>   t = timeit.Timer(code, setup)
>>>>>   min(t.repeat(repeat=3, number=10000))
>> 3.8396580219268799
>>
>>
>> In other words, it is about 7 times slower.
>> If anyone has an idea on how to improve this code it would be very much
>> appreciated. The HTML doc used for the tests is attached.
>>
>> Thanks!
>>
>> Javier
>>
>> Note: Some useful info can be found here:
>> http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
>>
>>
>>
>
>

