Just out of curiosity...
I'm guessing that there are two levels of caching that could be used.
One would be to hold a global dictionary of some sort of pattern -> 
Regex that will save the lengthy parsing of the re.

The other would be to actually call Regex.CompileToAssembly to get a 
more efficient reperesentation (in runtime that is) of the re.

Would you (the IP team) favor the former or the latter?

        Shechter.


Birsch wrote:
> I checked both sgmllib.py and BeautifulSoup.py - and it seems both are 
> reusing the same regexps (searched for re.compile).
> 
> I think your suggestion is very relevant in this case. It makes sense to 
> replicate the "compile once use many" behavior that is commonly used 
> with regexp.
> 
> -Birsch
> 
> On Thu, Feb 21, 2008 at 7:30 PM, Dino Viehland 
> <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>> wrote:
> 
>     Do you know if the same reg ex is being used repeatedly?  If so
>     maybe we can cache & compile the regex instead of interpretting it
>     all the time.
> 
>      
> 
>     *From:* [EMAIL PROTECTED]
>     <mailto:[EMAIL PROTECTED]>
>     [mailto:[EMAIL PROTECTED]
>     <mailto:[EMAIL PROTECTED]>] *On Behalf Of *Birsch
>     *Sent:* Thursday, February 21, 2008 8:30 AM
> 
>     *To:* Discussion of IronPython
>     *Subject:* Re: [IronPython] Slow Performance of CPython libs?
> 
>      
> 
>     I took on Cooper's advice and profiled with ANTS. Here are the top
>     methods:
> 
>     *Namespace*
> 
>       
> 
>     *Method name*
> 
>       
> 
>     *Time (sec.)*
> 
>       
> 
>     *Time with children (sec.)*
> 
>       
> 
>     *Hit count*
> 
>       
> 
>     *Source file*
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.Go()
> 
>       
> 
>     37.0189
> 
>       
> 
>     94.4676
> 
>       
> 
>     13689612
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.Operator()
> 
>       
> 
>     6.2411
> 
>       
> 
>     6.2411
> 
>       
> 
>     131146274
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.Advance(int i)
> 
>       
> 
>     5.9264
> 
>       
> 
>     8.7202
> 
>       
> 
>     66000263
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.SetOperator(int op)
> 
>       
> 
>     5.5750
> 
>       
> 
>     5.5750
> 
>       
> 
>     131146274
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.Backtrack()
> 
>       
> 
>     5.5692
> 
>       
> 
>     9.4895
> 
>       
> 
>     37781343
> 
>       
> 
>     IronPython.Runtime.Operations
> 
>       
> 
>     Ops.CallWithContext(ICallerContext context, object func, object
>     arg0, object arg1)
> 
>       
> 
>     5.5572
> 
>       
> 
>     114.5245
> 
>       
> 
>     79754
> 
>       
> 
>     IronPython.Runtime.Calls
> 
>       
> 
>     Method.Call(ICallerContext context, object arg0)
> 
>       
> 
>     4.9052
> 
>       
> 
>     114.8251
> 
>       
> 
>     50886
> 
>       
> 
>     IronPython.Runtime.Calls
> 
>       
> 
>     PythonFunction.CallInstance(ICallerContext context, object arg0,
>     object arg1)
> 
>       
> 
>     4.8876
> 
>       
> 
>     114.8059
> 
>       
> 
>     50886
> 
>       
> 
>     IronPython.Runtime.Calls
> 
>       
> 
>     Function2.Call(ICallerContext context, object arg0, object arg1)
> 
>       
> 
>     4.6400
> 
>       
> 
>     114.5471
> 
>       
> 
>     47486
> 
>       
> 
>     IronPython.Runtime.Operations
> 
>       
> 
>     Ops.CallWithContext(ICallerContext context, object func, object arg0)
> 
>       
> 
>     4.2344
> 
>       
> 
>     114.1604
> 
>       
> 
>     146658
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexBoyerMoore.Scan(string text, int index, int beglimit, int endlimit)
> 
>       
> 
>     3.6465
> 
>       
> 
>     3.6465
> 
>       
> 
>     13678131
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexCharClass.CharInClassRecursive(char ch, string set, int start)
> 
>       
> 
>     3.6288
> 
>       
> 
>     5.7113
> 
>       
> 
>     31508162
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.Goto(int newpos)
> 
>       
> 
>     3.2058
> 
>       
> 
>     5.1470
> 
>       
> 
>     27364668
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.Operand(int i)
> 
>       
> 
>     3.1923
> 
>       
> 
>     3.1923
> 
>       
> 
>     73230687
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexRunner.EnsureStorage()
> 
>       
> 
>     3.0803
> 
>       
> 
>     3.0803
> 
>       
> 
>     51474823
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexCharClass.CharInClass(char ch, string set)
> 
>       
> 
>     3.0713
> 
>       
> 
>     8.7827
> 
>       
> 
>     31508162
> 
>       
> 
>     IronPython.Runtime.Calls
> 
>       
> 
>     Method.Call(ICallerContext context, object arg0, object arg1)
> 
>       
> 
>     2.9821
> 
>       
> 
>     7.8675
> 
>       
> 
>     15012
> 
>       
> 
>     IronPython.Runtime.Calls
> 
>       
> 
>     PythonFunction.CallInstance(ICallerContext context, object arg0,
>     object arg1, object arg2)
> 
>       
> 
>     2.9794
> 
>       
> 
>     7.8639
> 
>       
> 
>     15012
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.Forwardcharnext()
> 
>       
> 
>     2.8852
> 
>       
> 
>     2.8852
> 
>       
> 
>     62865185
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.Forwardchars()
> 
>       
> 
>     2.8279
> 
>       
> 
>     2.8279
> 
>       
> 
>     59436277
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexCharClass.CharInClassInternal(char ch, string set, int start,
>     int mySetLength, int myCategoryLength)
> 
>       
> 
>     2.0632
> 
>       
> 
>     2.0826
> 
>       
> 
>     31508162
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexRunner.Scan(Regex regex, string text, int textbeg, int textend,
>     int textstart, int prevlen, bool quick)
> 
>       
> 
>     1.8376
> 
>       
> 
>     101.7226
> 
>       
> 
>     43009
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.FindFirstChar()
> 
>       
> 
>     1.6405
> 
>       
> 
>     5.3456
> 
>       
> 
>     13701755
> 
>       
> 
>     IronPython.Runtime.Types
> 
>       
> 
>     OldClass.TryLookupSlot(SymbolId name, out object ret)
> 
>       
> 
>     1.5573
> 
>       
> 
>     2.8124
> 
>       
> 
>     389516
> 
>       
> 
>     IronPython.Runtime.Operations
> 
>       
> 
>     Ops.GetAttr(ICallerContext context, object o, SymbolId name)
> 
>       
> 
>     1.5365
> 
>       
> 
>     5.3456
> 
>       
> 
>     558524
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.Textpos()
> 
>       
> 
>     1.4020
> 
>       
> 
>     1.4020
> 
>       
> 
>     32648926
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.Advance()
> 
>       
> 
>     1.1916
> 
>       
> 
>     2.9526
> 
>       
> 
>     13703950
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.Textto(int newpos)
> 
>       
> 
>     1.1218
> 
>       
> 
>     1.1218
> 
>       
> 
>     24120890
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.TrackPeek()
> 
>       
> 
>     1.0579
> 
>       
> 
>     1.0579
> 
>       
> 
>     24120894
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     Regex.Run(bool quick, int prevlen, string input, int beginning, int
>     length, int startat)
> 
>       
> 
>     0.7280
> 
>       
> 
>     102.4644
> 
>       
> 
>     43009
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.TrackPush(int I1)
> 
>       
> 
>     0.6834
> 
>       
> 
>     0.6834
> 
>       
> 
>     13745149
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.StackPush(int I1)
> 
>       
> 
>     0.6542
> 
>       
> 
>     0.6542
> 
>       
> 
>     13703955
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.TrackPop()
> 
>       
> 
>     0.6068
> 
>       
> 
>     0.6068
> 
>       
> 
>     13663035
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.TrackPush()
> 
>       
> 
>     0.6049
> 
>       
> 
>     0.6049
> 
>       
> 
>     13708230
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.StackPop()
> 
>       
> 
>     0.5836
> 
>       
> 
>     0.5836
> 
>       
> 
>     13703956
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.Bump()
> 
>       
> 
>     0.4987
> 
>       
> 
>     0.4987
> 
>       
> 
>     10472790
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.TrackPush(int I1, int I2)
> 
>       
> 
>     0.4864
> 
>       
> 
>     0.4864
> 
>       
> 
>     10472790
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.TrackPeek(int i)
> 
>       
> 
>     0.4663
> 
>       
> 
>     0.4663
> 
>       
> 
>     10457859
> 
>       
> 
>     System.Text.RegularExpressions
> 
>       
> 
>     RegexInterpreter.TrackPop(int framesize)
> 
>       
> 
>     0.4396
> 
>       
> 
>     0.4396
> 
>       
> 
>     10457859
> 
>       
> 
> 
>     Moving up the stack of regex.Go(), most calls originate from
>     sgmllib's parse_starttag.
> 
>     HTH,
>     -Birsch
> 
>     On Thu, Feb 21, 2008 at 2:35 PM, Birsch <[EMAIL PROTECTED]
>     <mailto:[EMAIL PROTECTED]>> wrote:
> 
>     Thanks Michael and Dino.
> 
>     I'll prof and send update. Got a good profiler recommendation for .Net?
>     Meanwhile I noticed the sample site below causes BeautifulSoup to
>     generate quite a few [python] exceptions during __init__. Does
>     IronPython handle exceptions significantly slower than CPtyhon?
> 
>     Repro code is simple (just build a BeautifulSoup obj with mininova's
>     home page).
>     Here are the .py and .cs I used to time the diffs:
> 
>     *bstest.py:*
>     #Bypass CPython default socket implementation with IPCE/FePy
>     import imp, os, sys
>     sys.modules['socket'] = module = imp.new_module('socket')
>     execfile('socket.py', module.__dict__)
> 
>     from BeautifulSoup import BeautifulSoup
>     from urllib import urlopen
>     import datetime
> 
>     def getContent(url):
>         #Download html data
>         startTime = datetime.datetime.now()
>         print "Getting url", url
>         html = urlopen(url).read()
>         print "Time taken:", datetime.datetime.now() - startTime
>        
>         #Make soup
>         startTime = datetime.datetime.now()
>         print "Making soup..."
>         soup = BeautifulSoup(markup=html)
>         print "Time taken:", datetime.datetime.now() - startTime
> 
>     if __name__ == "__main__":
>         print getContent("www.mininova.org <http://www.mininova.org>")
> 
> 
>     *C#:*
>     using System;
>     using System.Collections.Generic;
>     using System.Text;
>     using IronPython.Hosting;
> 
>     namespace IronPythonBeautifulSoupTest
>     {
>         public class Program
>         {
>             public static void Main(string[] args)
>             {
>                 //Init
>                 System.Console.WriteLine("Starting...");
>                 DateTime start = DateTime.Now;
>                 PythonEngine engine = new PythonEngine();
> 
>                 //Add paths:
>                 //BeautifulSoup.py, socket.py, bstest.py located on exe dir
>                 engine.AddToPath(@".");
>                 //CPython Lib (replace with your own)
>                 engine.AddToPath(@"D:\Dev\Python\Lib");
> 
>                 //Import and load
>                 TimeSpan span = DateTime.Now - start;
>                 System.Console.WriteLine("[1] Import: " +
>     span.TotalSeconds);
>                 DateTime d = DateTime.Now;
>                 engine.ExecuteFile(@"bstest.py");
>                 span = DateTime.Now - d;
>                 System.Console.WriteLine("[2] Load: " + span.TotalSeconds);
> 
>                 //Execute
>                 d = DateTime.Now;
>                 engine.Execute("getContent(\"http://www.mininova.org\";)");
>                 span = DateTime.Now - d;
>                 System.Console.WriteLine("[3] Execute: " +
>     span.TotalSeconds);
>                 span = DateTime.Now - start;
>                 System.Console.WriteLine("Total: " + span.TotalSeconds);
> 
> 
>             }
>         }
>     }
> 
> 
>     On Wed, Feb 20, 2008 at 6:57 PM, Dino Viehland
>     <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>>
>     wrote:
> 
>     We've actually had this issue reported once before a long time ago -
>     it's a very low CodePlex ID -
>     http://www.codeplex.com/IronPython/WorkItem/View.aspx?WorkItemId=651
> 
>     We haven't had a chance to investigate the end-to-end scenario.  If
>     someone could come up with a smaller simpler repro that'd be great.
>      Otherwise we haven't forgotten about it we've just had more
>     immediately pressing issues to work on :(.
> 
> 
>     -----Original Message-----
>     From: [EMAIL PROTECTED]
>     <mailto:[EMAIL PROTECTED]>
>     [mailto:[EMAIL PROTECTED]
>     <mailto:[EMAIL PROTECTED]>] On Behalf Of Michael Foord
>     Sent: Wednesday, February 20, 2008 5:20 AM
>     To: Discussion of IronPython
>     Subject: Re: [IronPython] Slow Performance of CPython libs?
> 
>     Birsch wrote:
>      > Hi - We've been using IronPython successfully to allow extensibility
>      > of our application.
>      >
>      > Overall we are happy with the performance, with the exception of
>      > BeautifulSoup which seems to run very slowly: x5 or more time to
>      > execute compared to CPython.
>      >
>      > Most of the time seems to be spent during __init__() of BS, where the
>      > markup is parsed.
>      >
>      > We suspect this has to do with the fact that our CPython env is
>      > executing .pyc files and can precompile its libs, while the
>     IronPython
>      > environment compiles each iteration. We couldn't find a way to
>      > pre-compile the libs and then introduce them into the code, but
>     in any
>      > case this will result in a large management overhead since the amount
>      > of CPython libs we expose to our users contains 100's of modules.
>      >
>      > Any ideas on how to optimize?
> 
>     I think it is worth doing real profiling to find out where the time is
>     being spent during parsing.
> 
>     If it is spending most of the time in '__init__' then the time is
>     probably not spent in importing - so compilation isn't relevant and it
>     is a runtime performance issue. (Importing is much slower with
>     IronPython and at Resolver Systems we do use precompiled binaries - but
>     strangely enough it doesn't provide much of a performance gain.)
> 
>     Michael
>     http://www.manning.com/foord
> 
>      >
>      > Thanks,
>      > -Birsch
>      >
>      > Note: we're using FePy/IPCE libs with regular IP v1.1.1 runtime DLLs
>      > (this was done to overcome library incompatibilities and network
>      > errors). However, the relevant slow .py code (mainly SGMLParser and
>      > BeautifulSoup) is the same.
>      >
>     ------------------------------------------------------------------------
>      >
>      > _______________________________________________
>      > Users mailing list
>      > Users@lists.ironpython.com <mailto:Users@lists.ironpython.com>
>      > http://lists.ironpython.com/listinfo.cgi/users-ironpython.com
>      >
> 
>     _______________________________________________
>     Users mailing list
>     Users@lists.ironpython.com <mailto:Users@lists.ironpython.com>
>     http://lists.ironpython.com/listinfo.cgi/users-ironpython.com
>     _______________________________________________
>     Users mailing list
>     Users@lists.ironpython.com <mailto:Users@lists.ironpython.com>
>     http://lists.ironpython.com/listinfo.cgi/users-ironpython.com
> 
>      
> 
>      
> 
> 
>     _______________________________________________
>     Users mailing list
>     Users@lists.ironpython.com <mailto:Users@lists.ironpython.com>
>     http://lists.ironpython.com/listinfo.cgi/users-ironpython.com
> 
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Users mailing list
> Users@lists.ironpython.com
> http://lists.ironpython.com/listinfo.cgi/users-ironpython.com
_______________________________________________
Users mailing list
Users@lists.ironpython.com
http://lists.ironpython.com/listinfo.cgi/users-ironpython.com

Reply via email to