Just out of curiosity: I'm guessing there are two levels of caching that could be used. One would be to hold a global dictionary of some sort, mapping pattern -> Regex, which would save the lengthy parsing of the re each time it is used.
The other would be to actually call Regex.CompileToAssembly to get a more efficient representation (at run time, that is) of the re.

Would you (the IP team) favor the former or the latter?

Shechter.

Birsch wrote:
> I checked both sgmllib.py and BeautifulSoup.py, and it seems both are
> reusing the same regexps (searched for re.compile).
>
> I think your suggestion is very relevant in this case. It makes sense to
> replicate the "compile once use many" behavior that is commonly used
> with regexp.
>
> -Birsch
>
> On Thu, Feb 21, 2008 at 7:30 PM, Dino Viehland <[EMAIL PROTECTED]> wrote:
>
> Do you know if the same reg ex is being used repeatedly? If so
> maybe we can cache & compile the regex instead of interpreting it
> all the time.
>
> *From:* [EMAIL PROTECTED] *On Behalf Of* Birsch
> *Sent:* Thursday, February 21, 2008 8:30 AM
> *To:* Discussion of IronPython
> *Subject:* Re: [IronPython] Slow Performance of CPython libs?
>
> I took on Cooper's advice and profiled with ANTS.
> Here are the top methods:
>
> Namespace | Method name | Time (sec.) | Time with children (sec.) | Hit count | Source file
> System.Text.RegularExpressions | RegexInterpreter.Go() | 37.0189 | 94.4676 | 13689612 |
> System.Text.RegularExpressions | RegexInterpreter.Operator() | 6.2411 | 6.2411 | 131146274 |
> System.Text.RegularExpressions | RegexInterpreter.Advance(int i) | 5.9264 | 8.7202 | 66000263 |
> System.Text.RegularExpressions | RegexInterpreter.SetOperator(int op) | 5.5750 | 5.5750 | 131146274 |
> System.Text.RegularExpressions | RegexInterpreter.Backtrack() | 5.5692 | 9.4895 | 37781343 |
> IronPython.Runtime.Operations | Ops.CallWithContext(ICallerContext context, object func, object arg0, object arg1) | 5.5572 | 114.5245 | 79754 |
> IronPython.Runtime.Calls | Method.Call(ICallerContext context, object arg0) | 4.9052 | 114.8251 | 50886 |
> IronPython.Runtime.Calls | PythonFunction.CallInstance(ICallerContext context, object arg0, object arg1) | 4.8876 | 114.8059 | 50886 |
> IronPython.Runtime.Calls | Function2.Call(ICallerContext context, object arg0, object arg1) | 4.6400 | 114.5471 | 47486 |
> IronPython.Runtime.Operations | Ops.CallWithContext(ICallerContext context, object func, object arg0) | 4.2344 | 114.1604 | 146658 |
> System.Text.RegularExpressions | RegexBoyerMoore.Scan(string text, int index, int beglimit, int endlimit) | 3.6465 | 3.6465 | 13678131 |
> System.Text.RegularExpressions | RegexCharClass.CharInClassRecursive(char ch, string set, int start) | 3.6288 | 5.7113 | 31508162 |
> System.Text.RegularExpressions | RegexInterpreter.Goto(int newpos) | 3.2058 | 5.1470 | 27364668 |
> System.Text.RegularExpressions | RegexInterpreter.Operand(int i) | 3.1923 | 3.1923 | 73230687 |
> System.Text.RegularExpressions | RegexRunner.EnsureStorage() | 3.0803 | 3.0803 | 51474823 |
> System.Text.RegularExpressions | RegexCharClass.CharInClass(char ch, string set) | 3.0713 | 8.7827 | 31508162 |
> IronPython.Runtime.Calls | Method.Call(ICallerContext context, object arg0, object arg1) | 2.9821 | 7.8675 | 15012 |
> IronPython.Runtime.Calls | PythonFunction.CallInstance(ICallerContext context, object arg0, object arg1, object arg2) | 2.9794 | 7.8639 | 15012 |
> System.Text.RegularExpressions | RegexInterpreter.Forwardcharnext() | 2.8852 | 2.8852 | 62865185 |
> System.Text.RegularExpressions | RegexInterpreter.Forwardchars() | 2.8279 | 2.8279 | 59436277 |
> System.Text.RegularExpressions | RegexCharClass.CharInClassInternal(char ch, string set, int start, int mySetLength, int myCategoryLength) | 2.0632 | 2.0826 | 31508162 |
> System.Text.RegularExpressions | RegexRunner.Scan(Regex regex, string text, int textbeg, int textend, int textstart, int prevlen, bool quick) | 1.8376 | 101.7226 | 43009 |
> System.Text.RegularExpressions | RegexInterpreter.FindFirstChar() | 1.6405 | 5.3456 | 13701755 |
> IronPython.Runtime.Types | OldClass.TryLookupSlot(SymbolId name, out object ret) | 1.5573 | 2.8124 | 389516 |
> IronPython.Runtime.Operations | Ops.GetAttr(ICallerContext context, object o, SymbolId name) | 1.5365 | 5.3456 | 558524 |
> System.Text.RegularExpressions | RegexInterpreter.Textpos() | 1.4020 | 1.4020 | 32648926 |
> System.Text.RegularExpressions | RegexInterpreter.Advance() | 1.1916 | 2.9526 | 13703950 |
> System.Text.RegularExpressions | RegexInterpreter.Textto(int newpos) | 1.1218 | 1.1218 | 24120890 |
> System.Text.RegularExpressions | RegexInterpreter.TrackPeek() | 1.0579 | 1.0579 | 24120894 |
> System.Text.RegularExpressions | Regex.Run(bool quick, int prevlen, string input, int beginning, int length, int startat) | 0.7280 | 102.4644 | 43009 |
> System.Text.RegularExpressions | RegexInterpreter.TrackPush(int I1) | 0.6834 | 0.6834 | 13745149 |
> System.Text.RegularExpressions | RegexInterpreter.StackPush(int I1) | 0.6542 | 0.6542 | 13703955 |
> System.Text.RegularExpressions | RegexInterpreter.TrackPop() | 0.6068 | 0.6068 | 13663035 |
> System.Text.RegularExpressions | RegexInterpreter.TrackPush() | 0.6049 | 0.6049 | 13708230 |
> System.Text.RegularExpressions | RegexInterpreter.StackPop() | 0.5836 | 0.5836 | 13703956 |
> System.Text.RegularExpressions | RegexInterpreter.Bump() | 0.4987 | 0.4987 | 10472790 |
> System.Text.RegularExpressions | RegexInterpreter.TrackPush(int I1, int I2) | 0.4864 | 0.4864 | 10472790 |
> System.Text.RegularExpressions | RegexInterpreter.TrackPeek(int i) | 0.4663 | 0.4663 | 10457859 |
> System.Text.RegularExpressions | RegexInterpreter.TrackPop(int framesize) | 0.4396 | 0.4396 | 10457859 |
>
> Moving up the stack of regex.Go(), most calls originate from
> sgmllib's parse_starttag.
>
> HTH,
> -Birsch
>
> On Thu, Feb 21, 2008 at 2:35 PM, Birsch <[EMAIL PROTECTED]> wrote:
>
> Thanks Michael and Dino.
>
> I'll prof and send update. Got a good profiler recommendation for .Net?
> Meanwhile I noticed the sample site below causes BeautifulSoup to
> generate quite a few [python] exceptions during __init__. Does
> IronPython handle exceptions significantly slower than CPython?
>
> Repro code is simple (just build a BeautifulSoup obj with mininova's
> home page).
> Here are the .py and .cs I used to time the diffs:
>
> *bstest.py:*
> #Bypass CPython default socket implementation with IPCE/FePy
> import imp, os, sys
> sys.modules['socket'] = module = imp.new_module('socket')
> execfile('socket.py', module.__dict__)
>
> from BeautifulSoup import BeautifulSoup
> from urllib import urlopen
> import datetime
>
> def getContent(url):
>     #Download html data
>     startTime = datetime.datetime.now()
>     print "Getting url", url
>     html = urlopen(url).read()
>     print "Time taken:", datetime.datetime.now() - startTime
>
>     #Make soup
>     startTime = datetime.datetime.now()
>     print "Making soup..."
>     soup = BeautifulSoup(markup=html)
>     print "Time taken:", datetime.datetime.now() - startTime
>
> if __name__ == "__main__":
>     #urlopen needs the scheme; getContent returns nothing, so don't print it
>     getContent("http://www.mininova.org")
>
> *C#:*
> using System;
> using System.Collections.Generic;
> using System.Text;
> using IronPython.Hosting;
>
> namespace IronPythonBeautifulSoupTest
> {
>     public class Program
>     {
>         public static void Main(string[] args)
>         {
>             //Init
>             System.Console.WriteLine("Starting...");
>             DateTime start = DateTime.Now;
>             PythonEngine engine = new PythonEngine();
>
>             //Add paths:
>             //BeautifulSoup.py, socket.py, bstest.py located on exe dir
>             engine.AddToPath(@".");
>             //CPython Lib (replace with your own)
>             engine.AddToPath(@"D:\Dev\Python\Lib");
>
>             //Import and load
>             TimeSpan span = DateTime.Now - start;
>             System.Console.WriteLine("[1] Import: " + span.TotalSeconds);
>             DateTime d = DateTime.Now;
>             engine.ExecuteFile(@"bstest.py");
>             span = DateTime.Now - d;
>             System.Console.WriteLine("[2] Load: " + span.TotalSeconds);
>
>             //Execute
>             d = DateTime.Now;
>             engine.Execute("getContent(\"http://www.mininova.org\")");
>             span = DateTime.Now - d;
>             System.Console.WriteLine("[3] Execute: " + span.TotalSeconds);
>             span = DateTime.Now - start;
>             System.Console.WriteLine("Total: " + span.TotalSeconds);
>         }
>     }
> }
>
> On Wed, Feb 20, 2008 at 6:57 PM, Dino Viehland <[EMAIL PROTECTED]> wrote:
>
> We've actually had this issue reported once before a long time ago -
> it's a very low CodePlex ID -
> http://www.codeplex.com/IronPython/WorkItem/View.aspx?WorkItemId=651
>
> We haven't had a chance to investigate the end-to-end scenario. If
> someone could come up with a smaller, simpler repro that'd be great.
> Otherwise, we haven't forgotten about it; we've just had more
> immediately pressing issues to work on :(.
>
> -----Original Message-----
> From: [EMAIL PROTECTED] On Behalf Of Michael Foord
> Sent: Wednesday, February 20, 2008 5:20 AM
> To: Discussion of IronPython
> Subject: Re: [IronPython] Slow Performance of CPython libs?
>
> Birsch wrote:
> > Hi - We've been using IronPython successfully to allow extensibility
> > of our application.
> >
> > Overall we are happy with the performance, with the exception of
> > BeautifulSoup which seems to run very slowly: it takes 5x or more
> > time to execute compared to CPython.
> >
> > Most of the time seems to be spent during __init__() of BS, where the
> > markup is parsed.
> >
> > We suspect this has to do with the fact that our CPython env is
> > executing .pyc files and can precompile its libs, while the IronPython
> > environment compiles each iteration. We couldn't find a way to
> > pre-compile the libs and then introduce them into the code, but in any
> > case this will result in a large management overhead since the set
> > of CPython libs we expose to our users contains 100's of modules.
> >
> > Any ideas on how to optimize?
>
> I think it is worth doing real profiling to find out where the time is
> being spent during parsing.
>
> If it is spending most of the time in '__init__' then the time is
> probably not spent in importing - so compilation isn't relevant and it
> is a runtime performance issue. (Importing is much slower with
> IronPython and at Resolver Systems we do use precompiled binaries - but
> strangely enough it doesn't provide much of a performance gain.)
>
> Michael
> http://www.manning.com/foord
>
> > Thanks,
> > -Birsch
> >
> > Note: we're using FePy/IPCE libs with regular IP v1.1.1 runtime DLLs
> > (this was done to overcome library incompatibilities and network
> > errors). However, the relevant slow .py code (mainly SGMLParser and
> > BeautifulSoup) is the same.
> >
> > _______________________________________________
> > Users mailing list
> > Users@lists.ironpython.com
> > http://lists.ironpython.com/listinfo.cgi/users-ironpython.com