I somehow missed it the last dozen times I searched for a similar module, but it looks like Sherzod Ruzmetov's Parse::Syntax is designed to do just what I've done. Luckily, it's listed as unimplemented, so my efforts aren't as wasted as they could have been. I've been calling mine Syntax::Highlight locally, in the spirit of Syntax::Highlight::Perl (which is the only module in the Syntax:: namespace). HTML::SyntaxHighlighter also exists (a horrible name), and I saw Log::Colorize discussed here this past June, but those were the only related modules I found.

The module is a customizable, extensible, language-neutral syntax highlighter. For now it gets the syntax for a particular language from EditPlus grammar files, though it supports only a subset of their features. The grammar source could be swapped for something else, as I haven't settled any license details on using the grammars. Parsing a grammar file is pretty trivial in the grand scheme of things, so a change would be quick if necessary.

First, for those interested, a demonstration: http://lorax.no-ip.com/cgi-bin/test.cgi
and a current code listing: http://lorax.no-ip.com/cgi-bin/highlight.cgi


One major limitation of the method I use to parse right now is that delimiters can't be part of keywords. That means highlighting markup languages like HTML, where <, /, and > are all delimiters as well as commonly part of keywords, works a little strangely. 'a' is considered a keyword for an anchor tag, so in <a href=... tags the 'a' is highlighted, but so is every bareword 'a' in the entire document, most of which won't be anchor tags. For this reason, though I haven't checked compatibility, I think I'm going to see about outsourcing any HTML markup to HTML::SyntaxHighlighter. This brings up the subclassing issue, and the idea of using this module as a generic interface to language-specific syntax parsers... but if I start on that, this e-mail is going to be a lot longer than it needs to be right now.
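
To make the limitation concrete, here is a minimal sketch of the split-on-delimiters approach. It is not the module's actual code: the keyword table, the delimiter set, and the highlight() name are made up for illustration, and square brackets stand in for real highlighting.

```perl
use strict;
use warnings;

# Hypothetical keyword table; the real one would come from a grammar file.
my %keyword = map { $_ => 1 } qw(a href img);

# Split on delimiters (keeping them as tokens via the capture group),
# then mark every token that appears in the keyword table.
sub highlight {
    my ($text) = @_;
    my @tokens = split /([<>\/=\s"])/, $text;
    return join '', map { $keyword{$_} ? "[$_]" : $_ } @tokens;
}

# The 'a' in the tags is marked, but so is the bareword 'a' in the link text.
print highlight('<a href="x">a link</a>'), "\n";
```

Because '<' is split off before the lookup happens, the tokenizer has no way to tell a tag name from an ordinary word.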

I have two related issues. First, since the module has a supporting data file that's required to run, how is that distributed? Does it get installed somewhere alongside the module, with the installation altering the module's code to refer to the installed location? Or do I just include the file in the distribution as an example input and force the user to put it somewhere and point to it through the module's runtime config?
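
One possible middle ground, sketched below, is to let the runtime config win and fall back to a default location. Everything here is an assumption for the sake of discussion: the grammar_path() name, the 'grammar' option, the SYNTAX_HIGHLIGHT_GRAMMAR environment variable, and the grammars.dat file name are all hypothetical.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Spec;

# Hypothetical lookup order:
#   1. an explicit 'grammar' option supplied by the caller,
#   2. an environment variable,
#   3. a file sitting in the same directory as this source file.
sub grammar_path {
    my (%opt) = @_;
    return $opt{grammar} if defined $opt{grammar};
    return $ENV{SYNTAX_HIGHLIGHT_GRAMMAR}
        if defined $ENV{SYNTAX_HIGHLIGHT_GRAMMAR};
    my $dir = dirname(File::Spec->rel2abs(__FILE__));
    return File::Spec->catfile($dir, 'grammars.dat');
}
```

With something like this, the installer never has to rewrite the module's code; it only has to drop the data file next to the module.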

Second, I've been doing a lot of benchmarking on the highlighter itself, but scanning and loading a big composite grammar file to get the language syntax, before the highlighter even starts, is now the slow spot. When highlighting a 600-line file, 35% of the total time is spent reading the grammar. Considering a module like this would be best used in programmers' forums, typical inputs would be much shorter, pushing the grammar parse percentage even higher. Since the grammars aren't going to change much, there's really no reason to parse them on every run. In the module, each language's grammar is just a hash-based data structure. With all the good serializers available, they could just be dumped to a file with Storable at worst, or inserted into a BLOB-type field in a database at best.
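
A rough sketch of the Storable option, assuming a cache file that is rebuilt whenever the grammar file is newer than it. The parse_grammar() here is only a stand-in that reads a made-up "lang: keyword keyword ..." format, not the EditPlus parser, and load_grammar() is a hypothetical name.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);

# Stand-in parser (hypothetical format): one "lang: kw kw ..." line per language.
sub parse_grammar {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my %grammar;
    while (my $line = <$fh>) {
        chomp $line;
        my ($lang, $words) = split /\s*:\s*/, $line, 2;
        next unless defined $words;
        $grammar{$lang}{keywords} = { map { $_ => 1 } split ' ', $words };
    }
    close $fh;
    return \%grammar;
}

# Return the grammar hash, using the Storable cache when it is at least
# as new as the grammar file, and rebuilding the cache otherwise.
sub load_grammar {
    my ($file, $cache) = @_;
    # -M gives the file's age in days, so a smaller value means newer.
    if (-e $cache && -M $cache <= -M $file) {
        return retrieve($cache);
    }
    my $grammar = parse_grammar($file);
    nstore($grammar, $cache);    # network byte order, portable across machines
    return $grammar;
}
```

Only the first call after a grammar change pays the parsing cost; later calls just deserialize the hash.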

I guess at this point I'm leaning toward depending on the user to supply the grammar data at runtime, but giving them options on how to supply it (in the future). I'll hold off on grammar caching for now, as I don't suspect this is going to be used in time-sensitive places very soon. Granted, it's not slow. I've spent substantial time and effort benchmarking, profiling, and optimizing. For example, my P3-450 running FreeBSD highlights the 7125-line CGI.pm in just over seven seconds, and my XP2000 running Win2k does it in 3.2s.

Still, I'd love to hear some ideas on how to best handle caching, target namespace, or any other thoughts/reactions to this grand scheme.

Thanks for reading,
Andrew

PS - I'd like to thank the authors of Devel::ptkdb, Devel::Profile, and Benchmark::Timer for making my life easier.


