On Tuesday 13 May 2008, Robert Fischer wrote:
> > The problem, as I understand it, is in writing parsers. Your standard
> > finite-automata-based regular expression library or lexical analyzer
> > is based, at its heart, on a table lookup: you have a 2D array whose
> > size is the number of input characters times the number of states.
> > For ASCII input, the number of possible input characters is small,
> > 256 at most. 256 input characters times hundreds of states isn't that
> > big a table; we're looking at sizes in the tens of kilobytes, easily
> > handled even in the bad old days of 64K segments. Even going to UTF-16
> > ups the number of input characters from 256 to 65,536, and now a
> > moderately large state machine (hundreds of states) weighs in at tens
> > of megabytes of table space. And, of course, if you try to handle the
> > entire 31-bit Unicode code point space, welcome to really large
> > tables :-).
> >
> > The solution, I think, is to change the implementation of your finite
> > automata to use some data structure smarter than a flat 2D array, but
> > that's me.
>
> Yes. It is certainly possible to write slow code to solve this problem.
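For concreteness, here is one shape that "data structure smarter than a
flat 2D array" could take: each state keeps a sorted array of disjoint
code-point ranges, and a step is a binary search. This is a minimal
OCaml sketch, not code from the thread; the types, names, and the toy
DFA at the end are made up.

(* Per-state transitions as sorted, disjoint code-point ranges.  Space
   is proportional to the number of distinct ranges, not to the size of
   the code-point space. *)
type range = { lo : int; hi : int; target : int }

type state = range array   (* sorted by [lo], ranges disjoint *)
type dfa = state array

(* Successor of [st] on code point [cp], or None if the automaton has
   no transition (the lexer rejects). *)
let step (st : state) (cp : int) : int option =
  let rec search lo hi =
    if lo > hi then None
    else
      let mid = (lo + hi) / 2 in
      let r = st.(mid) in
      if cp < r.lo then search lo (mid - 1)
      else if cp > r.hi then search (mid + 1) hi
      else Some r.target
  in
  search 0 (Array.length st - 1)

(* Run from state 0 over a list of code points. *)
let run (dfa : dfa) (cps : int list) : int option =
  List.fold_left
    (fun acc cp -> match acc with
       | None -> None
       | Some s -> step dfa.(s) cp)
    (Some 0) cps

(* Toy DFA accepting one or more ASCII digits:
   run digits [0x31; 0x32; 0x33] = Some 1 *)
let digits : dfa =
  [| [| { lo = 0x30; hi = 0x39; target = 1 } |];    (* state 0: start  *)
     [| { lo = 0x30; hi = 0x39; target = 1 } |] |]  (* state 1: accept *)

A state with k outgoing ranges costs O(k) words and a step costs
O(log k) compares, so hundreds of states come to kilobytes even over
the full code-point space.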
With "slow code" you could have meant one of two things:

1. Table lookup globally replaced by compares-and-jumps. The latter
   benefit from branch prediction and speculative execution, so it's
   not slow anymore.

2. Table "compression", where a few compares-and-jumps remove huge
   "unused" swaths of the table. By "unused" I mean "bomb out with an
   internal error". (A standard form of this is sketched below.)

I think you're being silly. Stop it.

Cheers,
Kuba
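To make point 2 concrete: one standard form of table compression maps
code points to equivalence classes first (code points the automaton
never tells apart share a class), then indexes a small dense table by
state and class. Every removed swath collapses into a single "no
transition" entry. A minimal OCaml sketch with a made-up three-state
lexer; nothing here is from the thread:

(* Toy classifier: three classes are enough for a lexer that only
   distinguishes ASCII digits, ASCII letters, and everything else. *)
let n_classes = 3

let char_class (cp : int) : int =
  if cp >= 0x30 && cp <= 0x39 then 0              (* digit  *)
  else if (cp >= 0x41 && cp <= 0x5A)
       || (cp >= 0x61 && cp <= 0x7A) then 1       (* letter *)
  else 2                                          (* other  *)

(* Dense table, states x classes: state 0 = start, 1 = number,
   2 = identifier.  -1 is the single "bomb out" entry standing in for
   all the removed swaths. *)
let table = [|
  (* state 0 *) 1;  2; -1;
  (* state 1 *) 1; -1; -1;
  (* state 2 *) 2;  2; -1;
|]

let step (state : int) (cp : int) : int =
  table.(state * n_classes + char_class cp)

A full-Unicode classifier would itself be a table, but a compact one
(typically two-level, indexed by the high and low bits of the code
point), so a step stays a couple of memory references either way.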