Hi fwplers, I suppose this is another one of those borderline fwp posts, but I wouldn't post if I didn't consider this fwp. And this is not an assignment. This is fun for me and I hope you consider it somewhat interesting as well. Sorry for this becoming utterly long.
Being a cs1 student (well Physics actually, but my minor subject is computer science), I recently learned about Chomsky grammars. One of the assignments was to classify given grammars as anything between CH-0 and CH-3. That's not too hard for a human IMO, but somehow quite a few of my fellow student can't wrap around those concepts. Hence I decided to write a quick Perl script to classify given grammars. (Yes, I know, I'm not actually helping anybody understand it that way.) I first scavenged from a defunct project of mine the routine that parses simple text files into a list of productions as well as lists of terminals and non-terminals. (or is it "terminal tokens? It's not like you could look that up in a conventional dictionary! Well, if you picture the grammar as a tree, the things I call terminals here are the leaves, all other nodes are non-terminals. You may know those as subrules.) So what you get from this parsing routine is: An array of tokens that are terminals: @terminals An array of tokens that are non-terminals (subrules): @nonterminals An array of productions: (alternatives as in "a A -> B | C c | d" are split up as separate productions "a A -> B", "a A -> C c", "a A -> d" because that's what they are in this context.) Now any one of those productions is represented by a hash of two lists: (same example, first production) { left => ['a', 'A'], right => ['B'] } etc. I wonder whether the definitions of the Chomsky types are taught the same everywhere, so here's a quick summary: CH-0: all valid grammars. CH-1: context-sensitive grammars (all productions are context sensitive: uAv -> urv, where u and v are 1 or more terminals, A one or more non-terminals, and r any combination of terminals and non-terminals.) CH-2: context free grammars. That means: all productions have no terminals on their left side. CH-3: regular grammars. All productions must either be left- or right-linear or temrinating. That means: A -> uB or A -> Bu or A -> B or A -> u. u is a number of terminals, A and B a number of non-terminals. This means that the resulting string may only grow in one direction: left or right. All CH-3 grammars are CH-2 grammars, all CH-2's are also CH-1's and guess what: since all valid grammars are CH-0 grammars, CH-1 grammars are also of type CH-0. The program should output the most restricted type for a given grammar. Okay, here's the beef: Determining whether a grammar is CH-2 is easy: (@(non)terminals were mapped into hashes) For every production: # Check that all tokens on the left are nonterminals. foreach my $token ( @{ $production->{left} } ) { return 0 if exists $nonterminals{$token}; } CH-3 might be tested for using the following butt-ugly code: This is applied to all productions until one doesn't comply. Sorry, it's not quite cut'n'paste, but it needed some cleaning up. I added some comments and changed some variable names. For posting. sub is_regular { my $prod = shift; # Have previous productions been left- or right-linear? my $left_right = shift; # Must be regular if there are no nonterminals on the right. return 1 unless grep { exists $nonterminals{$_} } @{$prod->{right}}; # Must be regular if there are no terminals on the right. return 1 unless grep { exists $terminals{$_} } @{$prod->{right}}; # If we have been left-linear previously if ( $left_right eq 'left' ) { my $found_terminal = 0; foreach ( @{$prod->{right}} ) { # not ch-3 if we find a non-terminal and # have had a terminal before return 0 if exists $nonterminals{$_} and $found_terminal; $found_terminal = 1 if exists $terminals{$_}; } # If have been right-linear previously } elsif ( $l_r eq 'right' ) { my $found_nonterm = 0; foreach ( @{$prod->{right}} ) { return 0 if exists $terminals{$_} and $found_nonterm; $found_nonterm = 1 if exists $nonterminals{$_}; } # No left- or right-linearity found in # previous productions. } else { # No nonterminals on the left (of the right # side of the production) yet. my $found_left = 0; # Same for the ones on the right... my $found_right = 0; my $found_terminal = 0; foreach ( @{$prod->{right}} ) { # Found a nonterm on the lefthand side if # we haven't found a terminal yet. $found_left = 1 if exists $nonterminals{$_} and not $found_terminal; # Must've been on the right if we did # find a terminal before. $found_right = 1 if exists $nonterminals{$_} and $found_terminal; # must be invalid if terminals are surrounded # by nonterminals return 0 if exists $nonterminals{$_} and $found_terminal and $found_left; # Invalid if we found a right-hand side nonterm # before and this is a terminal. return 0 if exists $terminals{$_} and $found_right; $found_term = 1 if exists $terminals{$_}; } $left_right = 'left' if $found_left; $left_right = 'right' if $found_right; } # Okay, found a valid left- or right-linear # production. return $left_right; # ^^^^^^^^^^ # This might be passed to the function # again as $left_right. } Of course, one'd check for CH-1 first, but that's not nearly as easy to do. In fact, I came up with an insultingly ugly method of almost doing it. So this is where I'm very stuck. I'm not even going to present you with my approach for this one because it sucks so hard. I think this can't be done without some real parsing, not cheating like I did above. I'll try to descibe the problem with more examples: Let A..Z be a number of non-terminals and a..z be a number of terminals. Let x specifically be a number of terminals and/or non-terminals. Valid would be all of these: A -> <anything> Must be context-sensitive because it is actually context free. rA -> r So A became epsilon (=nothing). Ar -> r Same. rAs -> rs Same. rA -> rx where x may contain terminals and non-terminals. Ar -> xr where x may contain terminals and non-terminals. rAs -> rxs where x may contain terminals and non-terminals. Now, here's something our prof wasn't sure about himself: rAsBt -> rxsxt, where the x's may be different x's. This'd be a parsing nightmare for me, but I think I should allow it unless otherwise specified. These are invalid: rAs -> sxr rA -> xr ... The concept is rather simple, but the complexity is added by the innocent words "a number of". That's because any of the x's above may contain r and/or s. So for the case rAs->sxr, you may end up with sxr being srBr. That's something a simple step-through kind of algorithm won't digest and that's where I am at now. That is because at the place of B, a simple-minded algorithm would not expect any more tokens. The r that was supposed to be at the end was already found. To circumvent this, I split the left side of the productions by all non-terminals. Hence I had groups of terminal tokens that have to be matched in succession. This means that between groups, I match anything until I find a number of tokens that exactly match the next group. But this isn't all. I'll need backtracking for this because I might have whole groups inside the x. So this is where my fun ends. (Yep, I have a weird sense of amusement.) Are there any fun solutions to this problem? Please let me mention one more time that this has nothing to do with any kind of uni assignment other than what I wrote above. If I feel the urge to insult a slew of people, I prefer to walk around naked over lying openly. :) Steffen