cvsuser     02/07/16 17:39:46

  Added:       .        rx.dev
  Log:
  Added dev file for rx stuff, courtesy Stephen Rawls ([EMAIL PROTECTED])
  
  Revision  Changes    Path
  1.1                  parrot/rx.dev
  
  Index: rx.dev
  ===================================================================
  =head1 NAME
  
  rx.c / rx.h
  
  =head1 SUMMARY
  
  rx.c and rx.h set up functions to be used by the regular expression engine.  
  They also define internal helper functions that add a layer of abstraction to
  the rx_is_X family of functions.  Please also see C<rx.ops>, C<rxstacks.c>, 
  and C<rxstacks.h>.
  
  =head2 rx.c
  
  =over 4
  
  =item B<rx_allocate_info>
  
  Initializes a regular expression object and allocates the memory.
  
  =item B<rx_is_word_character>
  
  =item B<rx_is_number_character>
  
  =item B<rx_is_whitespace_character>
  
  =item B<rx_is_newline>
  
  These functions check whether the character passed as an argument is a 
  word character, a number character, a whitespace character, or a newline, 
  respectively.  Each uses a bitmap to add a layer of abstraction.  A bitmap, 
  in this case, is simply a set of characters.  Instead of scanning a string 
  by hand, these functions create a bitmap of allowable characters (using 
  predefined constants such as RX_WORDCHARS) and call C<bitmap_match>, which 
  checks whether the supplied character is in the bitmap.  Besides adding 
  abstraction, a bitmap lookup for a byte-sized character is a single index 
  and mask operation, which is much faster than a linear search.
  
  
  NOTE: The C<rx_is_number_character> function breaks the abstraction and 
  uses the following expression to test the argument:
  
        if (ch >= '0' && ch <= '9')
  
  The source explains that it is "faster to do less-than/greater-than".
  This is just a speed hack for now; it will change when it needs to (for
  example, to add support for different encodings or languages).
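  
  As a rough illustration of the trade-off described in the note, the sketch
  below compares a range test with a linear search through a class string.
  All names here (SKETCH_DIGITS, sketch_digit_by_search, and so on) are
  hypothetical and are not taken from rx.c; SKETCH_DIGITS merely stands in for
  a predefined class constant such as RX_WORDCHARS.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical class constant; not a name from rx.c. */
static const char SKETCH_DIGITS[] = "0123456789";

/* Membership by linear search through the class string.
 * memchr is given an explicit length so the terminating NUL never matches. */
int sketch_digit_by_search(unsigned char ch)
{
    return memchr(SKETCH_DIGITS, ch, sizeof SKETCH_DIGITS - 1) != NULL;
}

/* Membership by range comparison, as in the expression quoted above.
 * C guarantees that '0'..'9' are contiguous in the execution character set. */
int sketch_digit_by_range(unsigned char ch)
{
    return ch >= '0' && ch <= '9';
}

/* Self-check: both tests must agree for every byte value. */
void sketch_check_digit_tests(void)
{
    int c;
    for (c = 0; c < 256; c++)
        assert(sketch_digit_by_search((unsigned char)c)
               == sketch_digit_by_range((unsigned char)c));
}
```

  The range test only works because the digits are contiguous; a class such as
  word characters has no single range, which is where a bitmap becomes useful.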
  
  =item B<bitmap_make>
  
  This function makes a bitmap from its argument (of type STRING*).  There are
  two cases to examine: a character that fits in one byte, and one that does not.
  
  =over 1
  
  =item B<One byte>
  
  First of all, (255 >> 3) = 31.  The code uses this for a little efficiency in 
  storage and speed.  An internal array is created with 32 byte-sized elements.
  Right-shifting the input character by 3 gives a number between 0 and 31, and
  exactly 8 of the values between 0 and 255 map to each such number.  Each
  element of the array is a bitfield, with a 1 or 0 in each bit to indicate
  whether a particular character is in the bitmap.  So (ch >> 3) takes us to
  the right element of the array for ch, but how do we get to the right bit in
  the bitfield?  The code is 1 << (ch & 7), which gives a distinct power of two
  for each of the 8 characters that map to that bitfield.
  
  =item B<More than one byte>
  
  Here each character is appended to the internal string bigchars (of type
  STRING*).
  
  =back
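  
  The index and mask arithmetic for the one-byte case can be sketched as
  follows.  This is a minimal illustration, not the code from rx.c: the names
  sketch_slot, sketch_mask, sketch_set, sketch_test and sketch_roundtrip are
  hypothetical, and the table is a bare 32-byte array rather than the real
  bitmap type.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SLOTS 32  /* (255 >> 3) + 1 byte-sized bitfields */

/* Index of the bitfield that holds the bit for ch: 0..31. */
unsigned sketch_slot(unsigned char ch)
{
    return ch >> 3;
}

/* Mask selecting ch's bit inside its bitfield: a power of two, 1..128. */
unsigned char sketch_mask(unsigned char ch)
{
    return (unsigned char)(1u << (ch & 7));
}

/* Set the bit for ch in a 32-byte table. */
void sketch_set(unsigned char table[SKETCH_SLOTS], unsigned char ch)
{
    assert(table != NULL);
    table[sketch_slot(ch)] |= sketch_mask(ch);
}

/* Test the bit for ch; returns 1 if it is set, 0 otherwise. */
int sketch_test(const unsigned char table[SKETCH_SLOTS], unsigned char ch)
{
    assert(table != NULL);
    return (table[sketch_slot(ch)] & sketch_mask(ch)) != 0;
}

/* Usage: store one character in an empty table, then probe for another. */
int sketch_roundtrip(unsigned char stored, unsigned char probe)
{
    unsigned char table[SKETCH_SLOTS];
    memset(table, 0, sizeof table);
    sketch_set(table, stored);
    return sketch_test(table, probe);
}
```

  For example, the value 97 lands in element 97 >> 3 = 12 with mask
  1 << (97 & 7) = 2, and the values 96 through 103 all share element 12.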
  
  =item B<bitmap_make_cstr>
  
  This is the same as bitmap_make, except that it is called with a const char*
  argument.  Because of this, it knows there will be no bigchars, so it is only
  concerned with byte-sized characters.
  
  =item B<bitmap_add>
  
  This function takes a bitmap and a single character, and adds that character
  to the bitmap.  The code for adding the character is the same as in bitmap_make.
  
  =item B<bitmap_match>
  
  This function takes a bitmap and a single character, and checks whether that
  character is in the bitmap.  If the character is more than one byte, the
  function searches the bigchars string linearly (one by one).  If it is a
  byte-sized character, it checks the appropriate bitfield, as described under
  bitmap_make.
  
  =item B<bitmap_destroy>
  
  This deallocates the memory for the bitmap.
  
  =back
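  
  Taken together, the bitmap functions described above might look roughly like
  the sketch below.  It is written under several assumptions and is not the
  rx.c implementation: every sketch_ name is hypothetical, the struct layout is
  invented for illustration, and bigchars is modelled as a growable array of
  unsigned long values rather than the STRING* that the real code uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout; the real bitmap type may differ. */
typedef struct sketch_bitmap {
    unsigned char  bits[32];  /* one bit per byte-sized character */
    unsigned long *bigchars;  /* characters above 255, searched linearly */
    size_t         nbig;      /* number of bigchars stored */
    size_t         capbig;    /* allocated capacity of bigchars */
} sketch_bitmap;

/* Add one character; returns 0 on success, -1 on allocation failure. */
int sketch_bitmap_add(sketch_bitmap *bmp, unsigned long ch)
{
    assert(bmp != NULL);
    if (ch <= 255) {
        bmp->bits[ch >> 3] |= (unsigned char)(1u << (ch & 7));
        return 0;
    }
    if (bmp->nbig == bmp->capbig) {
        size_t newcap = bmp->capbig ? bmp->capbig * 2 : 4;
        unsigned long *p = realloc(bmp->bigchars, newcap * sizeof *p);
        if (p == NULL)
            return -1;
        bmp->bigchars = p;
        bmp->capbig = newcap;
    }
    bmp->bigchars[bmp->nbig++] = ch;
    return 0;
}

/* Build a bitmap from a C string; every character is byte-sized,
 * so the add can never take the allocating path. */
sketch_bitmap *sketch_bitmap_make_cstr(const char *s)
{
    sketch_bitmap *bmp = calloc(1, sizeof *bmp);
    if (bmp == NULL)
        return NULL;
    bmp->bigchars = NULL;
    for (; *s != '\0'; s++)
        sketch_bitmap_add(bmp, (unsigned char)*s);
    return bmp;
}

/* Returns 1 if ch is in the bitmap, 0 otherwise. */
int sketch_bitmap_match(const sketch_bitmap *bmp, unsigned long ch)
{
    size_t i;
    assert(bmp != NULL);
    if (ch <= 255)
        return (bmp->bits[ch >> 3] & (1u << (ch & 7))) != 0;
    for (i = 0; i < bmp->nbig; i++)
        if (bmp->bigchars[i] == ch)
            return 1;
    return 0;
}

/* Release the bitmap and its bigchars storage. */
void sketch_bitmap_destroy(sketch_bitmap *bmp)
{
    if (bmp != NULL) {
        free(bmp->bigchars);
        free(bmp);
    }
}

/* Usage: returns 1 if make, add, match and destroy behave as described. */
int sketch_bitmap_demo(void)
{
    sketch_bitmap *bmp = sketch_bitmap_make_cstr("abc_");
    int ok;
    if (bmp == NULL)
        return 0;
    ok = sketch_bitmap_add(bmp, 0x263AUL) == 0
      && sketch_bitmap_match(bmp, 'b')
      && !sketch_bitmap_match(bmp, 'z')
      && sketch_bitmap_match(bmp, 0x263AUL)
      && !sketch_bitmap_match(bmp, 0x263BUL);
    sketch_bitmap_destroy(bmp);
    return ok;
}
```

  The byte-sized path costs one index and one mask regardless of how many
  characters are in the set, while the bigchars path grows with the number of
  large characters stored.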
  
  =head2 rx.h
  
  =over 4
  
  Here is the definition of rxinfo (all comments are mine):
  
      typedef struct rxinfo {
          STRING *string;       // The string the regex is tested against
          INTVAL index;         // The current spot in string we are checking
          INTVAL startindex;    // Where the regex started checking
          INTVAL success;       // Flag: did the regex match or not?
  
          rxflags flags;        // Flags recording which modifiers were used
          UINTVAL minlength;    // Minimum length string can be and still match
          rxdirection whichway; // Is the regex going forwards or backwards?
  
          PMC *groupstart;      // Indexes for where each group starts
          PMC *groupend;        // Indexes for where each group ends
                                // Groups here are capturing groups,
                                // i.e. $1, $2, etc.
  
          opcode_t *substfunc;  // This is unused.  Originally regexes were
                                // going to handle their own substitutions
                                // (s///).  Now this is not the case.  This
                                // can probably be removed.
  
          IntStack stack;       // An intstack for internal use (backtracking)
      } rxinfo;
  
  
  rx.h also sets up a series of macros for setting and clearing flags on each
  regex, advancing the regex by one character (or a given number of characters),
  and finding the current index.  The macros are listed here; check rx.h for
  their definitions.
  
  =item B<RX_dUNPACK(pmc)>
  
  =item B<RxCurChar(rx)>
  
  =item B<RxAdvance(rx)>
  
  =item B<RxAdvanceX(rx, x)>
  
  =item B<RxCaseInsensitive_on(rx)>
  
  =item B<RxCaseInsensitive_off(rx)>
  
  =item B<RxCaseInsensitive_test(rx)>
  
  =item B<RxSingleLine_on(rx)>
  
  =item B<RxSingleLine_off(rx)>
  
  =item B<RxSingleLine_test(rx)>
  
  =item B<RxMultiline_on(rx)>
  
  =item B<RxMultiline_off(rx)>
  
  =item B<RxMultiline_test(rx)>
  
  =item B<RxReverse_on(rx)>
  
  =item B<RxReverse_off(rx)>
  
  =item B<RxReverse_test(rx)>
  
  =item B<RxFlagOn(rx, flag)>
  
  =item B<RxFlagOff(rx, flag)>
  
  =item B<RxFlagTest(rx, flag)>
  
  =item B<RxFlagsOff(rx)>
  
  =back
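  
  To give a feel for how macros of this shape are typically written, here is a
  small sketch.  It is not copied from rx.h and the real definitions may well
  differ: the Sketch and SKETCH_ names, the flag values, the cut-down struct,
  and the idea that advancing multiplies by a direction step are all
  assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical flag bits; the real rxflags values are defined in rx.h. */
typedef enum sketch_flags {
    SKETCH_CASE_INSENSITIVE = 1 << 0,
    SKETCH_SINGLE_LINE      = 1 << 1,
    SKETCH_MULTILINE        = 1 << 2,
    SKETCH_REVERSE          = 1 << 3
} sketch_flags;

/* Cut-down stand-in for rxinfo with only the fields the macros touch. */
typedef struct sketch_rx {
    unsigned flags;
    long     index;
    int      step;   /* assumed: +1 forwards, -1 backwards */
} sketch_rx;

/* Generic flag operations. */
#define SketchFlagOn(rx, flag)   ((rx)->flags |= (unsigned)(flag))
#define SketchFlagOff(rx, flag)  ((rx)->flags &= ~(unsigned)(flag))
#define SketchFlagTest(rx, flag) (((rx)->flags & (unsigned)(flag)) != 0)
#define SketchFlagsOff(rx)       ((rx)->flags = 0u)

/* Per-flag wrappers built on the generic operations. */
#define SketchReverse_on(rx)     SketchFlagOn((rx), SKETCH_REVERSE)
#define SketchReverse_off(rx)    SketchFlagOff((rx), SKETCH_REVERSE)
#define SketchReverse_test(rx)   SketchFlagTest((rx), SKETCH_REVERSE)

/* Advance by x characters, or by one, in the current direction. */
#define SketchAdvanceX(rx, x)    ((rx)->index += (rx)->step * (long)(x))
#define SketchAdvance(rx)        SketchAdvanceX((rx), 1)

/* Usage: exercises the macros and returns the final index. */
long sketch_flags_demo(void)
{
    sketch_rx rx = { 0u, 0L, 1 };

    SketchFlagOn(&rx, SKETCH_MULTILINE);
    SketchReverse_on(&rx);
    assert(SketchFlagTest(&rx, SKETCH_MULTILINE));
    assert(SketchReverse_test(&rx));

    SketchReverse_off(&rx);
    assert(!SketchReverse_test(&rx));
    assert(SketchFlagTest(&rx, SKETCH_MULTILINE));

    SketchFlagsOff(&rx);
    assert(rx.flags == 0u);

    SketchAdvance(&rx);      /* index is now 1 */
    SketchAdvanceX(&rx, 3);  /* index is now 4 */
    rx.step = -1;
    SketchAdvance(&rx);      /* index is now 3 */
    return rx.index;
}
```

  Building the per-flag wrappers on three generic macros keeps each on, off and
  test variant to a single line, which matches the regular naming in the list
  above.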
  
  
  

