Hi,

I am thinking again about JIT support for callouts, because there seem to be 
people who are interested. In theory, calling a user function is easy. The 
problem is that the internal representation of a JIT-compiled pattern differs 
from the one used by the interpreter. 

Here are the members of the pcre_callout_block structure:

typedef struct pcre_callout_block {
  int          version;           /* Identifies version of block */
    /* OK - Only a constant. */
  /* ------------------------ Version 0 ------------------------------- */
  int          callout_number;    /* Number compiled into pattern */
    /* OK - Only a constant. */
  int         *offset_vector;     /* The offset vector */
    /* See later. */
  PCRE_SPTR    subject;           /* The subject being matched */
    /* OK - easy to retrieve. */
  int          subject_length;    /* The length of the subject */
    /* OK - easy to retrieve. */
  int          start_match;       /* Offset to start of this match attempt */
    /* OK - easy to retrieve. */
  int          current_position;  /* Where we currently are in the subject */
    /* OK - easy to retrieve. */
  int          capture_top;       /* Max current capture */
    /* See later. */
  int          capture_last;      /* Most recently closed capture */
    /* See later. */
  void        *callout_data;      /* Data passed in with the call */
    /* OK - easy to retrieve. */
  /* ------------------- Added for Version 1 -------------------------- */
  int          pattern_position;  /* Offset to next item in the pattern */
    /* OK - Only a constant. */
  int          next_item_length;  /* Length of next item in the pattern */
    /* OK - Only a constant. */
  /* ------------------- Added for Version 2 -------------------------- */
  const unsigned char *mark;      /* Pointer to current mark or NULL    */
    /* OK - easy to retrieve. */
  /* ------------------------------------------------------------------ */
} pcre_callout_block;

Return value: 0 means continue matching, greater than 0 means backtrack, and 
less than 0 means abandon the match and return this value. These options are 
easy to support.
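
To make the three return-value cases concrete, here is a minimal sketch of how 
the caller of a callout might classify the result. The enum and the function 
name are hypothetical and are not part of PCRE or of the JIT compiler; only the 
meaning of the three ranges comes from the description above.

```c
#include <stddef.h>

/* Hypothetical action codes, used only for this illustration. */
enum callout_action { ACTION_CONTINUE, ACTION_BACKTRACK, ACTION_ABANDON };

/* Classifies a callout return value:
     rc == 0  -> continue matching
     rc >  0  -> backtrack
     rc <  0  -> abandon the match; rc is stored in *match_result so the
                 match function can return it to its caller. */
static enum callout_action classify_callout_result(int rc, int *match_result)
{
  if (rc == 0) return ACTION_CONTINUE;
  if (rc > 0) return ACTION_BACKTRACK;
  if (match_result != NULL) *match_result = rc;
  return ACTION_ABANDON;
}
```

In generated code this would probably be two compare-and-jump instructions 
after the call returns, which is why this part looks cheap to support.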

So most of the members are easy to support, except:

offset_vector - the current offsets are not stored in the offset vector. They 
are stored on the stack, and they are character pointers (an offset of -1 is 
represented by subject_start - 1). Converting them back takes a lot of time. 
Furthermore, in the "optimized" case the start offset is updated as soon as we 
enter a capturing block, so the value pair may be inconsistent. This 
optimization is disabled if a particular offset pair is referenced by a 
backreference or a conditional block. In this "unoptimized" case we use an 
extra temporary value to store the offset after we enter a capturing block. 
Callouts could disable this optimization entirely, so we can sacrifice some 
performance to make callouts more consistent.
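
As a rough illustration of the conversion cost, the sketch below turns saved 
character pointers back into integer offsets. It assumes 8-bit characters 
(plain char instead of PCRE_SPTR) and a flat array of saved pointers; the 
function name and the array layout are hypothetical and do not reflect the 
actual JIT stack layout.

```c
#include <stddef.h>

/* Converts `count` saved character pointers into integer offsets.
   An unset capture is stored as (subject - 1), so the pointer
   difference yields -1 for it without any special case.
   Assumption: every saved pointer lies in the same buffer as `subject`. */
static void pointers_to_offsets(const char *subject,
                                const char *const *saved,
                                size_t count,
                                int *ovector)
{
  size_t i;
  for (i = 0; i < count; i++)
    ovector[i] = (int)(saved[i] - subject);
}
```

The loop itself is trivial, but it would have to run over every offset before 
each callout, which is where the time goes.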

capture_top, capture_last - these are not stored by the JIT compiler. The 
capture_top value is calculated when a successful match is found, by searching 
backwards from the last offset for the last offset that is not -1. Once again: 
JIT stores the offsets on the stack, and it always has enough space to store 
all of them (unlike the interpreter when a limited ovector is passed). After 
the match is finished, the necessary values are converted to offsets and 
copied back to the ovector. The capture_last value is not maintained at all.
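
The backwards search for capture_top could look roughly like the sketch below. 
It works on already converted integer offsets, and it assumes that capture_top 
is one more than the highest-numbered group that currently has a value (and 1 
when no group is set). The function name is hypothetical, and the real JIT code 
scans its own stack slots rather than an int array.

```c
/* Returns one more than the highest-numbered capture group whose end
   offset is set, or 1 if no group is set.
   `offsets` holds `pair_count` (start, end) pairs; pair 0 is the whole
   match and is skipped by the search. */
static int compute_capture_top(const int *offsets, int pair_count)
{
  int i;
  for (i = pair_count - 1; i >= 1; i--)
    if (offsets[2 * i + 1] != -1)
      return i + 1;
  return 1;
}
```

Doing this before every callout costs time proportional to the number of 
capturing groups, in addition to the offset conversion itself.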

The question is what to do. Is it worth implementing a restricted callout 
mechanism (where some members are set to an invalid value)? What should we do 
with the ovector? And a theoretical question: is JIT worth it when we call 
expensive C functions?

Any feedback is welcome.

Thanks,
Zoltan


