[Jprogramming] Matching on balanced delimiters

Raul Miller Sat, 20 Feb 2021 10:46:24 -0800

Regular expressions are handy things for matching simple patterns in text.

In current J versions, you can get an overview this way:
   'rx' names_z_ ''


and, in jqt:
   open '~system/main/regex.ijs'

However, there's one thing that regular expressions are notoriously
bad at handling: balanced delimiters.

The theory behind regular expressions begins with an intentional
simplification which avoids counting. And, without counting, you
cannot track nesting depth.

Meanwhile, J has an even simpler approach for tracking nesting depth.
For example:

   +/\ (=&'(' - =&')') '(a (nesting (depth (example))))...'
1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 3 2 1 0 0 0 0

This has a few minor problems, though, and it might be useful to work
through them.

One issue is that the closing delimiter is marked at a lower depth
than its matching opening delimiter. But we can shift its "depth
marker" to the first character that should be affected, the one just
after the closing delimiter:

   +/\ (=&'(' - _1 |.!.0 =&')') '(a (nesting (depth (example))))...'
1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 3 2 1 0 0 0

(This is a general issue with positions associated with matching --
when dealing with the end of the match, we need to include the length
of the thing we matched on.)
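
As a minimal sketch of what the shifted depth vector buys us, here it
is used as a selection mask on a shorter string (s and d are names
made up for this example):

   s=. 'a(b(c)d)e'
   d=. +/\ (=&'(' - _1 |.!.0 =&')') s
   d
0 1 1 2 2 2 1 1 0
   (0 < d) # s
(b(c)d)
   (1 < d) # s
(c)

Both delimiters of a pair now carry the same depth as the text
between them, so a simple comparison selects a whole group.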

Another issue has to do with comments and quoted strings and escape
characters. Generally, when dealing with matching delimiters, we are
dealing with code which can have these things.

A full solution to this issue might be to rig up a state machine that
classifies characters by whether they are syntactically significant.
I'll leave that for later (partially because of an inefficiency in the
;: dyad when used for that purpose, which annoys me). Or, maybe we
tokenize everything and work with tokens rather than characters (if we
do this, and we're editing files, we need to make sure we keep every
bit of whitespace as tokens instead of discarding it, otherwise we
lose the original layout when we write the file back).

But let's say that '#' introduces comments (to end of line), that '"'
quotes strings, and that we can look for these in a strict order (no
escapes, comment characters do not appear in strings, strings do not
span multiple lines, and every line of the text we are dealing with
is properly terminated by a newline).

Then, we can go like this:
   assert LF={:text
   comments=. (text=LF) < ; <@([: +./\ =&'#');.2 text
   strings=. (text='"') +. ; <@([: ~:/\ =&'"');.2 text
   ham=. -. strings+.comments
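
To see what these masks do, here is a small sketch on a made-up
two-line text (not from any real config file) where one brace sits
inside a string and another inside a comment:

   text=. 'a="x{" # {', LF, 'b={c}', LF
   comments=. (text=LF) < ; <@([: +./\ =&'#');.2 text
   strings=. (text='"') +. ; <@([: ~:/\ =&'"');.2 text
   ham=. -. strings+.comments
   I. text='{'
4 9 13
   I. ham * text='{'
13

Only the brace at position 13, on the second line, survives the mask.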

Now, let's say we're looking for balanced curly braces:

   open=. ham * text='{'
   close=. _1 |.!.0 ham * text='}'
   depth=. +/\ open-close

We might want to throw in an assertion that our delimiters are balanced.

   assert 0={:depth
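
That assertion only looks at the final depth. As a hedged sketch of a
stricter check (raw and ok are hypothetical helper names, and the ham
mask is left out for brevity), we could also require that the
unshifted running count never goes negative:

   raw=. [: +/\ =&'{' - =&'}'
   ok=. (0 = {:) *. 0 <: <./
   ok raw '{a{b}}'
1
   ok raw '}{'
0
   ok raw '{{}'
0

The second case ends at zero but closes before it opens, which the
final-depth test alone would accept.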

So far, so good... but how do we use this kind of thing?  Let's say
that we have a config file with name=value pairs where the values are
either quoted strings or curly brace delimited name value pairs (all
separated by whitespace). And, let's say we're looking for the text
that follows a specific name, and let's say we're going to manipulate
that text -- adding in some new value.  Maybe we got the positions of
those names like this:

   namepos=. I. ' name = ' E. text

First off, since we're going to manipulate the text, and we're using
positions, we want to work backward from the end of the file, so that
an edit never shifts a position we have yet to visit. So we need to
patch that up:

   namepos=. |. I. ' name = ' E. text

Or, if we had a modularity barrier in the wrong place for that:

   namepos=. \:~ namepos
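
A small sketch of why the order matters, using a made-up string and
hand-written edits: replacing each 'X' with 'YY', later position
first, leaves the earlier position usable.

   s=. 'aXbXc'
   I. s = 'X'
1 3
   s=. (3 {. s), 'YY', 4 }. s     NB. edit the later position first
   s
aXbYYc
   s=. (1 {. s), 'YY', 2 }. s     NB. position 1 still points at an X
   s
aYYbYYc

Going the other way, the first edit would lengthen the text and
position 3 would no longer point at an 'X'.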

Next, I think we need a loop. And that gets into ancient traditions,
like hoisting some stuff out of the loop.  (This might not be
necessary if our problem is constrained enough.)

For the approach I'm going to use here, we need to know how deeply
nested our names are:

   nmdepth=. ~. namepos { depth

And, I want to know where the corresponding balanced brace pairs end:

   brends=. (nmdepth +/ 1 0) <@I.@E."1 depth
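
As a sketch of what that expression produces, here it is on a made-up
text with the levels 0 and 1 picked by hand (t, d and ends are names
invented for this example):

   t=. '{ {b} {c} } {d} '
   d=. +/\ (=&'{' - _1 |.!.0 =&'}') t
   d
1 1 2 2 2 1 2 2 2 1 1 0 1 1 1 0
   ends=. (0 1 +/ 1 0) <@I.@E."1 d
   0 {:: ends
10 14
   1 {:: ends
4 8

Each box holds the positions of the closing braces that drop back to
the corresponding level. Note that this relies on at least one
character between a closing brace and the next opening brace (as in
the whitespace-separated config described above): in '}{' the shifted
decrement and the increment cancel, and the drop in depth is not
visible.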

Now I can do my explicit loop (needs to be in an explicit definition,
of course):

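   for_name. namepos do.
     lev=. name { depth
     NB. first brace end at or after this name, at this name's level
     end=. name (I.~ { ]) (nmdepth i. lev) {:: brends
     NB. text from the name up to (not including) the closing brace
     section=. name }. end {. text
     NB. first pass: keep one sample in a global for inspection, then stop
     echo SAMPLE=: section return.
     NB. do stuff here instead of echo; newsection is the edited section
     text=. text rplc section;newsection
   end.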
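
Putting the pieces together, here is a hedged end-to-end sketch. It
assumes J 9.02 or later for the {{ }} definition syntax; the verb name
sections and the sample text cfg are made up for this example, and
instead of editing the text it just returns the section found for
each name, last one first:

   sections=. {{
     text=. y
     assert LF = {: text
     comments=. (text=LF) < ; <@([: +./\ =&'#');.2 text
     strings=. (text='"') +. ; <@([: ~:/\ =&'"');.2 text
     ham=. -. strings +. comments
     open=. ham * text = '{'
     close=. _1 |.!.0 ham * text = '}'
     depth=. +/\ open - close
     assert 0 = {: depth
     namepos=. |. I. ' name = ' E. text
     nmdepth=. ~. namepos { depth
     brends=. (nmdepth +/ 1 0) <@I.@E."1 depth
     r=. ''
     for_name. namepos do.
       lev=. name { depth
       end=. name (I.~ { ]) (nmdepth i. lev) {:: brends
       r=. r , < name }. end {. text    NB. collect instead of editing
     end.
     r
   }}
   cfg=. '{ name = { a = "1" } x = { name = { b = "2" } } }', LF
   > sections cfg
 name = { b = "2" 
 name = { a = "1" 

Each row stops just before the closing brace of that name's value,
which is where a new name=value pair could be appended.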

The key takeaway here is that you can use dyadic I. to locate, in an
ascending list of positions, the first one at or after some other
position. And, the idiom (I.~ { ]) extracts that following position.
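
A minimal sketch of that idiom on a hand-written ascending list:

   5 (I.~ { ]) 4 8 12
8
   8 (I.~ { ]) 4 8 12
8
   4 8 12 I. 13
3

In the last case nothing follows 13, so I. returns the length of the
list, and using that result with { would signal an index error. A
name with no brace end after it needs to be handled separately.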

The other thing is more about testing your work. Here, before trying
to modify anything, I would want to see an example of what I'm working
with, so that I could fix any problems that might have cropped up, and
so that I can focus on the details I want to focus on.

Once I have my sample, I can work with it to make sure that I am doing
what I think I want to be doing with it. Once I have that, I could
finish up the code. (And, this is why I wanted a loop there. At least
for the first version of the code.)

Anyways... this is something that I had done quite often in the past,
and I find reaching for this kind of code to be easier than trying to
use regular expressions when manipulating those kinds of config files.

So, maybe someone else might benefit from this. But I've already gone
plenty deep for one email message, I think.

Take care,

-- 
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
