What I'm trying to do is to test the result of an external conversion process.

BaseX's implementation of full text makes it easy and fast to get a count of 
words; the next part of the problem is locating the "of" responsible for before 
having 1182 instances and after having 1181 instances. My idea was to try to 
find the sequence position numbers which don't have matches and to use that to 
get the words-before and words-after. (Doing a relate-elements-before-and-after 
word-for-word compare has been judged impractical; there's too much change in 
the "this is section 2" identifiers for doing that to be less effort that doing 
the conversion in-house which has already been decided against.)

Stuffing the full text into tan:diff() might be a better approach; I will 
certainly look into it.

Thank you!
Graydon

On Thu, Feb 12, 2026, at 10:17, Joel Kalvesmaki wrote:
> Another option is to string-join the sequences via a unique character, 
> then do a straightforward string diff. The XSLT function tan:diff() is 
> efficient and results in high quality. You would then need to do some 
> post-processing on the results to get your integer pairs. But presumably 
> you're getting those integer pairs not as an end in itself but as a 
> means to some other task, and the output of tan:diff() may get you there 
> quicker. You might also be able to skip what I presume is a preprocess 
> of turning two strings into two sequences of tokenized strings.
> 
> tan:diff() is written in XSLT, not XQuery, but you should be able to use 
> fn:transform().
> 
> Code:
> https://textalign.net/
> 
> Background:
> https://www.balisage.net/Proceedings/vol26/html/Kalvesmaki01/BalisageVol26-Kalvesmaki01.html
> 
> Best wishes,
> 
> Joel
> 
> On 2026-02-12 07:21, Graydon Saunders via BaseX-Talk wrote:
> > Thank you! I can foresee some brain stretching in my future.
> > 
> > And yes, just two sequences of text, and what should be very similar
> > text. (I'm trying to write tests for a conversion process.)
> > 
> > -- Graydon
> > 
> > On Thu, Feb 12, 2026, at 07:12, David Birnbaum wrote:
> > 
> >> With just two sequences you can use Needleman-Wunsch. It’s a
> >> dynamic programming algorithm that provides an optimal alignment
> >> (good thing, although there may be more than one optimal alignment),
> >> but it doesn’t scale well (not good thing). I describe an XSLT 3.0
> >> implementation in my 2020 XMLPrague paper at
> >> 
> > https://archive.xmlprague.cz/2020/files/xmlprague-2020-proceedings.pdf
> >> 
> >> Your question doesn’t clarify whether you’re looking for index
> >> numbers in the alignment (where a word in one input might be matched
> >> by a gap in the other) or in the inputs (where aligned words share a
> >> position in the alignment but may have different positions in the
> >> inputs). For either of those interpretations, though, a solution
> >> will begin by finding an alignment.
> >> 
> >> David J. Birnbaum
> >> [email protected]
> >> 
> >>> On Feb 11, 2026, at 9:41 PM, Graydon Saunders
> >>> <[email protected]> wrote:
> >> 
> >>> 
> >>> Hello!
> >>> 
> >>> If I have two (fairly long) sequences of text, ('The', 'words',
> >>> 'are', 'sequence', 'members') and I want all the index numbers of
> >>> matching pairs despite the sequences only mostly matching (so a
> >>> word, or several words, can be missing from sequence A or sequence
> >>> B), is there an established algorithm for doing this?
> >>> 
> >>> (If I search on "aligning sequences" I get bioinformatics about
> >>> gene sequences; if I search on "aligning text" I get typography.)
> >>> 
> >>> Thanks!
> >>> Graydon
> 
> -- 
> Joel Kalvesmaki
> Director, Text Alignment Network
> http://textalign.net
> 

Reply via email to