[basex-talk] Re: aligning sequences of text?

Joel Kalvesmaki via BaseX-Talk Thu, 12 Feb 2026 07:18:41 -0800

Another option is to string-join the sequences via a unique character, then do a straightforward string diff. The XSLT function tan:diff() is efficient and results in high quality. You would then need to do some post-processing on the results to get your integer pairs. But presumably you're getting those integer pairs not as an end in itself but as a means to some other task, and the output of tan:diff() may get you there quicker. You might also be able to skip what I presume is a preprocess of turning two strings into two sequences of tokenized strings.

tan:diff() is written in XSLT, not XQuery, but you should be able to use fn:transform().
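
To make the join-then-diff idea concrete, here is a rough sketch in Python. It does not use or reproduce tan:diff(); the standard library's difflib.SequenceMatcher stands in for the string diff, and its results will not necessarily agree with tan:diff(). The separator character, the function names, and the 1-based numbering (chosen to match XPath positions) are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Assumed separator: any character guaranteed never to occur inside a token.
SEP = "\u241f"


def token_spans(tokens):
    """Return the (start, end) character offsets of each token in SEP.join(tokens)."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok) + len(SEP)
    return spans


def matching_pairs(a_tokens, b_tokens):
    """Return 1-based (i, j) pairs for tokens that the character diff matches whole."""
    a_str, b_str = SEP.join(a_tokens), SEP.join(b_tokens)
    # Look up a token of B by the character offset at which it starts.
    b_by_start = {s: (j, e) for j, (s, e) in enumerate(token_spans(b_tokens))}
    matcher = SequenceMatcher(None, a_str, b_str, autojunk=False)
    blocks = matcher.get_matching_blocks()
    pairs = []
    for i, (a_start, a_end) in enumerate(token_spans(a_tokens)):
        for blk in blocks:
            # Post-processing step: the matched block must cover the whole token in A ...
            if blk.a <= a_start and a_end <= blk.a + blk.size:
                b_start = a_start - blk.a + blk.b
                hit = b_by_start.get(b_start)
                # ... and line up with a token of the same length in B.
                if hit is not None and hit[1] - b_start == a_end - a_start:
                    pairs.append((i + 1, hit[0] + 1))
                break
    return pairs


a = ["The", "words", "are", "sequence", "members"]
b = ["The", "words", "sequence", "members", "too"]
print(matching_pairs(a, b))
```

A character-level diff can match fragments of words, which is why the post-processing only accepts tokens matched in full on both sides. The sketch assumes non-empty tokens.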


Code:
https://textalign.net/

Background:
https://www.balisage.net/Proceedings/vol26/html/Kalvesmaki01/BalisageVol26-Kalvesmaki01.html

Best wishes,

Joel

On 2026-02-12 07:21, Graydon Saunders via BaseX-Talk wrote:

Thank you! I can foresee some brain stretching in my future.

And yes, just two sequences of text, and what should be very similar
text. (I'm trying to write tests for a conversion process.)

-- Graydon

On Thu, Feb 12, 2026, at 07:12, David Birnbaum wrote:

With just two sequences you can use Needleman-Wunsch. It’s a
dynamic programming algorithm that provides an optimal alignment
(good thing, although there may be more than one optimal alignment),
but it doesn’t scale well (not good thing). I describe an XSLT 3.0
implementation in my 2020 XMLPrague paper at

https://archive.xmlprague.cz/2020/files/xmlprague-2020-proceedings.pdf


Your question doesn’t clarify whether you’re looking for index
numbers in the alignment (where a word in one input might be matched
by a gap in the other) or in the inputs (where aligned words share a
position in the alignment but may have different positions in the
inputs). For either of those interpretations, though, a solution
will begin by finding an alignment.
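
As an illustration only, a minimal Needleman-Wunsch sketch in Python for two token sequences follows. It is not the XSLT 3.0 implementation from the paper. The scoring values (match +1, mismatch -1, gap -1), the function name, and the choice to report 1-based positions in the inputs are assumptions. When several optimal alignments exist, the traceback order below simply picks one of them.

```python
def needleman_wunsch_pairs(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align token lists a and b.

    Returns 1-based (i, j) pairs of positions in the inputs whose tokens
    are equal and aligned with each other.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] faces a gap in b
                score[i][j - 1] + gap,       # b[j-1] faces a gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            if a[i - 1] == b[j - 1]:
                pairs.append((i, j))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


a = ["The", "words", "are", "sequence", "members"]
b = ["The", "words", "sequence", "members"]
print(needleman_wunsch_pairs(a, b))  # [(1, 1), (2, 2), (4, 3), (5, 4)]
```

The score table has (len(a) + 1) * (len(b) + 1) cells, so time and memory grow with the product of the two lengths, which is the scaling problem mentioned above.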

David J. Birnbaum

On Feb 11, 2026, at 9:41 PM, Graydon Saunders wrote:


Hello!

If I have two (fairly long) sequences of text, ('The', 'words',
'are', 'sequence', 'members') and I want all the index numbers of
matching pairs despite the sequences only mostly matching (so a
word, or several words, can be missing from sequence A or sequence
B), is there an established algorithm for doing this?

(If I search on "aligning sequences" I get bioinformatics about
gene sequences; if I search on "aligning text" I get typography.)

Thanks!
Graydon


--
Joel Kalvesmaki
Director, Text Alignment Network
http://textalign.net
