Hi Randall,

Thanks for the feedback, and apologies for not replying sooner.

> With what you know from this implementation and the research into
> algorithms you've done for it, I'm wondering if you have any early
> intuitions about API design questions I've been pondering.

In terms of the API design, I have learned a few things based on my prototyping 
so far:

- Even with very efficient and well-optimised algorithms, the process of 
anchoring hundreds of quotes in a document with 100s of KB of text can take 
100s of milliseconds (or even seconds if the document is a large book), so the 
API should indeed be asynchronous.

- There are opportunities for efficiency gains by anchoring a batch of 
selectors at once, rather than one at a time. One easy win is to avoid building 
up the mapping between DOM nodes and text offsets multiple times but instead 
built it only once and re-use it while searching for multiple quotes. Searching 
for multiple patterns at a time may also offer speed improvements, although it 
depends on the error rate allowed (proportion of mismatch between the quote and 
the text). The current code on GitHub builds up the DOM => text offset mapping 
once and then searches for each quote one at a time. I have locally done some 
experiments with searching for multiple patterns at once and was able to obtain 
a speedup of 30-40%, but this involved some careful tuning of various 
parameters for the particular dataset. It isn’t obvious yet whether this can be 
made robust enough to deliver a benefit when the text/selector set is not known 
in advance.

- I would like to avoid hanging the browser UI during the anchoring process, 
therefore any step which can take more than say ~100ms would ideally be either 
incremental or possible to do in a worker. As long as the API is async, they 
don’t have to care which of these methods is used. Building the DOM => text 
offset mapping has to be done on the main thread. Once that is done, the actual 
pattern-matching step can be done in a worker.

- Since the process can take > 500ms, cancellation might be well useful, 
although I haven’t implemented this yet.

- An important feature of the prototypes is that they allow for customisable 
normalisation of the text and quotes before matching, together with mapping the 
matches in the normalised text back to locations in the original text. 
Normalization can include case-folding, ignoring certain characters etc. The 
motivation for this is primarily to help with anchoring annotations following a 
forthcoming PDF.js upgrade in the Hypothesis client / Via / browser extension 
which will significantly affect the way text is extracted for some PDFs.

> I'd love to find some alignment that allows you to bring this work into the
> tree and invite you to contribute to directly.

I’d like to move the code to a shared space in future. So far I’ve been focused 
on prototyping to learn about the problem space and figuring out how to meet 
Hypothesis’s product needs. There is still quite a bit of that to do. I will 
try to familiarise myself with the work that has already been done in the 
Apache Annotator repo to understand how this work might fit into what is there 
so far.

> I'd also consider adopting typescript, if that is appealing to the group.
> That's probably a very easy change to our babel setup.

I do think the group should consider this. I’ve found that types offer 
significant benefits for readability and maintainability of libraries, and it 
benefits consumers in terms of providing machine-readable interfaces (for 
consumers using TypeScript or Flow, code completion in IDEs etc., or even just 
humans reading the source directly)

Kind Regards,
Robert.


> On 14 Jun 2019, at 14:29, Randall Leeds <rand...@apache.org> wrote:
> 
> This is great stuff, Robert. Thanks for sharing.
> 
> On Tue, May 28, 2019, 06:10 Robert Knight <robertkni...@hypothes.is> wrote:
> 
>> 
>> At the moment this is still in the experimentation phase, but if it pans
>> out well, I think they could be a good fit for the Apache Annotator
>> project. The string matching implementation is fairly solid and offers
>> significant performance improvements over diff-match-patch (see the README
>> of the anchor-quote project for some early benchmarks). The quote anchoring
>> library is very much in early development.
> 
> 
> With what you know from this implementation and the research into
> algorithms you've done for it, I'm wondering if you have any early
> intuitions about API design questions I've been pondering.
> 
> So far, I've been ensuring that the selector API is incremental and
> asynchronous. I've also made sure to support returning multiple matches.
> For these reasons the selectors are asynchronous generators.
> 
> I have not explicitly addressed sharing state between selectors. Some
> algorithms may benefit from doing a parallel search for multiple patterns
> in a single pass. I noticed your API accepts multiple selectors, although
> it seems to treat them independently. I was wondering what considerations
> you had in mind when you made that API design decision.
> 
> The API on master here currently separates the parse operation that takes a
> selector string/object, returning a scoped match function. This provides an
> opportunity to close over shared state held by the parser/constructor.
> 
> The parse step does not initially scope the selector to any context. It's
> not possible to know, until the match function is invoked, that multiple
> selectors are applicable to the same scope. I suspect this concern can be
> lifted by binding the parser to a root scope. Therefore, I don't think
> there's any action here, but I figured I'd mention it.
> 
> Another thing which I have considered and, again, so far made no explicit
> affordance for, is threading a deadline or abort controller through the
> call stack so that match operations can be cancelable.
> 
> If you have any strong reactions to these thoughts, I'd be curious to hear
> them.
> 
> I'd love to find some alignment that allows you to bring this work into the
> tree and invite you to contribute to directly.
> 
> I'd also consider adopting typescript, if that is appealing to the group.
> That's probably a very easy change to our babel setup.

Reply via email to