Re: DOM Iteration (was Re: Just a simple example?)

Sasha Goodman Fri, 12 May 2017 12:20:29 -0700

I would be delighted if my efforts were useful in this project!!!
Regarding that code, if any parts are used it would make my week. The class
structure is sorta self-documented by the standard, and combined with
builders, the classes can accommodate a variety of motives.


Highlighting is the most common motive now (correct me if I'm wrong). My
gut feeling is that to get the support and time of hardcore annotators,
the code needs to accommodate the idiosyncrasies of highlighting first. For
example, if there are thousands of highlights on a page, an annotation
builder might iterate/walk the document just once and fill in the thousands
of highlights in one pass. Also, a highlighting app would probably need to
modify the source document by inserting spans and such.
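
As a rough illustration of the one-pass idea, here is a minimal sketch. It
assumes each highlight is a pair of character offsets into the document's
concatenated text (in the spirit of a TextPositionSelector) and that the text
nodes have already been read out, in document order, as an array of strings.
The name `planHighlights` and the shape of the returned segments are made up
for this example; none of this is existing library code.

```javascript
// texts: the strings of the document's text nodes, in document order.
// highlights: [{ id, start, end }] with offsets into the concatenated text.
// Returns one segment per (highlight, text node) overlap.
function planHighlights(texts, highlights) {
  // Sort a copy by start offset so highlights can be activated in order.
  const pending = [...highlights].sort((a, b) => a.start - b.start);
  const segments = [];
  let active = [];
  let next = 0;
  let nodeStart = 0;

  texts.forEach((text, nodeIndex) => {
    const nodeEnd = nodeStart + text.length;

    // Activate highlights that begin before this node ends.
    while (next < pending.length && pending[next].start < nodeEnd) {
      active.push(pending[next++]);
    }
    // Retire highlights that ended at or before this node's start.
    active = active.filter((h) => h.end > nodeStart);

    for (const h of active) {
      const from = Math.max(h.start, nodeStart) - nodeStart;
      const to = Math.min(h.end, nodeEnd) - nodeStart;
      // Skip empty slices (empty text nodes, zero-length highlights).
      if (from < to) segments.push({ id: h.id, nodeIndex, from, to });
    }
    nodeStart = nodeEnd;
  });

  return segments;
}
```

Each segment says which slice of which text node belongs to which highlight,
so a builder could split and wrap nodes afterwards without walking the
document a second time.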

If Randall needs familiar code for node iteration, tree walking, range
splitting, string similarity and normalization, that's cool! Custom code,
*especially* Polyfill type implementations, could smooth over browser
idiosyncrasies. Also, I saw a jsPerf microbenchmark that put custom
walkers on par with the native browser-based ones.

On a personal note, I do archival work and did not initially see the value
in modifying the source document by inserting spans (however, a highlight
app would need that). The main reason I'm excited about annotation is its
value for labeling data for text analysis and machine learning. A lot of
the advancements in machine learning are because of large bodies of data
that have been tagged. The most common examples are usually of images that
have regions selected and then labeled, but annotation could also help turn
semi-structured text into more structured text data (e.g. for labeling
parts of government documents). For archival work on mostly static
documents, there does not seem to be a need to modify the source document. On
the other hand, for dynamically changing documents, inserting spans with
unique IDs seems appropriate because it's more robust to document changes.
Yet, it is also vulnerable to turf battles with other extensions and the
page's own JavaScript, so I hope it's not a requirement of the Apache
library but rather a feature.


On Thu, May 11, 2017 at 1:43 PM Benjamin Young <[email protected]>
wrote:

> Exciting to see this conversation happening. ^_^
>
>
> Randall, how feasible would it be to bring (soon) your libraries (even via
> copy/paste) into the Apache Annotator repo. I believe (according to GitHub)
> you're author/owner of 90%+ of the code in them, and (consequently) able to
> do that if you believe that's the right step.
>
>
> Sasha, your classes modeled around the selector and a "builder" sound
> very similar to the hopes I wrote up in
> https://cwiki.apache.org/confluence/display/ANNO/Planning
>
>
> I'd very much like to combine these efforts in some way.
>
>
> Additionally--and the thing driving me personally at the moment--I have to
> present on Apache Annotator next Wednesday!
>
> https://apachecon2017.sched.com/event/AbBW
>
>
> Consequently, I'd very much love it if we (collectively) could build a
> demo together! There's plenty to talk about wrt annotation, community
> building, Web Annotation Data Model & Protocol, as well as why (those of us
> that are here at least) have chosen to start collaborating at the ASF.
>
>
> At any rate, I plan to be coding on all the things leading up to
> Wednesday, so I'd be most grateful for any help, input, pointers, and code
> (hehe) that anyone wants to toss in ahead of my codez. Let's code together!
>
>
> Thanks, all!
>
> Benjamin
>
> --
>
> http://bigbluehat.com/
>
> http://linkedin.com/in/benjaminyoung
>
> ________________________________
> From: Randall Leeds <[email protected]>
> Sent: Thursday, May 11, 2017 3:34:24 PM
> To: [email protected]
> Subject: DOM Iteration (was Re: Just a simple example?)
>
> Great to see you here, Sasha!
>
> On Wed, May 10, 2017 at 5:39 PM Sasha Goodman <[email protected]>
> wrote:
>
> >
> > P.S. This afternoon I streamlined the TextQuoteSelector and
> > TextPositionSelector to work (in principle) consistently with Randall
> > Leeds's implementation that used NodeIterator and textContent.
> >
> >
> Neat :).
>
> I think my takeaway from the simple example thread, and something of which
> many of us were likely already well aware, is that there's a desire for a
> good highlighter implementation. A way to highlight text is often the first
> example people want to see.
>
> While I hope to see experimentation with implementations that try to limit
> the impact on the DOM, I think <mark> or <span> wrapping of text nodes is
> still the easiest to understand. In this approach, the actual wrapping is
> easy. The difficult part is iteration.
>
> Now, some quick background on node iteration.
>
> I chose to use NodeIterator rather than TreeWalker for my dom-seek library
> because it meant that the seek function could be stateless, support seeking
> forward and backward, and still be able to return the number of characters
> consumed by a seek. The desire to know whether to include the current
> node's content in the seek count is fulfilled by NodeIterator's
> "pointerBeforeReferenceNode". Essentially, a NodeIterator stores a point
> before or after a node, rather than simply a current node.
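
A minimal sketch of why that before/after pointer keeps a seek stateless. It
assumes `iter` is a NodeIterator created to show only text nodes, and it only
moves in whole text nodes, so it can overshoot the requested count. The name
`seekWholeNodes` is hypothetical and this is not the dom-seek implementation.

```javascript
// Move `iter` forward (count > 0) or backward (count < 0) over whole text
// nodes until at least |count| characters are passed or the nodes run out.
// Returns the signed number of characters actually passed.
function seekWholeNodes(iter, count) {
  const backward = count < 0;
  const target = Math.abs(count);
  let consumed = 0;

  while (consumed < target) {
    // nextNode()/previousNode() return the node the pointer moves across,
    // or null at either end. The iterator itself remembers whether it sits
    // before or after its reference node, so no extra state is kept here.
    const node = backward ? iter.previousNode() : iter.nextNode();
    if (node === null) break;
    consumed += node.data.length;
  }

  return backward ? -consumed : consumed;
}
```

Because the iterator records which side of the reference node it is on, a
forward seek followed by a backward seek crosses the same node again and
reports the same length with the opposite sign.
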
>
> However, using NodeIterator to traverse a Range is not really great. Since
> its referenceNode is read-only, the best that can be done is to start with
> the commonAncestorContainer of the Range. Range has comparePoint,
> isPointInRange, and intersectsNode. I have no idea how expensive these are.
> Iterating all the nodes under the commonAncestorContainer doesn't feel
> great to begin with. TreeWalker might be more appropriate since its
> currentNode could be set to startContainer directly. TreeWalker also
> appears to have consistent platform support.
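
A minimal sketch of starting at startContainer rather than walking everything
under commonAncestorContainer. To stay small it makes two assumptions: both
Range boundaries sit in text nodes, and the walk is a hand-rolled
document-order step (using only firstChild, nextSibling, parentNode and
nodeType) instead of a TreeWalker. `nextInDocumentOrder` is a hypothetical
helper name; `textNodes` borrows the name used in the highlighter example.

```javascript
const TEXT_NODE = 3; // same value as Node.TEXT_NODE

// Step to the next node in document (pre-order) order, or null at the end.
function nextInDocumentOrder(node) {
  if (node.firstChild) return node.firstChild;
  while (node) {
    if (node.nextSibling) return node.nextSibling;
    node = node.parentNode;
  }
  return null;
}

// Yield every text node from range.startContainer through
// range.endContainer, inclusive. Assumes both boundaries are text nodes.
function* textNodes(range) {
  let node = range.startContainer;
  while (node) {
    if (node.nodeType === TEXT_NODE) yield node;
    if (node === range.endContainer) return;
    node = nextInDocumentOrder(node);
  }
}
```

The boundary text nodes come back whole even when the Range starts or ends
partway through them, which is why splitting is still needed before wrapping.
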
>
> All of this is complicated by the Range being able to point to offsets
> within text nodes. For the purposes of highlighting with wrapper elements
> it's necessary to split the boundary nodes. I think there are probably a
> number of libraries for this, but I propose we write one under our repo.
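
A minimal sketch of splitting the boundary nodes, built on the standard
`Text.splitText(offset)`. The function name `splitBoundaries` is hypothetical.
Browsers adjust live Ranges when a text node is split, but the explicit
setStart/setEnd calls below mean the result does not rely on that.

```javascript
const TEXT_NODE = 3; // same value as Node.TEXT_NODE

// Split the text nodes at the Range's boundaries so the Range starts and
// ends exactly on text node edges. Mutates and returns the Range.
function splitBoundaries(range) {
  const { startContainer, startOffset, endContainer, endOffset } = range;

  // Split the end first, so a start offset in the same node stays valid.
  if (endContainer.nodeType === TEXT_NODE &&
      endOffset > 0 && endOffset < endContainer.length) {
    endContainer.splitText(endOffset);
    range.setEnd(endContainer, endOffset);
  }

  if (startContainer.nodeType === TEXT_NODE &&
      startOffset > 0 && startOffset < startContainer.length) {
    // `tail` holds everything from startOffset onward.
    const tail = startContainer.splitText(startOffset);
    if (endContainer === startContainer) {
      // The selected text now lives entirely in `tail`.
      range.setEnd(tail, endOffset - startOffset);
    }
    range.setStart(tail, 0);
  }

  return range;
}
```

After this, every text node between the boundaries is either fully inside or
fully outside the Range, so each one can be wrapped as a whole.
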
>
> We might also find that normalizing the endpoints of a Range in some
> fashion is a helpful prerequisite. There is a library I found that does
> this, but I found its algorithm terribly confusing. I put time into
> rewriting it without dependencies. Despite some initial excitement, the
> author never fully vetted and accepted my pull request:
> https://github.com/webmodules/range-normalize/pull/2
>
> In conclusion, I think there'd be value in bringing some functional
> utilities into Apache Annotator for dealing with iteration, range
> splitting, and range normalization, with the goal of providing a very
> succinct and simple highlighter that looks like this:
>
> ```
> for (const node of textNodes(range)) {
>   const mark = document.createElement('mark');
>   node.replaceWith(mark);
>   mark.appendChild(node);
> }
> ```
>
> Some care needs to be taken that whatever iteration we use is not
> invalidated by the replacement of the text node with its wrapper.
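
One simple way to sidestep that invalidation, sketched under assumptions: take
a snapshot of the nodes before touching the DOM. The function name
`highlight` and its `doc` parameter are hypothetical (in a browser, `doc`
would be `document`), and `nodes` can be any iterable of text nodes, such as
the result of the `textNodes(range)` call in the example.

```javascript
// Wrap each text node in a <mark> element and return the wrappers.
function highlight(nodes, doc) {
  // Snapshot first: wrapping moves each text node, which could otherwise
  // confuse a live walker or a lazy generator that is still mid-iteration.
  const targets = Array.from(nodes);

  return targets.map((node) => {
    const mark = doc.createElement('mark');
    node.replaceWith(mark); // put the wrapper where the text node was
    mark.appendChild(node); // then move the text node inside it
    return mark;
  });
}
```

Returning the wrappers also gives the caller something to hold on to for
removing the highlight later. The cost is one array holding every node in the
range.
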
>
> The fact that a simple example like this is hard to produce is evidence of
> the underlying complexity described in the above paragraphs. When I see
> people wanting a simple highlighter what I hear is that they actually need
> simple abstractions upon which to build a highlighter. The highlighter
> itself should be easy. Often, highlighters that projects provide are not
> shipped standalone or don't do exactly what the author needs (use spans
> instead of marks, add a particular class, coalesce overlapping highlights
> or not, etc). There is lots of room to do different things but being able
> to simply get the nodes to be highlighted is the prerequisite task that
> contains most of the complexity.
>
> That's all (and probably way too much) for now. Finding all the tools for
> all these things is enough of a pain that I think we should have a
> comprehensive set of such utilities in Apache Annotator, even if that risks
> looking like a bit of NIH syndrome.
>
> Unless anyone objects, I think I'll aim to ship libraries for these:
> - Node iteration (https://github.com/tilgovi/dom-node-iterator)
> - Tree walking (might not need a library if support is good)
> - Range splitting
> - Range normalization (see my pull request referenced above)
> - Range iterating
> - Text distance (https://github.com/tilgovi/dom-seek)
>
> If anyone wants to start on any of the above, you're welcome to depend on
> libraries that are outside Apache Annotator. In the case of libraries that
> I've written, there is value to bringing them into Apache Annotator because
> they are all written in ES6 but not packaged to be consumed as ES6.
> Bringing them inside our repo means better code deduplication by tree
> shaking in tools like rollup and webpack. They could be packaged as ES6
> where they are, but if I'm going to spend time improving the packaging I
> would rather just toss out the packaging and get the benefits of the
> monorepo having all that build/test boilerplate done once for all of them.
>
