On 11. Nov 2020, at 15:37, Mario Juric <[email protected]> wrote:
> 
> Would it make sense to provide CAS files based on real documents for the
> benchmarks? I mean, we could run segmenters on some OpenAccess documents, map
> the annotations to those used by the tests, and store them somewhere as XMI
> or CAS binaries. The tests could then load them during the initialization
> phase before benchmarking.

That would be an option. But I think I would rather factor out the initialization
of the CASes in the benchmark such that different initialization strategies can be
plugged in. Then I would implement a random CAS generator (sketched below) which
generates CASes according to a few rules, such as:

- partition the CAS into sentences with a length between 20 and 200, with a
  roughly bell-shaped distribution of sentence lengths
- partition the sentences into tokens between 2 and 20 chars, also with a
  bell-shaped distribution

That would be a bit more realistic for a benchmark like "find the sentence
covering the current token".
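A minimal sketch of what such a generator could look like, assuming the lengths are
character counts and assuming JCas cover classes Sentence and Token generated for a
test type system (type names, seed, and distribution parameters are illustrative,
not the actual benchmark code):

    import java.util.Random;

    import org.apache.uima.fit.factory.JCasFactory;
    import org.apache.uima.jcas.JCas;

    public class RandomCasGenerator {

        private static final Random RND = new Random(42);

        public static JCas generate(int sentenceCount) throws Exception {
            // Assumes the Sentence/Token JCas classes are picked up via type system detection
            JCas jcas = JCasFactory.createJCas();
            StringBuilder text = new StringBuilder();

            for (int s = 0; s < sentenceCount; s++) {
                int sentenceBegin = text.length();
                // Roughly bell-shaped sentence length between 20 and 200 characters
                int sentenceLength = clamp((int) (110 + RND.nextGaussian() * 30), 20, 200);

                while (text.length() - sentenceBegin < sentenceLength) {
                    int tokenBegin = text.length();
                    // Roughly bell-shaped token length between 2 and 20 characters
                    int tokenLength = clamp((int) (6 + RND.nextGaussian() * 3), 2, 20);
                    for (int i = 0; i < tokenLength; i++) {
                        text.append('x');
                    }
                    new Token(jcas, tokenBegin, text.length()).addToIndexes();
                    text.append(' ');
                }

                // Exclude the trailing space from the sentence boundary
                new Sentence(jcas, sentenceBegin, text.length() - 1).addToIndexes();
            }

            jcas.setDocumentText(text.toString());
            return jcas;
        }

        private static int clamp(int value, int min, int max) {
            return Math.max(min, Math.min(max, value));
        }
    }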

But actually, despite being a bit degenerate, I believe the current benchmark
setup could already be used to profile the code and see if / what could possibly
be improved (if anything).

Conceptually, both uimaFIT selectCovered(Token, X) and SelectFS
select(Token).coveredBy(X) work in the same way (roughly as sketched below):

1) seek to X in the annotation index using a binary search
2) linear search backwards in the index to the first match by offset
   (to ignore type priorities)
3) iterate forward until the end of the selection window is reached
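
To illustrate the three steps, here is a simplified sketch of that lookup over a
plain offset-sorted list of annotations (not the actual UIMA / uimaFIT code, which
operates on the annotation index):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    import org.apache.uima.jcas.tcas.Annotation;

    public class CoveredBySketch {

        // Offset order: by begin ascending, then by end descending
        private static final Comparator<Annotation> BY_OFFSETS =
                Comparator.comparingInt(Annotation::getBegin)
                        .thenComparing(Comparator.comparingInt(Annotation::getEnd).reversed());

        static List<Annotation> coveredBy(List<Annotation> index, Annotation x) {
            // 1) seek to X in the offset-sorted index using a binary search
            int pos = Collections.binarySearch(index, x, BY_OFFSETS);
            if (pos < 0) {
                pos = -(pos + 1);
            }

            // 2) linear search backwards to the first annotation with the same offsets as X
            //    (ignoring where type priorities would have placed X within that run)
            while (pos > 0 && BY_OFFSETS.compare(index.get(pos - 1), x) == 0) {
                pos--;
            }

            // 3) iterate forward until the end of the selection window is reached,
            //    keeping everything that lies within [X.begin, X.end]
            List<Annotation> result = new ArrayList<>();
            for (int i = pos; i < index.size() && index.get(i).getBegin() <= x.getEnd(); i++) {
                Annotation candidate = index.get(i);
                if (candidate != x && candidate.getEnd() <= x.getEnd()) {
                    result.add(candidate);
                }
            }
            return result;
        }
    }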

The main difference is that selectCovered already collects all annotations it
finds into a list in step 3, while SelectFS allows streaming over the annotations -
although people will probably use SelectFS with .asArray() or .asList(), which
again makes it equivalent to the uimaFIT approach. That is why I put a .forEach()
at the end of the selection in the benchmark, to make the two a bit more comparable.
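
For reference, the two call styles side by side (again only a hedged sketch;
Sentence and Token are placeholder JCas types, and the consumers are just there
to force a full traversal):

    import static org.apache.uima.fit.util.JCasUtil.selectCovered;

    import java.util.List;

    import org.apache.uima.jcas.JCas;

    public class SelectionComparison {

        static void compare(JCas jcas, Sentence sentence) {
            // uimaFIT: step 3 eagerly collects all covered tokens into a list
            List<Token> covered = selectCovered(jcas, Token.class, sentence);
            covered.forEach(t -> { /* consume */ });

            // SelectFS: lazily streams over the covered tokens; the terminal .forEach()
            // forces a full traversal so both variants do comparable work
            jcas.select(Token.class).coveredBy(sentence).forEach(t -> { /* consume */ });

            // ...whereas .asList() (or .asArray()) materializes the result again,
            // which makes it equivalent to the uimaFIT call above
            List<Token> materialized = jcas.select(Token.class).coveredBy(sentence).asList();
        }
    }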

-- Richard
