On 11. Nov 2020, at 15:37, Mario Juric <[email protected]> wrote:
>
> Would it make sense to provide CAS files based on real documents for the
> benchmarks? I mean, we could run segmenters on some OpenAccess documents, map
> the annotations to those used by the tests, and store them somewhere as XMI
> or CAS binaries. The test could then load them during the initialization
> phase before benchmarking.
That would be an option. But I think I would rather factor out the initialization of the CASes in the benchmark such that different initialization strategies can be plugged in. Then one could implement a random CAS generator which generates CASes according to a few rules, such as:

- partition the CAS into sentences with a length between 20 and 200 characters, with a roughly bell-shaped distribution of sentence lengths
- partition the sentences into tokens of 2 to 20 characters, also with a bell-shaped distribution

That would be a bit more realistic for a benchmark like "find the sentence covering the current token" (see the P.S. below for a rough sketch of such a generator).

But actually, despite being a bit degenerate, I believe the current benchmark setup could already be used to profile the code and see if / what could possibly be improved (if anything).

Conceptually, both the uimaFIT selectCovered(Token, X) and the SelectFS select(Token).coveredBy(X) work in the same way:

1) seek to X in the annotation index using a binary search
2) search backwards linearly in the index to the first match by offset (to ignore type priorities)
3) iterate forward until the end of the selection window is reached

The main difference is that selectCovered already collects all annotations it finds into a list in step 3, while SelectFS allows streaming over the annotations - although people will probably use SelectFS with .asArray() or .asList(), which again makes it equivalent to the uimaFIT approach. That is why I put a .forEach() at the end of the selection in the benchmark, to make the two a bit more comparable (see the P.P.S.).

-- Richard
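
P.S.: To make the random CAS generator idea a bit more concrete, here is a rough, untested sketch. Sentence and Token stand in for whatever JCas cover classes the benchmark actually uses (with the usual (JCas, int, int) constructor that JCasGen generates), and the sum-of-two-uniforms trick is only a crude stand-in for a bell-shaped distribution:

  import java.util.Arrays;
  import java.util.Random;

  import org.apache.uima.jcas.JCas;

  public class RandomCasInitializer {
      private final Random rnd = new Random(42);

      // Sum of two uniforms -> triangular distribution, i.e. roughly bell-shaped
      private int roughlyBell(int min, int max) {
          return min + (rnd.nextInt(max - min + 1) + rnd.nextInt(max - min + 1)) / 2;
      }

      public void initialize(JCas jcas, int casSizeInChars) {
          // Placeholder document text - only the offsets matter for the benchmark
          char[] text = new char[casSizeInChars];
          Arrays.fill(text, 'x');
          jcas.setDocumentText(new String(text));

          int begin = 0;
          while (begin < casSizeInChars) {
              // Sentences of 20-200 characters
              int sentenceEnd = Math.min(begin + roughlyBell(20, 200), casSizeInChars);

              // Partition the sentence into tokens of 2-20 characters
              int tokenBegin = begin;
              while (tokenBegin < sentenceEnd) {
                  int tokenEnd = Math.min(tokenBegin + roughlyBell(2, 20), sentenceEnd);
                  new Token(jcas, tokenBegin, tokenEnd).addToIndexes();
                  tokenBegin = tokenEnd;
              }

              new Sentence(jcas, begin, sentenceEnd).addToIndexes();
              begin = sentenceEnd;
          }
      }
  }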

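P.P.S.: And to illustrate what the benchmark actually compares (again untested, and again assuming Sentence/Token cover classes), the two selection styles boil down to this:

  import static org.apache.uima.fit.util.JCasUtil.selectCovered;

  import org.apache.uima.jcas.JCas;

  public class SelectionStyles {
      static void selectTokensIn(JCas jcas, Sentence sentence) {
          // uimaFIT: step 3 already materializes all matches into a List
          for (Token t : selectCovered(jcas, Token.class, sentence)) {
              sink(t);
          }

          // SelectFS: streams over the matches; the terminal forEach() forces it
          // to actually visit every annotation, like the list above
          jcas.select(Token.class).coveredBy(sentence).forEach(t -> sink(t));
      }

      private static void sink(Token t) {
          // stand-in for whatever the benchmark does with each annotation
      }
  }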