Thanks Peter,

No problem with the delay. I was on vacation myself, and sometimes it is just necessary to pull the plug :)
I am just happy that you take the time to answer my questions, and I think your answers help make sense of this. I now have some ideas that I can experiment with to see what works. It is possible to use RutaBasic when optional spaces are included in the rules, although it gets more awkward. I would still prefer to avoid this, and a type-based rule-logic feature would make sense in our case. Shall I create a feature request for this? I wouldn’t expect you to do this any time soon, but let me know if there is something I could help out with when the time comes.

Cheers,
Mario

> On 18 Oct 2019, at 10:10, Peter Klügl <[email protected]> wrote:
>
> Hi,
>
> sorry for the delayed reply.
>
> comments below...
>
> On 09.10.2019 at 22:19, Mario Juric wrote:
>> Hi Peter,
>>
>> Thanks a lot for the answer.
>>
>> I am still trying to wrap my head around this, and I understand the issues
>> at play when dealing with a generic rule engine, since I am looking at an
>> isolated case only. I was just thinking that in my particular case the
>> covering annotation starts before matching 'Dog Cat', so why would its
>> ending right before Cat prevent the rule from firing? It doesn’t follow Dog,
>> and a rule like "Dog Covering {->MARK(CHASE)}" wouldn’t therefore be matched
>> either, but I understand now that something else being present in the area
>> between the two rule elements is enough for the match to fail. However, as
>> you describe, the presence of SPACE annotations and a rule like
>> Dog SPACE Cat { -> MARK(CHASE)} would succeed in matching despite the
>> presence of the covering annotation.
>
> The main thing here is probably the requirement that the logic for
> applying the visibility concept should always be symmetric, meaning it
> should be the same regardless of whether the rule matches from left to
> right or from right to left (or inside out).
>
> In your example, the rule matches from left to right (I assume), so the
> behavior that the last space is not skipped is not intuitive at all.
> However, if the rule matched for some reason from right to left,
> e.g., because of dynamic anchoring or a manual anchor, then the
> inference would detect a starting Covering annotation as the next
> possible position, which is not invisible (since there is nothing at all
> invisible). So there would actually be something that could be matched,
> but it is not the correct type (Dog).
>
> I do not know if this explanation makes sense... it's easier with a
> whiteboard ;-)
>
>> Have you ever described the implementation of the matching in some paper or
>> similar? I would be interested to have a look at it, but maybe it’s better
>> just to have a go at the code? I would certainly prefer reading a high-level
>> abstract specification first though :)
>
> The last paper is the NLE journal article, which contains a high-level
> description of the algorithm. However, this is some really
> specific functionality for a specific scenario. So, if I write a new
> paper, it will most likely not cover this.
>
>> Generally I cannot just trim the annotations in the real application, since
>> some of these whitespaces are included in the marking for various reasons. I
>> therefore played around with type filtering, since I was hoping that the
>> type filter would allow me to match the rules while ignoring any presence of
>> filtered types. I was again surprised to find out that filtering the
>> Covering type while retaining Cat and Dog would in this case just prevent
>> anything from being matched, because it seems to make all those text parts
>> invisible where the filtered types appear, no matter if they cover any
>> retained annotation types.
>> So this didn’t seem to solve my problem either,
>> although I could of course try to mark those areas I otherwise would
>> consider trimming and include those in the rules like a space or filter on
>> them, which I guess is what you suggested. It suddenly just becomes somewhat
>> awkward though, and it may just be clearer to use RutaBasic with the
>> rules instead.
>
> Yes, the visibility concept in Ruta is not type-based but type
> coverage-based (and I think that's really cool)
>
> It is possible to extend the functionality to additionally support
> type-based logic, but I do not know when this would be ready.
>
> I would not recommend using RutaBasic in the rules (I actually do not
> know right now if it would work), but if you do, then you should
> probably deactivate the "empty is invisible" option.
>
> Best,
>
> Peter
>
>> Cheers,
>> Mario
>>
>>> On 9 Oct 2019, at 09:35, Peter Klügl <[email protected]> wrote:
>>>
>>> Hi Mario,
>>>
>>> I need to take a closer look as this is not the usual scenario :-)
>>>
>>> However, without testing, I would assume that the second rule does not
>>> match because the space between dog and cat is not "empty".
>>>
>>> Normally, you have a complete partitioning provided by the seeding which
>>> causes the RutaBasic annotations. If there are only a few annotations,
>>> then there needs to be a decision if a text position is visible or not
>>> (as you have no SPACE, BREAK and MARKUP annotation). You would expect
>>> that the space between the annotations is ignored, but there is actually
>>> no reason why Ruta should do that, as there is no information at all
>>> that it should be ignored (... generic system, you might want to write
>>> rules for whitespaces...). In order to avoid this problem in such
>>> situations there is the option to define empty RutaBasics as invisible.
>>> Those are text positions where no annotation begins or ends (and not
>>> covered by annotations) AFAIR, and sequential matching could not match at
>>> all anyway. Thus, the first space is ignored, but not the second,
>>> because the Covering annotation ends there.
>>>
>>> Does that make sense?
>>>
>>> I think there are many options for making your rules more robust, but
>>> that depends on your complete system/pipeline. Is it an option to trim
>>> annotations in order to avoid whitespaces at the beginning or ending? Is
>>> it easy to identify these positions? You could create an annotation
>>> there and filter its type.
>>>
>>> Best,
>>>
>>> Peter
>>>
>>> On 07.10.2019 at 10:21, Mario Juric wrote:
>>>> Hi Peter,
>>>>
>>>> I have a script that is executed without any seeders for performance
>>>> reasons, and we don’t need the seeded annotations in that case. I have an
>>>> issue involving annotation elements that partially cover the rule elements
>>>> of interest, and I do not have a simple solution for it, so I have a
>>>> question about the match semantics. Let me explain it using a simple
>>>> example and the text 'cat dog cat'.
>>>>
>>>> Assume the following 4 annotation types and 2 rule statements:
>>>>
>>>> DECLARE Covering;
>>>> DECLARE Cat;
>>>> DECLARE Dog;
>>>> DECLARE CHASE;
>>>> Cat Dog { -> MARK(CHASE)};
>>>> Dog Cat { -> MARK(CHASE)};
>>>>
>>>> Assume prior to script execution the following annotations with beginnings
>>>> and endings:
>>>>
>>>> Cat[0,3[
>>>> Dog[4,7[
>>>> Cat[8,11[
>>>> Covering[0,8[
>>>>
>>>> The Covering annotation is an example of the disturbing element that I
>>>> observed, which has nothing or little to do with what I am trying to
>>>> match. It just happens to be there for a reason unrelated to these rules,
>>>> but it causes the second rule not to match when I expected it to. Only the
>>>> first rule fires, but the second will also fire when I change the Covering
>>>> bounds to [0,7[.
>>>>
>>>> The order in which elements are matched seems very different from how they
>>>> are usually selected from the CAS index, where you would get 'Covering Cat
>>>> Dog Cat', and with this order you would intuitively expect both rules to
>>>> match. This would probably be overly simplified though, since I would not
>>>> be able to match adjacent covering annotations this way, so I believe
>>>> matching is somehow based on edge detection. Still, I have difficulty
>>>> understanding why that extra covering space makes a difference.
>>>>
>>>> I was hoping you could provide me with some details, and I would also like
>>>> to know what possible workaround options I have. I was considering playing
>>>> around with type filtering, but it would require a bit of adding/removing
>>>> types to be filtered during the script, so it didn’t seem like the simplest
>>>> solution. Ensuring that covering always aligns with the end of a token is
>>>> another possibility in this particular case, but I still need to add
>>>> general robustness to the Ruta script against these scenarios. Any
>>>> feedback is much appreciated, thanks :)
>>>>
>>>> Cheers,
>>>> Mario
>>>>
>>> --
>>> Dr. Peter Klügl
>>> R&D Text Mining/Machine Learning
>>>
>>> Averbis GmbH
>>> Salzstr. 15
>>> 79098 Freiburg
>>> Germany
>>>
>>> Fon: +49 761 708 394 0
>>> Fax: +49 761 708 394 10
>>> Email: [email protected]
>>> Web: https://averbis.com
>>>
>>> Headquarters: Freiburg im Breisgau
>>> Register Court: Amtsgericht Freiburg im Breisgau, HRB 701080
>>> Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
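
To make the behaviour discussed in this thread concrete, here is a toy Python model of the 'cat dog cat' example. It is only a sketch and not Ruta's implementation: it assumes that the text is split at every annotation boundary, that a resulting span counts as "empty" (and hence invisible) when no annotation begins at its begin or ends at its end, and that a two-element rule matches left to right while skipping empty spans. The function names `basics`, `is_empty` and `matches` are hypothetical and do not exist in Ruta.

```python
from typing import List, Tuple

# (type name, begin, end) with an exclusive end, as in Cat[0,3[
Annotation = Tuple[str, int, int]


def basics(annotations: List[Annotation], text_length: int) -> List[Tuple[int, int]]:
    """Split [0, text_length[ at every annotation boundary (toy stand-in for RutaBasic)."""
    cuts = {0, text_length}
    for _, begin, end in annotations:
        cuts.update((begin, end))
    ordered = sorted(cuts)
    return list(zip(ordered, ordered[1:]))


def is_empty(span: Tuple[int, int], annotations: List[Annotation]) -> bool:
    """Assumption: a span is empty if nothing begins at its begin or ends at its end."""
    begin, end = span
    return not any(b == begin or e == end for _, b, e in annotations)


def matches(first: str, second: str, annotations: List[Annotation],
            text_length: int) -> List[Tuple[int, int]]:
    """Left-to-right match of the rule 'first second', skipping empty spans in between."""
    spans = basics(annotations, text_length)
    hits = []
    for name, begin, end in annotations:
        if name != first:
            continue
        pos = end
        # Spans are sorted, so consecutive empty spans are skipped in one pass.
        for span in spans:
            if span[0] == pos and is_empty(span, annotations):
                pos = span[1]
        for name2, begin2, end2 in annotations:
            if name2 == second and begin2 == pos:
                hits.append((begin, end2))
    return hits


base = [("Cat", 0, 3), ("Dog", 4, 7), ("Cat", 8, 11)]
covering_to_8 = base + [("Covering", 0, 8)]
covering_to_7 = base + [("Covering", 0, 7)]

print(matches("Cat", "Dog", covering_to_8, 11))  # [(0, 7)]
print(matches("Dog", "Cat", covering_to_8, 11))  # []
print(matches("Dog", "Cat", covering_to_7, 11))  # [(4, 11)]
```

Under these assumptions the span [3,4[ is empty and gets skipped, while [7,8[ is not empty because Covering ends at 8, so 'Dog Cat' fails. Moving the end of Covering to 7 makes [7,8[ empty and the second rule fires, which is consistent with the observations reported in the original message.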
