Hi, thanks, that would be great. Patches are simply attached to the issue. Non-trivial changes require an ICLA. Do you want to sign and submit it?
Best,

Peter

On 07.01.2016 at 10:08, Mario Gazzo wrote:
> Thanks,
>
> I just added the JIRA issue: https://issues.apache.org/jira/browse/UIMA-4729
>
> If you like, then we can also implement it and submit a patch, just let us
> know what the process is.
>
> Cheers
> Mario
>
>> On 07 Jan 2016, at 09:08 , Peter Klügl <peter.klu...@averbis.com> wrote:
>>
>> Hi,
>>
>> On 06.01.2016 at 14:48, Mario Gazzo wrote:
>>> Hi Peter,
>>>
>>> I had a look at the test cases and I think there are many interesting and
>>> useful features that cover many of our use cases but I will have to
>>> experiment with them before I know what might be missing. I have a few
>>> questions though:
>>>
>>> 1) It appears that we would then also be able to assign annotations to
>>> lists, which is nice. I am not sure from looking at the tests whether it is
>>> possible to use ADD with the annotation lists but I assume so.
>> Not yet, but I will implement it. It's still work in progress. But
>> thanks for pointing it out, I would probably have forgotten about it.
>>
>>> 2) The use of addresses is unclear to me just from reading the test, maybe
>>> you could explain them? This concept is very new to me.
>> It's not intended to be utilized directly in a rule file. It's rather
>> just a way to combine logic in java with ruta rules or use ruta
>> functionality in java code.
>> Let's say we have a new method like
>> boolean Ruta.matches(CAS cas, String rule, AnnotationFS... annotations)
>> and you call it with something like (syntax is not yet specified)
>> Ruta.matches(cas, "${PARTOF(Headline)} Keyword;", annotation)
>> Then, the "$" would be replaced by the address of the annotation and the
>> method would return whether the annotation is covered by a Headline
>> annotation and is followed by a Keyword annotation.
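The placeholder substitution Peter describes can be illustrated with a small, UIMA-free sketch. Everything in it is an assumption made for illustration: `AddressTemplate` and `expand` are hypothetical names, the addresses are plain integers rather than values read from a CAS, and because the thread says the syntax is not yet specified, the sketch simply keeps the `$` as a prefix in front of the inserted address so that it stays distinguishable from a number literal. It is not the Ruta implementation.

```java
final class AddressTemplate {

    /**
     * Replaces the i-th '$' in the rule with '$' followed by the i-th address.
     * Hypothetical helper: the real placeholder syntax is unspecified in the thread.
     */
    static String expand(String rule, int... addresses) {
        StringBuilder out = new StringBuilder();
        int next = 0; // index of the next address to consume
        for (int i = 0; i < rule.length(); i++) {
            char c = rule.charAt(i);
            if (c == '$') {
                if (next >= addresses.length) {
                    throw new IllegalArgumentException("more '$' placeholders than addresses");
                }
                out.append('$').append(addresses[next++]);
            } else {
                out.append(c);
            }
        }
        if (next != addresses.length) {
            throw new IllegalArgumentException("more addresses than '$' placeholders");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 1234 is a made-up address standing in for the annotation passed to the call.
        System.out.println(expand("${PARTOF(Headline)} Keyword;", 1234));
    }
}
```

With the made-up address 1234, the rule string from the example becomes `$1234{PARTOF(Headline)} Keyword;`, which a rule engine could then evaluate against that one annotation.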
>>
>>> 3) The annotation feature expression looks nice but I wonder whether an
>>> array element can also be referenced using an int expression and not just a
>>> constant e.g. Struct.as[intVar+1]{->T1};
>> Yes, without allowing number expressions, it would not really be useful.
>> The current implementation is just a test in order to check whether the
>> internal object model is good enough to cover it. The complete
>> functionality will probably not be included in the next release since
>> there is still much work left in order to get it up and running. The
>> semantics of such expressions (Struct.as) are resolved on the fly, and
>> the code does not support expressions at all. I still have to think
>> about a way to implement it.
>>
>>> The label expressions are also useful and will make some of our rules more
>>> readable.
>>>
>>> Finally I have one additional question about the MARKUP initialisation. I have
>>> a case where I need the token seeds coming from the default seeder but I
>>> don’t want to run the markup initialisation. Is there a separate seeder
>>> defined for this somewhere? Right now I have my own copy of the default
>>> seeder without the MARKUP initialisation but obviously I do not want to
>>> maintain this. It looks as if they could also be split in two seeders with
>>> both added as default and then I could overwrite with my own seeder list
>>> containing only the token seeder.
>> Yes, we can split them or just add another one that ignores markup. I
>> was also always thinking about adding a DetailedSeeder that creates much
>> more fine-grained types like different brackets and quotes... but it was
>> never on top of my todo list.
>>
>> Do you want to open a jira issue for it?
>>
>> Best,
>>
>> Peter
>>
>>> Cheers
>>> Mario
>>>
>>>
>>>> On 04 Jan 2016, at 17:06 , Peter Klügl <peter.klu...@averbis.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 04.01.2016 at 16:13, Mario Gazzo wrote:
>>>>> Hi Peter,
>>>>>
>>>>> No problem, I was anyway pretty much offline myself during Christmas
>>>>> holidays.
>>>>>
>>>>> The term “overhead” is probably an exaggeration in this context
>>>>> especially after I disabled the MARKUP initialisation. We earlier
>>>>> implemented our own XML markup annotator tailored to better fit our needs
>>>>> with additional annotation types and properties, so the Ruta MARKUP is
>>>>> currently not used. It just happens that we don’t directly use RutaBasic
>>>>> in any of our rules in this particular case so I was curious to know
>>>>> whether we could avoid creating them in the first place since there seem
>>>>> to be quite a few. However, overall processing required by our Ruta
>>>>> scripts compared to other processing steps is now small and
>>>>> sub-optimising this further by making RutaBasic optional would currently
>>>>> be of very low priority to us. We would prioritise other features higher
>>>>> e.g. being able to assign annotations to variables as we discussed
>>>>> previously in another thread.
>>>> I am working on this right now and there is finally some first progress :-)
>>>>
>>>> I fear that I won't catch all use cases (combinations with language
>>>> elements) with the first attempt. If you are interested (and wanna take
>>>> care I do not miss your use case), feel free to take a look at the new
>>>> unit tests:
>>>> https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation
>>>>
>>>> It's still work in progress. Proposals for more unit tests are very
>>>> welcome.
>>>>
>>>>> We haven’t processed documents as large as those you mention since books
>>>>> have so far been divided into chapters and processing could therefore be
>>>>> parallelised accordingly.
>>>>> We also drop extreme outliers above a certain
>>>>> size if we encounter them and then we batch process them later in smaller
>>>>> chunks but this has so far not been necessary with our current data sets.
>>>>> Like you, our processing bottlenecks are now in different components.
>>>> Ah, that's nice to hear that ruta is not the bottleneck :-D
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>>
>>>>> Cheers
>>>>> Mario
>>>>>
>>>>>> On 30 Dec 2015, at 16:44 , Peter Klügl <peter.klu...@averbis.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> sorry for the delayed reply.
>>>>>>
>>>>>> RutaEngine::initializeStream:
>>>>>>
>>>>>> The special treatment of MARKUPs that causes the increased time required
>>>>>> for initialization is just a workaround because I was too lazy to write a
>>>>>> working jflex rule. Well, I tried but failed. It shouldn't be hard to
>>>>>> improve this code... I will create an issue for it. When I did the last
>>>>>> performance optimization, uima did not check the indexes yet and my test
>>>>>> set did not contain markups.
>>>>>>
>>>>>> Deactivate creation of RutaBasic:
>>>>>> Short answer is no. I was already thinking about making RutaBasic
>>>>>> optional in future so that the user can configure if they are used.
>>>>>> However, right now, they are required for rule inference and make the
>>>>>> rule inference "fast" in the first place. RutaBasic is just an internal
>>>>>> annotation like RutaAnnotation (for SCORE, MARKSCORE) and RutaFrame, and
>>>>>> rules should not match on them at all.
>>>>>>
>>>>>> Some background information:
>>>>>>
>>>>>> RutaBasics are used for three things:
>>>>>> - store additional information in order to avoid index operations. Some
>>>>>> useful conditions would require many index operations, e.g., PARTOF or
>>>>>> ENDSWITH. RutaBasic is utilized as a cache of which annotations start and
>>>>>> end at which position, and which positions are covered by which types.
>>>>>> - provide a container to make this information available across analysis
>>>>>> engines. Information shared by analysis engines is normally stored in the
>>>>>> CAS, e.g. in annotations, (or in external resources). This is the role
>>>>>> of RutaBasic. It is not really implemented right now as it should be but
>>>>>> I will improve it soon. Then, there is no performance decrease when a
>>>>>> pipeline is spammed with small ruta engines.
>>>>>> - a basic minimal disjoint partitioning of the document for the coverage
>>>>>> based visibility concept.
>>>>>>
>>>>>> Making RutaBasic optional is possible. If there is a real need for it,
>>>>>> e.g., in order to reduce the memory footprint or when processing large
>>>>>> documents where parts are simply not interesting, then I will put it on
>>>>>> my TODO list. I am also open to other/new ideas for how to solve the
>>>>>> challenges (and for incremental usage of internal caches).
>>>>>>
>>>>>> What is your experience with the processing overhead concerning
>>>>>> RutaBasic? Is it the rule matching or rather the initialization? I
>>>>>> myself already had some performance problems with the initialization and
>>>>>> memory consumption in large CAS (500+ page PDFs). However, other
>>>>>> components, serialization and the CAS editor were the actual bottlenecks.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>>
>>>>>> On 22.12.2015 at 17:26, Mario Gazzo wrote:
>>>>>>> I got around it by removing the default seeders by specifying an empty
>>>>>>> seeders list since we don’t need the MARKUP annotations anymore.
>>>>>>>
>>>>>>> I still don’t know why it created so much overhead but it sometimes
>>>>>>> seemed to rival the POS tagger in processing time.
>>>>>>>
>>>>>>> Anyway, this leads me to the next question. Can I disable the creation
>>>>>>> of Ruta basic annotations entirely to save processing overhead and only
>>>>>>> apply Ruta rules to other annotation types created by other AEs such as
>>>>>>> our own?
>>>>>>>
>>>>>>> Cheers
>>>>>>> Mario
>>>>>>>
>>>>>>>> On 21 Dec 2015, at 16:09 , Mario Juric <mario.juric...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Peter,
>>>>>>>>
>>>>>>>> I noticed that occasionally the initialisation in
>>>>>>>> RutaEngine::initializeStream can take a very long time. I can’t really
>>>>>>>> explain it and it seems independent of document length since I have
>>>>>>>> seen this with even very small XML documents.
>>>>>>>>
>>>>>>>> The method seems to spend much time in the DefaultSeeder when creating
>>>>>>>> MARKUP annotations during subiterator.moveToNext calls (line 89) and
>>>>>>>> inside Subiterator it seems to be the while loop inside
>>>>>>>> adjustForStrictForward (line 232), which is inside UIMA core classes.
>>>>>>>> I haven’t gone into any deeper analysis yet but I would first like to hear
>>>>>>>> whether you have an idea what could be the main cause(s) for this?
>>>>>>>>
>>>>>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>>>>>
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>> Mario
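
The caching role of RutaBasic described in Peter's reply of 30 Dec 2015 (a minimal disjoint partitioning of the document, where each segment remembers which types begin, end, or cover it) can be sketched without UIMA as follows. This is only an illustration under stated assumptions: `Ann`, `Basic`, `build` and `partOf` are hypothetical names, types are plain strings, and the PARTOF check is simplified to a lookup on the segment at the annotation's begin offset. It is not Ruta's actual data structure.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class BasicCacheSketch {

    /** Hypothetical stand-in for a typed annotation: a type name plus offsets (begin <= end). */
    static final class Ann {
        final String type;
        final int begin;
        final int end;
        Ann(String type, int begin, int end) { this.type = type; this.begin = begin; this.end = end; }
    }

    /** One minimal segment of the text, playing the role the thread ascribes to RutaBasic. */
    static final class Basic {
        final int begin;
        final int end;
        final Set<String> beginning = new HashSet<>(); // types starting at this segment
        final Set<String> ending = new HashSet<>();    // types ending at this segment
        final Set<String> partOf = new HashSet<>();    // types covering this segment
        Basic(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** Cuts the text at every annotation boundary and fills the per-segment caches once. */
    static NavigableMap<Integer, Basic> build(List<Ann> anns) {
        TreeSet<Integer> cuts = new TreeSet<>();
        for (Ann a : anns) { cuts.add(a.begin); cuts.add(a.end); }
        NavigableMap<Integer, Basic> basics = new TreeMap<>();
        Integer prev = null;
        for (Integer cut : cuts) {
            if (prev != null) basics.put(prev, new Basic(prev, cut));
            prev = cut;
        }
        for (Ann a : anns) {
            // All segments whose begin offset lies in [a.begin, a.end) are covered by a.
            for (Basic b : basics.subMap(a.begin, true, a.end, false).values()) {
                b.partOf.add(a.type);
                if (b.begin == a.begin) b.beginning.add(a.type);
                if (b.end == a.end) b.ending.add(a.type);
            }
        }
        return basics;
    }

    /** Simplified PARTOF: a cache lookup instead of an index scan. */
    static boolean partOf(NavigableMap<Integer, Basic> basics, Ann a, String type) {
        Basic first = basics.get(a.begin);
        return first != null && first.partOf.contains(type);
    }

    public static void main(String[] args) {
        Ann headline = new Ann("Headline", 0, 20);
        Ann inside = new Ann("Keyword", 5, 10);
        Ann outside = new Ann("Keyword", 25, 30);
        NavigableMap<Integer, Basic> basics = build(Arrays.asList(headline, inside, outside));
        System.out.println(partOf(basics, inside, "Headline"));
        System.out.println(partOf(basics, outside, "Headline"));
    }
}
```

The trade-off matches the one discussed in the thread: the segments cost time and memory to build up front, but conditions such as PARTOF then become set lookups rather than repeated index operations.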