Here's the description on the UIMA site:
https://uima.apache.org/get-involved.html

Here's the description of the general Apache process:
http://www.apache.org/dev/new-committers-guide.html#cla

A short summary of what to do:
- complete the ICLA (http://www.apache.org/licenses/icla.pdf), print it, sign it and scan it
- maybe do the same for the CCLA (http://www.apache.org/licenses/cla-corporate.txt) if your employer requires it and you did the contribution/implementation during work time
- send the scanned document (or both) to secret...@apache.org

"apache id" and "notify project" are optional, but I would fill them in (so that we are informed when the documents have been processed, and you already have an id in case you gain committer rights).

I hope I have not forgotten anything...

Best,

Peter

On 07.01.2016 at 10:22, Mario Gazzo wrote:
> Yes, where do we sign this?
>
> :-)
>
>> On 07 Jan 2016, at 10:16 , Peter Klügl <peter.klu...@averbis.com> wrote:
>>
>> :-) let me know if you need help or have any questions.
>>
>> On 07.01.2016 at 10:12, Mario Gazzo wrote:
>>> Yes, let us just sign and submit it.
>>>
>>>> On 07 Jan 2016, at 10:11 , Peter Klügl <peter.klu...@averbis.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> thanks, that would be great. Patches are simply attached to the issue.
>>>> Non-trivial changes require an ICLA. Do you want to sign and submit it?
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>>
>>>> On 07.01.2016 at 10:08, Mario Gazzo wrote:
>>>>> Thanks,
>>>>>
>>>>> I just added the JIRA issue:
>>>>> https://issues.apache.org/jira/browse/UIMA-4729
>>>>>
>>>>> If you like, then we can also implement it and submit a patch, just let
>>>>> us know what the process is.
>>>>>
>>>>> Cheers
>>>>> Mario
>>>>>
>>>>>> On 07 Jan 2016, at 09:08 , Peter Klügl <peter.klu...@averbis.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 06.01.2016 at 14:48, Mario Gazzo wrote:
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> I had a look at the test cases and I think there are many interesting
>>>>>>> and useful features that cover many of our use cases but I will have to
>>>>>>> experiment with them before I know what might be missing. I have a few
>>>>>>> questions though:
>>>>>>>
>>>>>>> 1) It appears that we would then also be able to assign annotations to
>>>>>>> lists, which is nice. I am not sure from looking at the tests whether
>>>>>>> it is possible to use ADD with the annotation lists but I assume so.
>>>>>> Not yet, but I will implement it. It's still work in progress. But
>>>>>> thanks for pointing it out, I would probably have forgotten about it.
>>>>>>
>>>>>>> 2) The use of addresses is unclear to me just from reading the test,
>>>>>>> maybe you could explain them? This concept is very new to me.
>>>>>> It's not intended to be utilized directly in a rule file. It's rather
>>>>>> just a way to combine logic in java with ruta rules or use ruta
>>>>>> functionality in java code.
>>>>>> Let's say we have a new method like
>>>>>> boolean Ruta.matches(CAS cas, String rule, AnnotationFS... annotations)
>>>>>> and you call it with something like (syntax is not yet specified)
>>>>>> Ruta.matches(cas, "${PARTOF(Headline)} Keyword;", annotation)
>>>>>> Then, the "$" would be replaced by the address of the annotation and the
>>>>>> method would return whether the annotation is covered by a Headline
>>>>>> annotation and is followed by a Keyword annotation.
>>>>>>
>>>>>>> 3) The annotation feature expression looks nice but I wonder whether an
>>>>>>> array element can also be referenced using an int expression and not
>>>>>>> just a constant e.g. Struct.as[intVar+1]{->T1};
>>>>>> Yes, without allowing number expressions, it would not really be useful.
>>>>>> The current implementation is just a test in order to check whether the
>>>>>> internal object model is good enough to cover it. The complete
>>>>>> functionality will probably not be included in the next release since
>>>>>> there is still much work left in order to get it up and running. The
>>>>>> semantics of such expressions (Struct.as) are resolved on the fly, and
>>>>>> the code does not support expressions at all. I still have to think
>>>>>> about a way to implement it.
>>>>>>
>>>>>>> The label expressions are also useful and will make some of our rules
>>>>>>> more readable.
>>>>>>>
>>>>>>> Finally I have one additional question about the MARKUP initialisation. I
>>>>>>> have a case where I need the token seeds coming from the default seeder
>>>>>>> but I don’t want to run the markup initialisation. Is there a separate
>>>>>>> seeder defined for this somewhere? Right now I have my own copy of the
>>>>>>> default seeder without the MARKUP initialisation but obviously I do not
>>>>>>> want to maintain this. It looks as if they could also be split into two
>>>>>>> seeders with both added as default and then I could overwrite with my
>>>>>>> own seeder list containing only the token seeder.
>>>>>> Yes, we can split them or just add another one that ignores markup. I
>>>>>> was also always thinking about adding a DetailedSeeder that creates much
>>>>>> more fine-grained types like different brackets and quotes... but it was
>>>>>> never on top of my todo list.
>>>>>>
>>>>>> Do you want to open a jira issue for it?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>>> Cheers
>>>>>>> Mario
>>>>>>>
>>>>>>>
>>>>>>>> On 04 Jan 2016, at 17:06 , Peter Klügl <peter.klu...@averbis.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 04.01.2016 at 16:13, Mario Gazzo wrote:
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> No problem, I was anyway pretty much offline myself during Christmas
>>>>>>>>> holidays.
>>>>>>>>>
>>>>>>>>> The term “overhead” is probably an exaggeration in this context
>>>>>>>>> especially after I disabled the MARKUP initialisation. We implemented
>>>>>>>>> earlier our own XML markup annotator tailored to better fit our needs
>>>>>>>>> with additional annotation types and properties, so the Ruta MARKUP
>>>>>>>>> is currently not used. It just happens that we don’t directly use
>>>>>>>>> RutaBasic in any of our rules in this particular case so I was
>>>>>>>>> curious to know whether we could avoid creating them in the first
>>>>>>>>> place since there seem to be quite a few. However, overall
>>>>>>>>> processing required by our Ruta scripts compared to other processing
>>>>>>>>> steps is now small and sub-optimising this further by making
>>>>>>>>> RutaBasic optional would currently be of very low priority to us. We
>>>>>>>>> would prioritise other features higher e.g. being able to assign
>>>>>>>>> annotations to variables as we discussed previously in another thread.
>>>>>>>> I am working on this right now and there is finally some first
>>>>>>>> progress :-)
>>>>>>>>
>>>>>>>> I fear that I won't catch all use cases (combinations with language
>>>>>>>> elements) with the first attempt. If you are interested (and wanna take
>>>>>>>> care I do not miss your use case), feel free to take a look at the new
>>>>>>>> unit tests:
>>>>>>>> https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation
>>>>>>>>
>>>>>>>> It's still work in progress. Proposals for more unit tests are very
>>>>>>>> welcome.
>>>>>>>>
>>>>>>>>> We haven’t processed documents as large as those you mention since
>>>>>>>>> books have so far been divided into chapters and processing could
>>>>>>>>> therefore be parallelised accordingly.
>>>>>>>>> We also drop extreme outliers above a certain size if we encounter
>>>>>>>>> them and then we batch process them later in smaller chunks but this
>>>>>>>>> has so far not been necessary with our current data sets. Like you, our
>>>>>>>>> processing bottlenecks are now in different components.
>>>>>>>> Ah, that's nice to hear that ruta is not the bottleneck :-D
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Mario
>>>>>>>>>
>>>>>>>>>> On 30 Dec 2015, at 16:44 , Peter Klügl <peter.klu...@averbis.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> sorry for the delayed reply.
>>>>>>>>>>
>>>>>>>>>> RutaEngine::initializeStream:
>>>>>>>>>>
>>>>>>>>>> The special treatment of MARKUPs that causes the increased time
>>>>>>>>>> required for initialization is just a workaround because I was too
>>>>>>>>>> lazy to write a working jflex rule. Well, I tried but failed. It
>>>>>>>>>> shouldn't be hard to improve this code... I will create an issue
>>>>>>>>>> for it. When I did the last performance optimization, uima did not
>>>>>>>>>> check the indexes yet and my test set did not contain markups.
>>>>>>>>>>
>>>>>>>>>> Deactivate creation of RutaBasic:
>>>>>>>>>> Short answer is no. I was already thinking about making RutaBasic
>>>>>>>>>> optional in future so that the user can configure if they are used.
>>>>>>>>>> However, right now, they are required for rule inference and make
>>>>>>>>>> the rule inference "fast" in the first place. RutaBasic is just an
>>>>>>>>>> internal annotation like RutaAnnotation (for SCORE, MARKSCORE) and
>>>>>>>>>> RutaFrame, and rules should not match on them at all.
>>>>>>>>>>
>>>>>>>>>> Some background information:
>>>>>>>>>>
>>>>>>>>>> RutaBasics are used for three things:
>>>>>>>>>> - store additional information in order to avoid index operations.
>>>>>>>>>> Some useful conditions would require many index operations, e.g.,
>>>>>>>>>> PARTOF or ENDSWITH.
>>>>>>>>>> RutaBasic is utilized as a cache of which annotations start and
>>>>>>>>>> end at which position, and which positions are covered by which
>>>>>>>>>> types.
>>>>>>>>>> - provide a container to make this information available across
>>>>>>>>>> analysis engines. Information shared by analysis engines is normally
>>>>>>>>>> stored in the CAS, e.g. in annotations, (or in external resources).
>>>>>>>>>> This is the role of RutaBasic. It is not really implemented right
>>>>>>>>>> now as it should be but I will improve it soon. Then, there is no
>>>>>>>>>> performance decrease when a pipeline is spammed with small ruta
>>>>>>>>>> engines.
>>>>>>>>>> - a basic minimal disjoint partitioning of the document for the
>>>>>>>>>> coverage based visibility concept.
>>>>>>>>>>
>>>>>>>>>> Making RutaBasic optional is possible. If there is a real need for
>>>>>>>>>> it, e.g., in order to reduce the memory footprint or when processing
>>>>>>>>>> large documents where parts are simply not interesting, then I will
>>>>>>>>>> put it on my TODO list. I am also open for other/new ideas how to
>>>>>>>>>> solve the challenges (and for incremental usage of internal caches).
>>>>>>>>>>
>>>>>>>>>> What is your experience with the processing overhead concerning
>>>>>>>>>> RutaBasic? Is it the rule matching or rather the initialization? I
>>>>>>>>>> myself had already some performance problems with the initialization
>>>>>>>>>> and memory consumption in large CAS (500+ page PDFs). However,
>>>>>>>>>> other components, serialization and the CAS editor were the actual
>>>>>>>>>> bottlenecks.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Peter
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 22.12.2015 at 17:26, Mario Gazzo wrote:
>>>>>>>>>>> I got around it by removing the default seeders by specifying an
>>>>>>>>>>> empty seeders list since we don’t need the MARKUP annotations
>>>>>>>>>>> anymore.
>>>>>>>>>>>
>>>>>>>>>>> I still don’t know why it created so much overhead but it sometimes
>>>>>>>>>>> seemed to rival the POS tagger in processing time.
>>>>>>>>>>>
>>>>>>>>>>> Anyway, this leads me to the next question. Can I disable the
>>>>>>>>>>> creation of Ruta basic annotations entirely to save processing
>>>>>>>>>>> overhead and only apply Ruta rules to other annotation types
>>>>>>>>>>> created by other AEs such as our own?
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>> Mario
>>>>>>>>>>>
>>>>>>>>>>>> On 21 Dec 2015, at 16:09 , Mario Juric <mario.juric...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>
>>>>>>>>>>>> I noticed that occasionally the initialisation in
>>>>>>>>>>>> RutaEngine::initializeStream can take a very long time. I can’t
>>>>>>>>>>>> really explain it and it seems independent of document length
>>>>>>>>>>>> since I have seen this with even very small XML documents.
>>>>>>>>>>>>
>>>>>>>>>>>> The method seems to spend much time in the DefaultSeeder when
>>>>>>>>>>>> creating MARKUP annotations during subiterator.moveToNext calls
>>>>>>>>>>>> (line 89) and inside Subiterator it seems to be the while loop
>>>>>>>>>>>> inside adjustForStrictForward (line 232), which is inside UIMA
>>>>>>>>>>>> core classes. I haven’t gone into any deeper analysis yet but I
>>>>>>>>>>>> would first like to hear whether you have an idea what could be
>>>>>>>>>>>> the main cause(s) for this?
>>>>>>>>>>>>
>>>>>>>>>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers
>>>>>>>>>>>> Mario
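
The caching role described for RutaBasic in this thread can be illustrated with a minimal Java sketch. It is not Ruta's implementation: the class CoverageCache, its methods, and the use of segment indexes in place of character offsets are all assumptions made for illustration. The only idea taken from the thread is that each minimal segment of the document remembers which types cover it, so that a PARTOF-style check becomes a lookup instead of an index operation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical stand-in for the caching role of RutaBasic.
 * Not part of Ruta or UIMA; all names here are illustrative.
 */
class CoverageCache {
    // One entry per minimal segment: the names of the types covering it.
    private final List<Set<String>> typesBySegment = new ArrayList<>();

    CoverageCache(int segmentCount) {
        for (int i = 0; i < segmentCount; i++) {
            typesBySegment.add(new HashSet<>());
        }
    }

    /** Record that an annotation of the given type covers segments first..last (inclusive). */
    void add(String type, int first, int last) {
        for (int i = first; i <= last; i++) {
            typesBySegment.get(i).add(type);
        }
    }

    /** PARTOF-style lookup: is every segment in first..last covered by the given type? */
    boolean partOf(String type, int first, int last) {
        for (int i = first; i <= last; i++) {
            if (!typesBySegment.get(i).contains(type)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        CoverageCache cache = new CoverageCache(4);
        cache.add("Headline", 0, 2);                          // Headline covers segments 0..2
        System.out.println(cache.partOf("Headline", 1, 2));   // true
        System.out.println(cache.partOf("Headline", 2, 3));   // false: segment 3 is not covered
    }
}
```

The sketch trades memory for lookup speed, which matches the trade-off discussed above: the cache has to be built for every segment during initialization, even if no rule ever asks about it. It also simplifies in one respect: two adjacent annotations of the same type are indistinguishable from a single one that spans both.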