Re: Very long Ruta stream initialization

Peter Klügl Thu, 07 Jan 2016 01:41:02 -0800

Here's the description on the UIMA site:
https://uima.apache.org/get-involved.html


Here's the description of the general Apache process:
http://www.apache.org/dev/new-committers-guide.html#cla

A short summary of what to do:
- complete the ICLA (http://www.apache.org/licenses/icla.pdf), print it,
sign it and scan it
- maybe do the same for the CCLA
(http://www.apache.org/licenses/cla-corporate.txt) if your employer
requires it and you did the contribution/implementation during work time
- send the scanned document (or both) to secret...@apache.org

"apache id" and "notify project" are optional, but I would fill them in (so
that we get informed that the documents have been processed, and you
already have an id in case you gain committer rights).

I hope I have not forgotten something...

Best,

Peter


Am 07.01.2016 um 10:22 schrieb Mario Gazzo:
> Yes, where do we sign this?
>
> :-)
>
>> On 07 Jan 2016, at 10:16 , Peter Klügl <peter.klu...@averbis.com> wrote:
>>
>> :-) let me know if you need help or have any questions.
>>
>> Am 07.01.2016 um 10:12 schrieb Mario Gazzo:
>>> Yes, let us just sign and submit it.
>>>
>>>> On 07 Jan 2016, at 10:11 , Peter Klügl <peter.klu...@averbis.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> thanks, that would be great. Patches are simply attached to the issue.
>>>> Non-trivial changes require an ICLA. Do you want to sign and submit it?
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>>
>>>> Am 07.01.2016 um 10:08 schrieb Mario Gazzo:
>>>>> Thanks,
>>>>>
>>>>> I just added the JIRA issue: 
>>>>> https://issues.apache.org/jira/browse/UIMA-4729 
>>>>>
>>>>> If you like, then we can also implement it and submit a patch, just let 
>>>>> us know what the process is.
>>>>>
>>>>> Cheers
>>>>> Mario
>>>>>
>>>>>> On 07 Jan 2016, at 09:08 , Peter Klügl <peter.klu...@averbis.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Am 06.01.2016 um 14:48 schrieb Mario Gazzo:
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> I had a look at the test cases and I think there are many interesting 
>>>>>>> and useful features that cover many of our use cases but I will have to 
>>>>>>> experiment with them before I know what might be missing. I have a few 
>>>>>>> questions though:
>>>>>>>
>>>>>>> 1) It appears that we would then also be able to assign annotations to 
>>>>>>> lists, which is nice. I am not sure from looking at the tests whether 
>>>>>>> it is possible to use ADD with the annotation lists but I assume so.
>>>>>> Not yet, but I will implement it. It's still work in progress. But
>>>>>> thanks for pointing it out, I would probably have forgotten about it.
>>>>>>
>>>>>>> 2) The use of addresses is unclear to me just from reading the tests. 
>>>>>>> Maybe you could explain them? This concept is very new to me.
>>>>>> It's not intended to be utilized directly in a rule file. It's rather
>>>>>> just a way to combine logic in Java with Ruta rules or to use Ruta
>>>>>> functionality in Java code.
>>>>>> Let's say we have a new method like
>>>>>> boolean Ruta.matches(CAS cas, String rule, AnnotationFS... annotations)
>>>>>> and you call it with something like (syntax is not yet specified)
>>>>>> Ruta.matches(cas, "${PARTOF(Headline)} Keyword;", annotation)
>>>>>> Then, the "$" would be replaced by the address of the annotation and the
>>>>>> method would return whether the annotation is covered by a Headline
>>>>>> annotation and is followed by a Keyword annotation.
>>>>>>
>>>>>>> 3) The annotation feature expression looks nice but I wonder whether an 
>>>>>>> array element can also be referenced using an int expression and not 
>>>>>>> just a constant e.g. Struct.as[intVar+1]{->T1};
>>>>>> Yes, without allowing number expressions, it would not really be useful.
>>>>>> The current implementation is just a test in order to check whether the
>>>>>> internal object model is good enough to cover it. The complete
>>>>>> functionality will probably not be included in the next release since
>>>>>> there is still much work left in order to get it up and running. The
>>>>>> semantics of such expressions (Struct.as) are resolved on the fly, and
>>>>>> the code does not support expressions at all. I still have to think
>>>>>> about a way to implement it.
>>>>>>
>>>>>>> The label expressions are also useful and will make some of our rules 
>>>>>>> more readable.
>>>>>>>
>>>>>>> Finally, I have one additional question about the MARKUP initialisation. I 
>>>>>>> have a case where I need the token seeds coming from the default seeder 
>>>>>>> but I don’t want to run the markup initialisation. Is there a separate 
>>>>>>> seeder defined for this somewhere? Right now I have my own copy of the 
>>>>>>> default seeder without the MARKUP initialisation but obviously I do not 
>>>>>>> want to maintain this. It looks as if they could also be split in two 
>>>>>>> seeders with both added as default and then I could overwrite with my 
>>>>>>> own seeder list containing only the token seeder.
>>>>>> Yes, we can split them or just add another one that ignores markup. I
>>>>>> have also long been thinking about adding a DetailedSeeder that creates
>>>>>> much more fine-grained types like different brackets and quotes... but it
>>>>>> was never at the top of my todo list.
>>>>>>
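The split discussed above could look roughly like the following Java sketch. It is only an illustration under stated assumptions: the `Seeder` interface, `TOKEN_SEEDER`, `MARKUP_SEEDER` and `initialize` are hypothetical names and not Ruta's API, and the regular expressions are crude stand-ins for the real lexer (the token pattern does not treat tag names specially).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of split seeders; the names are not Ruta's API. */
final class SeederSplitSketch {

    /** A seeder turns raw text into labelled spans, rendered here as "TYPE[begin,end)". */
    interface Seeder {
        List<String> seed(String text);
    }

    /** Shared helper: label every regex match with the given type name. */
    static List<String> label(String type, Pattern pattern, String text) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            result.add(type + "[" + m.start() + "," + m.end() + ")");
        }
        return result;
    }

    /** Seeds word tokens only (assumed pattern, far simpler than a real lexer). */
    static final Seeder TOKEN_SEEDER =
            text -> label("W", Pattern.compile("[A-Za-z]+"), text);

    /** Seeds markup only, kept separate so that it can be left out. */
    static final Seeder MARKUP_SEEDER =
            text -> label("MARKUP", Pattern.compile("<[^>]+>"), text);

    /** Default configuration: both seeders are active. */
    static final List<Seeder> DEFAULT_SEEDERS = Arrays.asList(TOKEN_SEEDER, MARKUP_SEEDER);

    /** Runs the configured seeders in order and collects their seeds. */
    static List<String> initialize(String text, List<Seeder> seeders) {
        List<String> seeds = new ArrayList<>();
        for (Seeder seeder : seeders) {
            seeds.addAll(seeder.seed(text));
        }
        return seeds;
    }
}
```

Passing a list that contains only `SeederSplitSketch.TOKEN_SEEDER` instead of `DEFAULT_SEEDERS` corresponds to the override described above, and an empty list skips seeding entirely.
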
>>>>>> Do you want to open a jira issue for it?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>>> Cheers
>>>>>>> Mario
>>>>>>>
>>>>>>>
>>>>>>>> On 04 Jan 2016, at 17:06 , Peter Klügl <peter.klu...@averbis.com> 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Am 04.01.2016 um 16:13 schrieb Mario Gazzo:
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> No problem, I was anyway pretty much offline myself during Christmas 
>>>>>>>>> holidays.
>>>>>>>>>
>>>>>>>>> The term “overhead” is probably an exaggeration in this context 
>>>>>>>>> especially after I disabled the MARKUP initialisation. We had earlier 
>>>>>>>>> implemented our own XML markup annotator tailored to better fit our needs 
>>>>>>>>> with additional annotation types and properties, so the Ruta MARKUP 
>>>>>>>>> is currently not used. It just happens that we don’t directly use 
>>>>>>>>> RutaBasic in any of our rules in this particular case so I was 
>>>>>>>>> curious to know whether we could avoid creating them in the first 
>>>>>>>>> place since there seem to be quite a few. However, overall 
>>>>>>>>> processing required by our Ruta scripts compared to other processing 
>>>>>>>>> steps is now small and sub-optimising this further by making 
>>>>>>>>> RutaBasic optional would currently be of very low priority to us. We 
>>>>>>>>> would prioritise other features higher e.g. being able to assign 
>>>>>>>>> annotations to variables as we discussed previously in another thread.
>>>>>>>> I am working on this right now and there is finally some first 
>>>>>>>> progress :-)
>>>>>>>>
>>>>>>>> I fear that I won't catch all use cases (combinations with language
>>>>>>>> elements) with the first attempt. If you are interested (and want to make
>>>>>>>> sure I do not miss your use case), feel free to take a look at the new
>>>>>>>> unit tests:
>>>>>>>> https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation
>>>>>>>>
>>>>>>>> It's still work in progress. Proposals for more unit tests are very 
>>>>>>>> welcome.
>>>>>>>>
>>>>>>>>> We haven’t processed documents as large as those you mention since 
>>>>>>>>> books have so far been divided into chapters and processing could 
>>>>>>>>> therefore be parallelised accordingly. We also drop extreme outliers 
>>>>>>>>> above a certain size if we encounter them and then we batch process 
>>>>>>>>> them later in smaller chunks but this has so far not been necessary 
>>>>>>>>> with our current data sets. Like you, our processing bottlenecks are 
>>>>>>>>> now in different components.
>>>>>>>> Ah, it's nice to hear that Ruta is not the bottleneck :-D
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Mario
>>>>>>>>>
>>>>>>>>>> On 30 Dec 2015, at 16:44 , Peter Klügl <peter.klu...@averbis.com> 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> sorry for the delayed reply.
>>>>>>>>>>
>>>>>>>>>> RutaEngine::initializeStream:
>>>>>>>>>>
>>>>>>>>>> The special treatment of MARKUPs that causes the increased time 
>>>>>>>>>> required for initialization is just a workaround because I was too 
>>>>>>>>>> lazy to write a working JFlex rule. Well, I tried but failed. It 
>>>>>>>>>> shouldn't be hard to improve this code... I will create an issue 
>>>>>>>>>> for it. When I did the last performance optimization, uima did not 
>>>>>>>>>> check the indexes yet and my test set did not contain markups.
>>>>>>>>>>
>>>>>>>>>> Deactivate creation of RutaBasic:
>>>>>>>>>> The short answer is no. I was already thinking about making RutaBasic 
>>>>>>>>>> optional in the future so that users can configure whether they are used. 
>>>>>>>>>> However, right now, they are required for rule inference and make 
>>>>>>>>>> the rule inference "fast" in the first place. RutaBasic is just an 
>>>>>>>>>> internal annotation like RutaAnnotation (for SCORE, MARKSCORE) and 
>>>>>>>>>> RutaFrame, and rules should not match on them at all.
>>>>>>>>>>
>>>>>>>>>> Some background information:
>>>>>>>>>>
>>>>>>>>>> RutaBasics are used for three things:
>>>>>>>>>> - store additional information in order to avoid index operations. 
>>>>>>>>>> Some useful conditions would require many index operations, e.g., 
>>>>>>>>>> PARTOF or ENDSWITH. RutaBasic is utilized as a cache of which 
>>>>>>>>>> annotations start and end at which position, and which positions are 
>>>>>>>>>> covered by which types.
>>>>>>>>>> - provide a container to make this information available across 
>>>>>>>>>> analysis engines. Information shared by analysis engines is normally 
>>>>>>>>>> stored in the CAS, e.g. in annotations, (or in external resources). 
>>>>>>>>>> This is the role of RutaBasic. It is not really implemented right 
>>>>>>>>>> now as it should be but I will improve it soon. Then, there is no 
>>>>>>>>>> performance decrease when a pipeline is spammed with small ruta 
>>>>>>>>>> engines.
>>>>>>>>>> - provide a basic minimal disjoint partitioning of the document for the 
>>>>>>>>>> coverage-based visibility concept.
>>>>>>>>>>
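The cache role and the disjoint partitioning listed above can be illustrated with a minimal Java sketch. This is not Ruta's RutaBasic implementation: the class `SegmentCacheSketch`, its `Span` type and all method names are hypothetical, and the sketch assumes a non-empty document with all offsets inside it. The idea is to cut the document at every annotation boundary once, so that later PARTOF-like or ENDSWITH-like checks become lookups instead of index operations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration only; not Ruta's RutaBasic implementation. */
final class SegmentCacheSketch {

    /** Simplified annotation: a type name and character offsets [begin, end). */
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    private final int[] segmentStarts; // sorted start offsets of the disjoint segments
    private final List<Set<String>> coveringTypes = new ArrayList<>();
    private final Map<Integer, Set<String>> beginningAt = new HashMap<>();
    private final Map<Integer, Set<String>> endingAt = new HashMap<>();

    /** Assumes docLength > 0 and 0 <= begin < end <= docLength for every span. */
    SegmentCacheSketch(int docLength, List<Span> spans) {
        // Every annotation boundary becomes a cut: a minimal disjoint partitioning.
        TreeSet<Integer> cuts = new TreeSet<>(Arrays.asList(0, docLength));
        for (Span s : spans) {
            cuts.add(s.begin);
            cuts.add(s.end);
            beginningAt.computeIfAbsent(s.begin, k -> new HashSet<>()).add(s.type);
            endingAt.computeIfAbsent(s.end, k -> new HashSet<>()).add(s.type);
        }
        cuts.remove(docLength); // the last cut closes the final segment
        segmentStarts = new int[cuts.size()];
        int i = 0;
        for (int cut : cuts) {
            segmentStarts[i++] = cut;
            coveringTypes.add(new HashSet<>());
        }
        // Record once, per segment, which types cover it.
        for (Span s : spans) {
            int seg = segmentIndex(s.begin);
            while (seg < segmentStarts.length && segmentStarts[seg] < s.end) {
                coveringTypes.get(seg).add(s.type);
                seg++;
            }
        }
    }

    /** Index of the segment with the greatest start offset <= the given offset. */
    private int segmentIndex(int offset) {
        int pos = Arrays.binarySearch(segmentStarts, offset);
        return pos >= 0 ? pos : -pos - 2;
    }

    /** PARTOF-like lookup: is this position covered by an annotation of the type? */
    boolean isCoveredBy(int offset, String type) {
        return coveringTypes.get(segmentIndex(offset)).contains(type);
    }

    /** Does an annotation of the given type begin at this offset? */
    boolean beginsAt(int offset, String type) {
        return beginningAt.getOrDefault(offset, Collections.<String>emptySet()).contains(type);
    }

    /** ENDSWITH-like lookup: does an annotation of the given type end at this offset? */
    boolean endsAt(int offset, String type) {
        return endingAt.getOrDefault(offset, Collections.<String>emptySet()).contains(type);
    }

    int segmentCount() {
        return segmentStarts.length;
    }
}
```

The sketch trades memory for lookup speed: each segment carries its own set of covering types, which also hints at why such a structure can become costly for very large documents.
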
>>>>>>>>>> Making RutaBasic optional is possible. If there is a real need for 
>>>>>>>>>> it, e.g., in order to reduce the memory footprint or when processing 
>>>>>>>>>> large documents where parts are simply not interesting, then I will 
>>>>>>>>>> put it on my TODO list. I am also open to other/new ideas on how to 
>>>>>>>>>> solve the challenges (and for incremental usage of internal caches).
>>>>>>>>>>
>>>>>>>>>> What is your experience with the processing overhead concerning 
>>>>>>>>>> RutaBasic? Is it the rule matching or rather the initialization? I 
>>>>>>>>>> myself have already had some performance problems with the initialization 
>>>>>>>>>> and memory consumption in large CASes (500+ page PDFs). However, 
>>>>>>>>>> other components, serialization and the CAS editor were the actual 
>>>>>>>>>> bottlenecks.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Peter
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Am 22.12.2015 um 17:26 schrieb Mario Gazzo:
>>>>>>>>>>> I got around it by removing the default seeders by specifying an 
>>>>>>>>>>> empty seeders list since we don’t need the MARKUP annotations 
>>>>>>>>>>> anymore.
>>>>>>>>>>>
>>>>>>>>>>> I still don’t know why it created so much overhead but it sometimes 
>>>>>>>>>>> seemed to rival the POS tagger in processing time.
>>>>>>>>>>>
>>>>>>>>>>> Anyway, this leads me to the next question. Can I disable the 
>>>>>>>>>>> creation of Ruta basic annotations entirely to save processing 
>>>>>>>>>>> overhead and only apply Ruta rules to other annotation types 
>>>>>>>>>>> created by other AEs such as our own?
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>> Mario
>>>>>>>>>>>
>>>>>>>>>>>> On 21 Dec 2015, at 16:09 , Mario Juric <mario.juric...@gmail.com> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>
>>>>>>>>>>>> I noticed that occasionally the initialisation in 
>>>>>>>>>>>> RutaEngine::initializeStream can take a very long time. I can’t 
>>>>>>>>>>>> really explain it and it seems independent of document length 
>>>>>>>>>>>> since I have seen this with even very small XML documents.
>>>>>>>>>>>>
>>>>>>>>>>>> The method seems to spend much time in the DefaultSeeder when 
>>>>>>>>>>>> creating MARKUP annotations during subiterator.moveToNext calls 
>>>>>>>>>>>> (line 89) and inside Subiterator it seems to be the while loop 
>>>>>>>>>>>> inside adjustForStrictForward (line 232), which is inside UIMA 
>>>>>>>>>>>> core classes. I haven’t gone into any deeper analysis yet but I 
>>>>>>>>>>>> would first like to hear whether you have an idea what could be the main 
>>>>>>>>>>>> cause(s) for this?
>>>>>>>>>>>>
>>>>>>>>>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers
>>>>>>>>>>>> Mario
