Re: Very long Ruta stream initialization

Peter Klügl Thu, 07 Jan 2016 01:12:13 -0800

Hi,

thanks, that would be great. Patches are simply attached to the issue.
Non-trivial changes require an ICLA. Do you want to sign and submit it?


Best,

Peter
 

Am 07.01.2016 um 10:08 schrieb Mario Gazzo:
> Thanks,
>
> I just added the JIRA issue: https://issues.apache.org/jira/browse/UIMA-4729
>
> If you like, then we can also implement it and submit a patch, just let us 
> know what the process is.
>
> Cheers
> Mario
>
>> On 07 Jan 2016, at 09:08 , Peter Klügl <peter.klu...@averbis.com> wrote:
>>
>> Hi,
>>
>> Am 06.01.2016 um 14:48 schrieb Mario Gazzo:
>>> Hi Peter,
>>>
>>> I had a look at the test cases and I think there are many interesting and 
>>> useful features that cover many of our use cases but I will have to 
>>> experiment with them before I know what might be missing. I have a few 
>>> questions though:
>>>
>>> 1) It appears that we would then also be able to assign annotations to 
>>> lists, which is nice. I am not sure from looking at the tests whether it is 
>>> possible to use ADD with the annotation lists but I assume so.
>> Not yet, but I will implement it. It's still work in progress. But
>> thanks for pointing it out, I would probably have forgotten about it.
>>
>>> 2) The use of addresses is unclear to me just from reading the test, maybe 
>>> you could explain them? This concept is very new to me.
>> It's not intended to be utilized directly in a rule file. It's rather
>> just a way to combine logic in java with ruta rules or use ruta
>> functionality in java code.
>> Let's say we have a new method like
>> boolean Ruta.matches(CAS cas, String rule, AnnotationFS... annotations)
>> and you call it with something like (syntax is not yet specified)
>> Ruta.matches(cas, "${PARTOF(Headline)} Keyword;", annotation)
>> Then, the "$" would be replaced by the address of the annotation and the
>> method would return whether the annotation is covered by a Headline
>> annotation and is followed by a Keyword annotation.
>>
>>> 3) The annotation feature expression looks nice but I wonder whether an 
>>> array element can also be referenced using an int expression and not just a 
>>> constant e.g. Struct.as[intVar+1]{->T1};
>> Yes, without allowing number expressions, it would not really be useful.
>> The current implementation is just a test in order to check whether the
>> internal object model is good enough to cover it. The complete
>> functionality will probably not be included in the next release since
>> there is still much work left in order to get it up and running. The
>> semantics of such expressions (Struct.as) are resolved on the fly, and
>> the code does not support expressions at all. I still have to think
>> about a way to implement it.
>>
>>> The label expressions are also useful and will make some of our rules more 
>>> readable.
>>>
>>> Finally I have one additional question to the MARKUP initialisation. I have 
>>> a case where I need the token seeds coming from the default seeder but I 
>>> don’t want to run the markup initialisation. Is there a separate seeder 
>>> defined for this somewhere? Right now I have my own copy of the default 
>>> seeder without the MARKUP initialisation but obviously I do not want to 
>>> maintain this. It looks as if they could also be split in two seeders with 
>>> both added as default and then I could overwrite with my own seeder list 
>>> containing only the token seeder.
>> Yes, we can split them or just add another one that ignores markup. I
>> was also always thinking about adding a DetailedSeeder that creates much
>> more fine-grained types like different brackets and quotes... but it was
>> never on top of my todo list.
>>
>> Do you want to open a jira issue for it?
>>
>> Best,
>>
>> Peter
>>
>>> Cheers
>>> Mario
>>>
>>>
>>>> On 04 Jan 2016, at 17:06 , Peter Klügl <peter.klu...@averbis.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Am 04.01.2016 um 16:13 schrieb Mario Gazzo:
>>>>> Hi Peter,
>>>>>
>>>>> No problem, I was anyway pretty much offline myself during Christmas 
>>>>> holidays.
>>>>>
>>>>> The term “overhead” is probably an exaggeration in this context 
>>>>> especially after I disabled the MARKUP initialisation. We earlier 
>>>>> implemented our own XML markup annotator tailored to better fit our needs 
>>>>> with additional annotation types and properties, so the Ruta MARKUP is 
>>>>> currently not used. It just happens that we don’t directly use RutaBasic 
>>>>> in any of our rules in this particular case so I was curious to know 
>>>>> whether we could avoid creating them in the first place since there seem 
>>>>> to be quite a few. However, overall processing required by our Ruta 
>>>>> scripts compared to other processing steps is now small and 
>>>>> sub-optimising this further by making RutaBasic optional would currently 
>>>>> be of very low priority to us. We would prioritise other features higher 
>>>>> e.g. being able to assign annotations to variables as we discussed 
>>>>> previously in another thread.
>>>> I am working on this right now and there is finally some first progress :-)
>>>>
>>>> I fear that I won't catch all use cases (combinations with language
>>>> elements) with the first attempt. If you are interested (and wanna take
>>>> care I do not miss your use case), feel free to take a look at the new
>>>> unit tests:
>>>> https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation
>>>>
>>>> It's still work in progress. Proposals for more unit tests are very 
>>>> welcome.
>>>>
>>>>> We haven’t processed documents as large as those you mention since books 
>>>>> have so far been divided into chapters and processing could therefore be 
>>>>> parallelised accordingly. We also drop extreme outliers above a certain 
>>>>> size if we encounter them and then we batch process them later in smaller 
>>>>> chunks but this has so far not been necessary with our current data sets. 
>>>>> Like you, our processing bottlenecks are now in different components.
>>>> Ah, that's nice to hear that ruta is not the bottleneck :-D
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>>
>>>>> Cheers
>>>>> Mario
>>>>>
>>>>>> On 30 Dec 2015, at 16:44 , Peter Klügl <peter.klu...@averbis.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> sorry for the delayed reply.
>>>>>>
>>>>>> RutaEngine::initializeStream:
>>>>>>
>>>>>> The special treatment of MARKUPs that causes the increased time required 
>>>>>> for initialization is just a workaround because I was too lazy to write a 
>>>>>> working jflex rule. Well, I tried but failed. It shouldn't be hard to 
>>>>>> improve this code... I will create an issue for it. When I did the last 
>>>>>> performance optimization, uima did not check the indexes yet and my test 
>>>>>> set did not contain markups.
>>>>>>
>>>>>> Deactivate creation of RutaBasic:
>>>>>> Short answer is no. I was already thinking about making RutaBasic 
>>>>>> optional in future so that the user can configure if they are used. 
>>>>>> However, right now, they are required for rule inference and make the 
>>>>>> rule inference "fast" in the first place. RutaBasic is just an internal 
>>>>>> annotation like RutaAnnotation (for SCORE, MARKSCORE) and RutaFrame, and 
>>>>>> rules should not match on them at all.
>>>>>>
>>>>>> Some background information:
>>>>>>
>>>>>> RutaBasics are used for three things:
>>>>>> - store additional information in order to avoid index operations. Some 
>>>>>> useful conditions would require many index operations, e.g., PARTOF or 
>>>>>> ENDSWITH. RutaBasic is utilized as a cache of which annotations start and 
>>>>>> end at which position, and which positions are covered by which types.
>>>>>> - provide a container to make this information available across analysis 
>>>>>> engines. Information shared by analysis engines is normally stored in the 
>>>>>> CAS, e.g. in annotations, (or in external resources). This is the role 
>>>>>> of RutaBasic. It is not really implemented right now as it should be but 
>>>>>> I will improve it soon. Then, there is no performance decrease when a 
>>>>>> pipeline is spammed with small ruta engines.
>>>>>> - a basic minimal disjoint partitioning of the document for the coverage 
>>>>>> based visibility concept.
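The first and third roles listed above can be sketched roughly as follows. This is only an illustration with hypothetical class and field names, not the actual RutaBasic implementation: the document is cut at every annotation boundary into disjoint segments, and each segment caches which types begin there, end there, or cover it, so that a check like PARTOF becomes a set lookup instead of an index operation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-ins for illustration; not the actual Ruta classes.
final class BasicPartitionSketch {

  /** A typed span over the document text (end offset exclusive). */
  static final class Span {
    final String type;
    final int begin;
    final int end;

    Span(String type, int begin, int end) {
      this.type = type;
      this.begin = begin;
      this.end = end;
    }
  }

  /** One disjoint segment together with its cached type information. */
  static final class Basic {
    final int begin;
    final int end;
    final Set<String> beginning = new HashSet<>(); // types starting at 'begin'
    final Set<String> ending = new HashSet<>();    // types ending at 'end'
    final Set<String> partOf = new HashSet<>();    // types covering this segment

    Basic(int begin, int end) {
      this.begin = begin;
      this.end = end;
    }
  }

  /** Cuts the text at every span boundary and fills the per-segment caches. */
  static List<Basic> partition(List<Span> spans) {
    TreeSet<Integer> boundaries = new TreeSet<>();
    for (Span s : spans) {
      boundaries.add(s.begin);
      boundaries.add(s.end);
    }
    List<Basic> basics = new ArrayList<>();
    Integer previous = null;
    for (int boundary : boundaries) {
      if (previous != null) {
        basics.add(new Basic(previous, boundary));
      }
      previous = boundary;
    }
    // Quadratic fill for brevity; a real implementation would be incremental.
    for (Basic basic : basics) {
      for (Span s : spans) {
        if (s.begin <= basic.begin && basic.end <= s.end) {
          basic.partOf.add(s.type);
          if (s.begin == basic.begin) {
            basic.beginning.add(s.type);
          }
          if (s.end == basic.end) {
            basic.ending.add(s.type);
          }
        }
      }
    }
    return basics;
  }
}
```

With a Headline span over [0, 10) and a Keyword span over [4, 7), the sketch produces three segments, and the middle one records that it is part of Headline and that Keyword both begins and ends on it.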
>>>>>>
>>>>>> Making RutaBasic optional is possible. If there is a real need for it, 
>>>>>> e.g., in order to reduce the memory footprint or when processing large 
>>>>>> documents where parts are simply not interesting, then I will put it on 
>>>>>> my TODO list. I am also open for other/new ideas how to solve the 
>>>>>> challenges (and for incremental usage of internal caches).
>>>>>>
>>>>>> What is your experience with the processing overhead concerning 
>>>>>> RutaBasic? Is it the rule matching or rather the initialization? I 
>>>>>> myself already had some performance problems with the initialization and 
>>>>>> memory consumption in large CASes (PDFs with 500+ pages). However, other 
>>>>>> components, serialization and the CAS editor were the actual bottlenecks.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>>
>>>>>> Am 22.12.2015 um 17:26 schrieb Mario Gazzo:
>>>>>>> I got around it by removing the default seeders by specifying an empty 
>>>>>>> seeders list since we don’t need the MARKUP annotations anymore.
>>>>>>>
>>>>>>> I still don’t know why it created so much overhead but it sometimes 
>>>>>>> seemed to rival the POS tagger in processing time.
>>>>>>>
>>>>>>> Anyway, this leads me to the next question. Can I disable the creation 
>>>>>>> of Ruta basic annotations entirely to save processing overhead and only 
>>>>>>> apply Ruta rules to other annotation types created by other AEs such as 
>>>>>>> our own?
>>>>>>>
>>>>>>> Cheers
>>>>>>> Mario
>>>>>>>
>>>>>>>> On 21 Dec 2015, at 16:09 , Mario Juric <mario.juric...@gmail.com> 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Peter,
>>>>>>>>
>>>>>>>> I noticed that occasionally the initialisation in 
>>>>>>>> RutaEngine::initializeStream can take a very long time. I can’t really 
>>>>>>>> explain it and it seems independent of document length since I have 
>>>>>>>> seen this with even very small XML documents.
>>>>>>>>
>>>>>>>> The method seems to spend much time in the DefaultSeeder when creating 
>>>>>>>> MARKUP annotations during subiterator.moveToNext calls (line 89) and 
>>>>>>>> inside Subiterator it seems to be the while loop inside 
>>>>>>>> adjustForStrictForward (line 232), which is inside UIMA core classes. 
>>>>>>>> I haven’t gone into any deeper analysis yet but I would first like to hear 
>>>>>>>> whether you have an idea what could be the main cause(s) for this?
>>>>>>>>
>>>>>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>>>>>
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>> Mario
>
