Re: UIMA Ruta into jar?
On 14-10-23 09:40 AM, Piyush Paliwal wrote:
> Hi Richard,
>
> It seems to work now. Thanks. As I was only at the testing stage, I forgot to add the other descriptors (OpenNlpTagger, etc.) before the Ruta descriptor in the pipeline. Those were needed so that the CAS can find all types. Though it is a slightly hectic solution (copy and paste), it is workable and therefore great.

I am glad that you made it work! If you want to reduce XML boilerplate, you can look at uimaFIT [1], a library offering a very nice Java API to replace XML descriptors.

Alexandre

[1] http://uima.apache.org/uimafit.html

> Piyush
>
> On Thu, Oct 23, 2014 at 8:10 AM, Richard Eckart de Castilho wrote:
>> On 23.10.2014, at 00:39, Piyush Paliwal wrote:
>>> As an example, I wish to import the following types from the TypeSystem.xml descriptor, which resides in the same folder as the script (both files are now in the Java project).
>>>
>>> //import the additional annotations types and alias in short name
>>> IMPORT de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NN FROM uima.ruta.example.TypeSystem AS _NN;
>>> IMPORT de.tudarmstadt.ukp.dkpro.core.api.syntax.type.constituent.PP FROM uima.ruta.example.TypeSystem AS _PP;
>>
>> I assume you are invoking Ruta via uimaFIT? If yes, then you should make sure that uimaFIT can find all necessary type systems via the type detection mechanism [1]. If you are not using uimaFIT, or if you have some special way to create your CASes, make sure that all types that your scripts need are already loaded when the CAS is created.
>>
>> UIMA does not allow changing the type system while a pipeline is running. Thus the IMPORT declarations will normally not be interpreted when the script is executed.
>>
>> I do not know how the IMPORT (type) AS (alias) is implemented. If the alias is set up at execution time and not at CAS initialization time, it should work. Alexandre?
>>
>> Cheers,
>> -- Richard
>>
>> [1] http://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#d5e531
Re: UIMA Ruta into jar?
On 14-10-23 02:10 AM, Richard Eckart de Castilho wrote:
> On 23.10.2014, at 00:39, Piyush Paliwal wrote:
>> As an example, I wish to import the following types from TypeSystem.xml descriptor which also resides in same folder as script (both files now in Java project).
>>
>> //import the additional annotations types and alias in short name
>> IMPORT de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NN FROM uima.ruta.example.TypeSystem AS _NN;
>> IMPORT de.tudarmstadt.ukp.dkpro.core.api.syntax.type.constituent.PP FROM uima.ruta.example.TypeSystem AS _PP;
>
> I do not know how the IMPORT (type) AS (alias) is implemented. If the alias is set up at execution time and not at CAS initialization time, it should work. Alexandre?

IMPORT instructions and aliases are resolved at the same time as TYPESYSTEM instructions, when the first CAS is processed.

Best,
Alexandre
Re: UIMA Ruta into jar?
Hi Piyush,

A while ago, I wrote a blog post on how to package a RUTA script with Maven:
http://textjuicer.com/blog/2013/09/08/using-ruta-in-a-maven-project/

Even if you do not use Maven, it should give you an idea of the files to distribute in your jars.

Hope this helps,
Alexandre

On 14-10-22 07:35 AM, Piyush Paliwal wrote:
> Hi,
>
> We are developing a Ruta project and want to access it from a Java project. Currently, we add the descriptor (generated from the Ruta script) to a UIMA pipeline that lives in the Java project. The pipeline can only be run in the workspace; we are not able to build a single jar of that Java project and run it on the command line, because it cannot access the Ruta project as a dependency.
>
> There is also a direct way to read a Ruta script from Java, but the script cannot import annotations from type systems if we put it in the Java project (i.e. it needs the Ruta editor).
>
> Is there any way to add the Ruta project as a dependency of the Java project?
>
> Thanks.
> Piyush
Re: sendCAS is slow
This is good news :) Did you try to increase the number of CASes in the pool as Jerry suggested [1]?

You can reply to the list as well; there are a lot of people more knowledgeable than me who can help you there.

[1] http://uima.markmail.org/search/?q=#query:+page:1+mid:4aa3ifmzg5zvj4bm+state:results

On 24/09/2014 09:49, xym210 wrote:
> No, everything seems to have worked correctly. When I deployed two collection reader instances, the processing speed improved.
>
> -- Sent from NetEase Mail for Android
>
> On 2014-09-24 at 21:44, Alexandre Patry <mailto:alexandre.pa...@keatext.com> wrote:
>> Did you get an error message or a stack trace?
>>
>> On 24/09/2014 09:38, xym210 wrote:
>>> It doesn't work. When I deploy the CollectionReader and the AE colocated, it doesn't work either. Is there something I misunderstood? Thanks.
>>>
>>> -- Sent from NetEase Mail for Android
>>>
>>> On 2014-09-23 at 20:57, Alexandre Patry <mailto:alexandre.pa...@keatext.com> wrote:
>>>> Did you try to use binary serialisation instead of XML serialisation for the CAS? For more information on binary serialisation, you can search for the word "binary" in the UIMA-AS user guide (http://uima.apache.org/d/uima-as-2.6.0/uima_async_scaleout.html).
>>>>
>>>> Hope this helps,
>>>> Alexandre
>>>>
>>>> On 23/09/2014 03:16, xia yongmin wrote:
>>>>> hi,
>>>>>
>>>>> I am new to UIMA, and I have run into the following problem:
>>>>>
>>>>> Suppose I have a CollectionReader, an AE and a CAS Consumer. It takes 1 ms for the CollectionReader to initialize a CAS, 5 ms for the AE to analyze it, and 1 ms for the CAS Consumer to consume it.
>>>>>
>>>>> It seems that I could deploy 5 instances of the AE to get five times the speed, but when I deploy 3 instances of the AE, the speed does not improve.
>>>>>
>>>>> I also found that it takes a long time for UIMA to send a CAS from the CollectionReader to the AE using the sendCAS(cas) method.
>>>>>
>>>>> How can I solve this problem?
>>>>>
>>>>> Many thanks.

-- Alexandre Patry, Ph.D Chercheur Principal / Principal Researcher http://KeaText.com
Re: sendCAS is slow
Did you get an error message or a stack trace?

On 24/09/2014 09:38, xym210 wrote:
> It doesn't work. When I deploy the CollectionReader and the AE colocated, it doesn't work either. Is there something I misunderstood? Thanks.
>
> -- Sent from NetEase Mail for Android
>
> On 2014-09-23 at 20:57, Alexandre Patry <mailto:alexandre.pa...@keatext.com> wrote:
>> Did you try to use binary serialisation instead of XML serialisation for the CAS? For more information on binary serialisation, you can search for the word "binary" in the UIMA-AS user guide (http://uima.apache.org/d/uima-as-2.6.0/uima_async_scaleout.html).
>>
>> Hope this helps,
>> Alexandre
>>
>> On 23/09/2014 03:16, xia yongmin wrote:
>>> hi,
>>>
>>> I am new to UIMA, and I have run into the following problem:
>>>
>>> Suppose I have a CollectionReader, an AE and a CAS Consumer. It takes 1 ms for the CollectionReader to initialize a CAS, 5 ms for the AE to analyze it, and 1 ms for the CAS Consumer to consume it.
>>>
>>> It seems that I could deploy 5 instances of the AE to get five times the speed, but when I deploy 3 instances of the AE, the speed does not improve.
>>>
>>> I also found that it takes a long time for UIMA to send a CAS from the CollectionReader to the AE using the sendCAS(cas) method.
>>>
>>> How can I solve this problem?
>>>
>>> Many thanks.

-- Alexandre Patry, Ph.D Chercheur Principal / Principal Researcher http://KeaText.com
Re: sendCAS is slow
Did you try to use binary serialisation instead of XML serialisation for the CAS? For more information on binary serialisation, you can search for the word "binary" in the UIMA-AS user guide (http://uima.apache.org/d/uima-as-2.6.0/uima_async_scaleout.html).

Hope this helps,
Alexandre

On 23/09/2014 03:16, xia yongmin wrote:
> hi,
>
> I am new to UIMA, and I have run into the following problem:
>
> Suppose I have a CollectionReader, an AE and a CAS Consumer. It takes 1 ms for the CollectionReader to initialize a CAS, 5 ms for the AE to analyze it, and 1 ms for the CAS Consumer to consume it.
>
> It seems that I could deploy 5 instances of the AE to get five times the speed, but when I deploy 3 instances of the AE, the speed does not improve.
>
> I also found that it takes a long time for UIMA to send a CAS from the CollectionReader to the AE using the sendCAS(cas) method.
>
> How can I solve this problem?
>
> Many thanks.

-- Alexandre Patry, Ph.D Chercheur Principal / Principal Researcher http://KeaText.com
Re: RUTA: case insensitive regex rule?
On 29/08/2014 03:34, Renaud Richardet wrote:
> (How) can I make the following rule case insensitive?
>
> "\\b((inter)?neurone?s?|cells?)\\b" -> Neuron;

You can turn on the "ignore case" flag by prefixing your regex with (?i):

"(?i)\\b((inter)?neurone?s?|cells?)\\b" -> Neuron;

Hope this helps,
Alexandre

-- Alexandre Patry, Ph.D Chercheur Principal / Principal Researcher http://KeaText.com
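For readers who want to check the flag outside a Ruta script, here is a minimal Java sketch. It assumes that Ruta hands string rules like the one above to java.util.regex (the (?i) syntax suggests so, but that is an assumption, not something stated in the thread). The class name CaseInsensitiveNeuronDemo and the sample sentence are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Ruta or UIMA.
final class CaseInsensitiveNeuronDemo {

    // Same expression as in the rule above; the leading (?i) switches on
    // case-insensitive matching for the whole pattern.
    static final Pattern NEURON =
            Pattern.compile("(?i)\\b((inter)?neurone?s?|cells?)\\b");

    // Returns every substring of the text that the pattern matches, in order.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = NEURON.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Prints the matched words regardless of their capitalisation.
        System.out.println(findAll("Neurons and INTERNEURONS talk to other Cells."));
    }
}
```

Without the (?i) prefix, the same pattern would find nothing in that sentence, since every candidate word contains an upper-case letter.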
Re: Ruta - best practices for unit tests?
Hi Renaud,

On 14-08-18 05:30 PM, Renaud Richardet wrote:
> Hello,
>
> What are best practices for writing unit tests for Ruta? Ideally, I would like to have
>
> 1) tests that can be run on the command line (so as to automate them in Jenkins), and

We use JUnit and it works quite well for us. I have a small example project on GitHub with a RUTA script and its unit test (https://github.com/apatry/ruta-with-maven). You can also look at the RUTA test suite if you want more examples (http://svn.apache.org/viewvc/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/).

> 2) where input and expected output can be edited in a text editor (meaning: not xmi's or java code).

Is there a reason why you want to avoid Java code for unit tests? Building and inspecting a CAS in Java for each test allows a lot of flexibility and makes it possible to test each analysis engine outside of its pipeline. uimaFIT is an excellent tool for that (http://uima.apache.org/uimafit.html).

Hope this helps,
Alexandre

-- Alexandre Patry, Ph.D Chercheur / Researcher http://KeaText.com
Re: Loading a resource from the classpath
Well, the error was on the other side of the screen. This works perfectly well with fileUrl, I only had a typo in my path.

On 07/08/2014 11:37, Alexandre Patry wrote:
> Hi,
>
> I would like to locate a resource in the classpath, something along the lines of:
>
>   LocationDictionary
>   Dictionary of locations
>   path/in/jar/location-dictionary.xml
>   org.apache.uima.conceptMapper.support.dictionaryResource.DictionaryResource_impl
>
> Is it possible with existing resource specifiers or do I have to write my own and use a CustomResourceSpecifier?
>
> Thanks!
> Alexandre

-- Alexandre Patry, Ph.D Chercheur Principal / Principal Researcher http://KeaText.com
Loading a resource from the classpath
Hi,

I would like to locate a resource in the classpath, something along the lines of:

  LocationDictionary
  Dictionary of locations
  path/in/jar/location-dictionary.xml
  org.apache.uima.conceptMapper.support.dictionaryResource.DictionaryResource_impl

Is it possible with existing resource specifiers or do I have to write my own and use a CustomResourceSpecifier?

Thanks!
Alexandre

-- Alexandre Patry, Ph.D Chercheur Principal / Principal Researcher http://KeaText.com
Re: dinamically type system creation
Hi Tiziano,

On 13/05/2014 09:55, Tiziano Lorenzetti wrote:
> Dear all,
>
> I'm new to UIMA and I'm trying to develop an annotator that dynamically creates a type system with several feature structures. To accomplish this, the annotator does:
>
> ...
> TypeSystemDescription tsd = TypeSystemDescriptionFactory.createTypeSystemDescription(new String[0]);
> tsd.addType("it.uniroma2.art.ExcelAnnotation", "", "uima.tcas.Annotation");
> TypeDescription type = tsd.getType("it.uniroma2.art.ExcelAnnotation");
> type.addFeature("newUIMAFeature", "", "uima.cas.String");
> ...
>
> In another annotator, I try to access this type system and its features in this way:
>
> TypeSystem ts = aCAS.getTypeSystem();
> Iterator types = ts.getTypeIterator();
> Iterator features = ts.getFeatures();
>
> but neither the type nor its features are present. How could I reach my goal?

How do you create your CAS? I guess the types should be found if you create it using:

CAS aCAS = CasCreationUtils.createCas(typeSystemDescription, null, null);

Hope this helps,
Alexandre

-- Alexandre Patry, Ph.D Chercheur / Researcher http://KeaText.com
Re: RandomAccessFile problem in UIMA
Hi Debbie,

I do not use Eclipse, so I won't be of any help regarding Maven and Eclipse interoperability. The simplest thing is probably to download extJWNL from http://sourceforge.net/projects/extjwnl/files/ and add all jars under the lib/ directory to your project. Once that is done, you should be able to load the dictionary with the following line of code:

Dictionary dictionary = Dictionary.getDefaultResourceInstance();

Let me know if it helps,
Alexandre

On 14-05-02 08:47 PM, Debbie Zhang wrote:
> Thanks Alexandre for your reply! I will try extJWNL as suggested. As I have never used Maven, may I ask which Maven Eclipse plugin you use? Thanks again for your help!
>
> Regards,
> Debbie
>
> -----Original Message-----
> From: Alexandre Patry [mailto:alexandre.pa...@keatext.com]
> Sent: Saturday, 3 May 2014 12:13 AM
> To: user@uima.apache.org
> Subject: Re: RandomAccessFile problem in UIMA
>
> Hi Debbie,
>
> I recommend you to use extJWNL (https://github.com/extjwnl/extjwnl) instead of JWNL. We made the switch from JWNL and never looked back.
>
> For your path problems, extJWNL distributes WordNet dictionaries as Maven dependencies. It should become a non-issue.
>
> Hope this helps,
> Alexandre
>
> On 02/05/2014 03:36, Debbie Zhang wrote:
>> Hi,
>>
>> I am having problems using JWNL WordNet in UIMA. JWNL uses RandomAccessFile to read the WordNet dictionary files. In order to create a PEAR file, the WordNet dictionary files are put in the resources/wordnet folder under the project. As resources is in my Build Path, I have no problem running the application I created in Eclipse. Therefore, I am certain the dictionary files can be read.
>>
>> However, when I use the UIMA Document Analyzer or the UIMA CAS Visual Debugger to run the annotation, I get the following error:
>>
>> java.io.FileNotFoundException: resources/wordnet/data.noun (No such file or directory)
>>
>> The error comes from the following code:
>>
>> RandomAccessFile _file = new RandomAccessFile(path, _permissions);
>>
>> I use the following code to check the location the class was loaded from:
>>
>> URL location = PrincetonRandomAccessDictionaryFile.class.getProtectionDomain().getCodeSource().getLocation();
>> System.out.println(location.getFile());
>>
>> It seems both situations have the same location: /project/bin/
>>
>> Did anyone encounter a similar problem before? Any suggestion is welcome. Thank you!
>>
>> Regards,
>> Debbie

-- Alexandre Patry, Ph.D Chercheur / Researcher http://KeaText.com
Re: RandomAccessFile problem in UIMA
Hi Debbie,

I recommend you to use extJWNL (https://github.com/extjwnl/extjwnl) instead of JWNL. We made the switch from JWNL and never looked back.

For your path problems, extJWNL distributes WordNet dictionaries as Maven dependencies. It should become a non-issue.

Hope this helps,
Alexandre

On 02/05/2014 03:36, Debbie Zhang wrote:
> Hi,
>
> I am having problems using JWNL WordNet in UIMA. JWNL uses RandomAccessFile to read the WordNet dictionary files. In order to create a PEAR file, the WordNet dictionary files are put in the resources/wordnet folder under the project. As resources is in my Build Path, I have no problem running the application I created in Eclipse. Therefore, I am certain the dictionary files can be read.
>
> However, when I use the UIMA Document Analyzer or the UIMA CAS Visual Debugger to run the annotation, I get the following error:
>
> java.io.FileNotFoundException: resources/wordnet/data.noun (No such file or directory)
>
> The error comes from the following code:
>
> RandomAccessFile _file = new RandomAccessFile(path, _permissions);
>
> I use the following code to check the location the class was loaded from:
>
> URL location = PrincetonRandomAccessDictionaryFile.class.getProtectionDomain().getCodeSource().getLocation();
> System.out.println(location.getFile());
>
> It seems both situations have the same location: /project/bin/
>
> Did anyone encounter a similar problem before? Any suggestion is welcome. Thank you!
>
> Regards,
> Debbie

-- Alexandre Patry, Ph.D Chercheur / Researcher http://KeaText.com
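One likely explanation for the FileNotFoundException quoted above, offered as an assumption and not as a diagnosis: a relative path such as resources/wordnet/data.noun is resolved against the JVM working directory (the user.dir system property), not against the location the classes were loaded from. If the Document Analyzer or the CAS Visual Debugger is started from another directory, the same relative path points somewhere else, even though the code source location is identical. The sketch below only uses java.io.File; the class name RelativePathSketch is hypothetical.

```java
import java.io.File;

// Hypothetical helper, not part of JWNL, extJWNL or UIMA.
final class RelativePathSketch {

    // Shows where a relative path ends up once the JVM resolves it.
    // java.io.File resolves relative paths against the "user.dir" property.
    static String resolve(String relativePath) {
        return new File(relativePath).getAbsolutePath();
    }

    public static void main(String[] args) {
        // The working directory depends on how the JVM was launched,
        // so the second line changes when the launcher changes.
        System.out.println("user.dir = " + System.getProperty("user.dir"));
        System.out.println("resolved = " + resolve("resources/wordnet/data.noun"));
    }
}
```

Printing these two values from inside the annotator, once in Eclipse and once in the UIMA tool, would show whether the working directory differs between the two launches.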
Re: UIMA-OPENNLP
UIMA descriptors are distributed with the source code of opennlp-uima. You can grab them from http://svn.apache.org/viewvc/opennlp/trunk/opennlp-uima/descriptors/.

Hope this helps,
Alexandre

On 03/04/2014 18:53, Pathima Nusrath Hameed wrote:
> Hi,
>
> I am interested in using UIMA for clinical text data processing. I am working on the Windows 7 platform. I installed UIMA but I could not configure the OpenNLP tools. OpenNLP descriptors are not available in UIMA.
>
> I would be glad if you could help me in this matter. I appreciate your reply.
>
> Thank you

-- Alexandre Patry, Ph.D Chercheur / Researcher http://KeaText.com
Re: Serializing multiple JCas objects to a single file
On 14-02-01 08:22 PM, Samudra Banerjee wrote:
> Hi Experts,
>
> I have a scenario where processing a Wikipedia XML dump generates a huge number of JCas objects (~1 million), one per page. I want to serialize these JCas objects for later use, but generating 1 million different files will take a toll on the system.
>
> So I was wondering if there was a way to serialize multiple JCas objects to a single file for later retrieval. Any idea if this can be achieved?

The JDK provides classes to read and write zip files (see http://docs.oracle.com/javase/7/docs/api/java/util/zip/package-summary.html). You could serialize each JCas into its own entry of a zip file.

Best,
Alexandre

-- Alexandre Patry, Ph.D Chercheur / Researcher http://KeaText.com
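The zip suggestion above can be sketched with the JDK alone. To keep the example self-contained, each "document" is just a byte array and the archive is built in memory; in a real pipeline one would presumably wrap a FileOutputStream instead and write each entry with something like XmiCasSerializer.serialize(jcas.getCas(), zip), which is not shown here. The class name ZipArchiveSketch and the entry names are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: one zip entry per serialized document.
final class ZipArchiveSketch {

    // Writes every (entry name -> payload) pair into a single zip archive.
    static byte[] pack(Map<String, byte[]> documents) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> doc : documents.entrySet()) {
                zip.putNextEntry(new ZipEntry(doc.getKey()));
                zip.write(doc.getValue()); // a CAS serializer would write here instead
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the archive back, entry by entry, in the order it was written.
    static Map<String, byte[]> unpack(byte[] archive) {
        Map<String, byte[]> documents = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
                documents.put(entry.getName(), out.toByteArray());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return documents;
    }

    public static void main(String[] args) {
        Map<String, byte[]> pages = new LinkedHashMap<>();
        pages.put("page-1.xmi", "first page".getBytes());
        pages.put("page-2.xmi", "second page".getBytes());
        System.out.println(unpack(pack(pages)).keySet());
    }
}
```

Reading entries sequentially with ZipInputStream suits a later batch pass over all pages; random access to a single page would call for java.util.zip.ZipFile on a file on disk instead.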
Re: UIMA Ruta 2.1.0 Issues
On 2013-12-17 12:10, Peter Klügl wrote:
> On 17.12.2013 18:00, Alexandre Patry wrote:
>> On 2013-12-17 11:56, Peter Klügl wrote:
>>> Hi,
>>>
>>> some of the rules behave as expected. It's maybe a bit counterintuitive, but I do not see a way to improve it. I will fix the rest in the next few days.
>>>
>>> An example:
>>>
>>> (SPECIAL ALL* SPECIAL) {-> MARK(TMP_GenericAllSTAR)};
>>>
>>> ALL is a parent type of SPECIAL and * is a greedy quantifier. Therefore ALL matches on all annotations and also on the SPECIAL annotations until the end of the document. Then, there is no SPECIAL annotation left to match and the rule fails.
>>
>> Using a reluctant quantifier should work as expected for this specific case:
>>
>> (SPECIAL ALL*? SPECIAL) {-> MARK(TMP_GenericAllSTAR)};
>
> Just another comment that has nothing to do with the problem :-)
>
> The rule is of course somewhat "slow". I would rather rewrite it as:
>
> (SPECIAL # SPECIAL) {-> MARK(TMP_GenericAllSTAR)};
>
> Here, the wildcard searches for the next SPECIAL annotation in the index and does not have to match on each token until the next SPECIAL annotation.

Nice trick, thanks for sharing! Is there a cookbook somewhere where all these tricks are stored?

-- Alexandre Patry, Ph.D Chercheur / Researcher http://KeaText.com
Re: UIMA Ruta 2.1.0 Issues
On 2013-12-17 11:56, Peter Klügl wrote:
> Hi,
>
> some of the rules behave as expected. It's maybe a bit counterintuitive, but I do not see a way to improve it. I will fix the rest in the next few days.
>
> An example:
>
> (SPECIAL ALL* SPECIAL) {-> MARK(TMP_GenericAllSTAR)};
>
> ALL is a parent type of SPECIAL and * is a greedy quantifier. Therefore ALL matches on all annotations and also on the SPECIAL annotations until the end of the document. Then, there is no SPECIAL annotation left to match and the rule fails.

Using a reluctant quantifier should work as expected for this specific case:

(SPECIAL ALL*? SPECIAL) {-> MARK(TMP_GenericAllSTAR)};

Hope this helps,
Alexandre

-- Alexandre Patry, Ph.D Chercheur / Researcher http://KeaText.com
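Ruta rules are not Java regular expressions, so the following is only an analogy. As described above, the greedy ALL* consumes everything up to the end of the document and does not give anything back; in java.util.regex that behaviour corresponds to a possessive quantifier (*+), while the reluctant *? stops at the first opportunity. In this sketch, '#' plays the role of SPECIAL and '.' plays the role of ALL (which also matches '#'); the class name QuantifierAnalogy and the sample text are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: a Java regex analogy for the Ruta quantifier behaviour.
final class QuantifierAnalogy {

    // Returns the first match of the regex in the text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String text = "a # b c # d";

        // Possessive: ".*+" swallows the rest of the input, including the
        // second '#', and never backtracks, so the closing '#' cannot match.
        System.out.println(firstMatch("#.*+#", text)); // null

        // Reluctant: ".*?" expands one character at a time and stops as soon
        // as the closing '#' can match.
        System.out.println(firstMatch("#.*?#", text)); // # b c #
    }
}
```

Note that Java's plain greedy quantifier (".*") would backtrack and still find a match here, which is why the possessive form is the closer analogy to the behaviour described for Ruta.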
Re: Macros in Ruta? - How to make long scripts short?
On 2013-12-06 10:46, Richard Eckart de Castilho wrote:
> Hi,
>
> assuming I have a Ruta script with recurring statements of the type
>
> PartOfSpeech{FEATURE("value", "N")}
>
> Is it possible to define some kind of macro to replace this long statement with a short-hand?
>
> MACRO N := PartOfSpeech{FEATURE("value", "N")}
> MACRO V := PartOfSpeech{FEATURE("value", "V")}
> N{0,2} V

From what I know, RUTA does not support macros yet.

> The closest thing I found in Ruta for such a thing was a Block - but doesn't seem to do what I want, because I would need to ->CALL it.

I would define temporary annotations for N and V. The performance trade-off is not the same though: it consumes more memory, but searching for N or V no longer requires scanning all part-of-speech annotations.

Hope this helps,
Alexandre

-- Alexandre Patry, Ph.D Chercheur / Researcher http://KeaText.com
Re: Problem writing ruta extensions
On 2013-12-04 12:33, Sebastian wrote:
> Hi,
>
> I'm highly interested in Ruta and its potential applications in industry.
>
> Right now I'm trying to create a simple toy condition extension that is simply a case-insensitive INLIST condition. It is completely based on the InListCondition class; I also declared an implementation of the IRutaConditionExtension interface.
>
> With primitive types everything seems to work great, except when the condition is used with a variable:
>
> STRINGLIST MonthsList = {"january", ...};
> DECLARE Month;
> ANY{INSENSITIVEINLIST(MonthsList) -> MARK(Month)};
>
> I get a class cast exception when the condition is being created, because MonthsList is a SimpleTypeExpression and I'm expecting a StringListExpression.
>
> Am I doing something wrong? I suppose there is a way to resolve the variable to the actual list, but I missed it somehow.

It may not help you to get your toy extension working, but for small lists I like to use regular expressions, where case insensitivity is free:

W{REGEXP("(?i)january|february|march|...|december") -> MARK(Month)}

Regards,
Alexandre

-- Alexandre Patry, Ph.D Chercheur / Researcher http://KeaText.com
Re: [ruta] How to efficiently delete an annotation only if it appears within the N first token of a document?
On 2013-08-28 15:20, Peter Klügl wrote:
> On 28.08.2013 20:33, Alexandre Patry wrote:
>> On 2013-08-28 12:19, Peter Klügl wrote:
>>> On 28.08.2013 18:17, Alexandre Patry wrote:
>>>> I will be happy to test drive MARKFIRST when it will be in trunk.
>>>
>>> It's already in the trunk. If you want, then I can also think of something that avoids the visibility problem.
>>
>> I was able to make it work in my application, but my Eclipse plugin does not recognize the MARKFIRST keyword. Here is what I did:
>>
>> 1. Uninstall the RUTA Workbench plugin from Eclipse
>> 2. `mvn clean install` in ruta/trunk
>> 3. `mvn clean package -Declipse.home=/usr/lib/eclipse -Duima-eclipse-jar-processor=/usr/lib/eclipse/plugins/org.eclipse.equinox.p2.jarprocessor_1.0.200.dist.jar -Declipse-equinox-launcher=/usr/lib/eclipse/plugins/org.eclipse.equinox.launcher_1.2.0.dist.jar` in ruta/trunk/ruta-eclipse-update-site
>> 4. Re-install the RUTA Workbench plugin in Eclipse from ruta/trunk/ruta-eclipse-update-site/target/eclipse-update-site
>>
>> Did I miss something?
>
> I will do some testing tomorrow, but my first guess is that uninstall does not remove the plugins, only the feature. When you install the feature again with the same version, the plugins have not changed, as they are already present in the same version.
>
> You could try to simply replace the plugins in your Eclipse installation and restart it with -clean.

Your guess is right :) I uninstalled Ruta from Eclipse, removed the Ruta jars in $ECLIPSE_HOME/plugins and removed all entries referencing Ruta in $ECLIPSE_HOME/artifacts.xml. I then installed it again from Eclipse and now it is working.

-- Alexandre Patry, Ph.D Chercheur / Researcher http://KeaText.com Transformez vos documents en outils de décision << Turn your documents into decision tools
Re: [ruta] How to efficiently delete an annotation only if it appears within the N first token of a document?
On 2013-08-28 12:19, Peter Klügl wrote:
> On 28.08.2013 18:17, Alexandre Patry wrote:
>> I will be happy to test drive MARKFIRST when it will be in trunk.
>
> It's already in the trunk. If you want, then I can also think of something that avoids the visibility problem.

I was able to make it work in my application, but my Eclipse plugin does not recognize the MARKFIRST keyword. Here is what I did:

1. Uninstall the RUTA Workbench plugin from Eclipse
2. `mvn clean install` in ruta/trunk
3. `mvn clean package -Declipse.home=/usr/lib/eclipse -Duima-eclipse-jar-processor=/usr/lib/eclipse/plugins/org.eclipse.equinox.p2.jarprocessor_1.0.200.dist.jar -Declipse-equinox-launcher=/usr/lib/eclipse/plugins/org.eclipse.equinox.launcher_1.2.0.dist.jar` in ruta/trunk/ruta-eclipse-update-site
4. Re-install the RUTA Workbench plugin in Eclipse from ruta/trunk/ruta-eclipse-update-site/target/eclipse-update-site

Did I miss something?

Thanks,
Alexandre

-- Alexandre Patry, Ph.D Chercheur / Researcher http://KeaText.com Transformez vos documents en outils de décision << Turn your documents into decision tools
Re: [ruta] How to efficiently delete an annotation only if it appears within the N first token of a document?
On 2013-08-28 11:25, Peter Klügl wrote:
> On 28.08.2013 16:52, Alexandre Patry wrote:
>> Hi,
>>
>> I use RUTA and I want to delete an annotation if it is within the first 50 tokens of a document. I came up with the following rules:
>>
>> ANY{POSITION(Document, 1)-> Header}; // Annotate the first token in the document
>> Header{->SHIFT(Header, 1, 2)} ANY[0,49]; // Appends the 49 following tokens
>> ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)}; // Delete the first ToDelete if it is within the header
>>
>> These rules work as expected but they are *really* slow. Is there a faster way to achieve that?
>
> Oh yes, the first rule is really slow. I always miss an action MARKFIRST (as there is a MARKLAST). I will add it today or tomorrow.
>
> There are two reasons why the first rule is slow: ANY has to look at all tokens and POSITION is just the slowest condition in Ruta.
>
> For now you could use a rule like:
>
> ANY{STARTSWITH(Document)-> Header};
>
> ... which avoids at least the POSITION condition.
>
> A simple test with a 200 W document:
>
> ...
> ANY{POSITION(Document, 1)-> Header}; // [0.274s|93.52%]
> Header{->SHIFT(Header, 1, 2)} ANY[0,49]; // [0.090s|3.07%]
> ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)}; // [0.030s|1.02%]
> ...
> ANY{STARTSWITH(Document)-> Header}; // [0.047s|50.00%]
> Header{->SHIFT(Header, 1, 2)} ANY[0,49]; // [0.029s|30.85%]
> ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)}; // [0.011s|11.7%]
>
> well, that's still slow (in debug mode) and I actually wonder why the other rules are getting faster... but I hope that the performance will soon be improved :-)

Just tried it and it is much better, thanks!

Many of my documents start with space, so I had to update the rules to:

Document{-> ADDRETAINTYPE(SPACE, BREAK)};
ANY{STARTSWITH(Document) -> Header};
// if the first token is a space, use the first non-space following it
Header{IS({SPACE, BREAK}) -> UNMARK(Header)} ANY*? ANY{-PARTOF({SPACE, BREAK}) -> MARK(Header)};
Document{-> REMOVERETAINTYPE(SPACE, BREAK)};
Header{->SHIFT(Header, 1, 2)} ANY[0,49];
ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)};

I will be happy to test drive MARKFIRST when it will be in trunk.

Alexandre

-- Alexandre Patry, Ph.D Chercheur / Researcher http://KeaText.com Transformez vos documents en outils de décision << Turn your documents into decision tools
[ruta] How to efficiently delete an annotation only if it appears within the N first token of a document?
Hi,

I use RUTA and I want to delete an annotation if it is within the first 50 tokens of a document. I came up with the following rules:

ANY{POSITION(Document, 1)-> Header}; // Annotate the first token in the document
Header{->SHIFT(Header, 1, 2)} ANY[0,49]; // Appends the 49 following tokens
ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)}; // Delete the first ToDelete if it is within the header

These rules work as expected but they are *really* slow. Is there a faster way to achieve that?

Thanks,
Alexandre

-- Alexandre Patry, Ph.D Chercheur / Researcher http://KeaText.com Transformez vos documents en outils de décision << Turn your documents into decision tools
Re: Multi-view CAS and sofa-unaware AE
On 13-04-03 11:10 AM, Peter Klügl wrote:
> Yes, but imagine you have a CAS with 10 views and you want to apply a primitive sofa-unaware AE on each view. The easiest solution I found was to write a template AAE descriptor, replace the AE descriptor and sofa name (and mapping), instantiate the AAE, call process(), and then repeat that for the next view. This can get quite ugly if you have to override parameters and you do not know the primitive AE and its parameters.

If you are willing to use uimaFIT, you could do it in a simple for loop. It would look like this:

    // build an aggregate that will run the same analysis engine on many sofas
    final AggregateBuilder builder = new AggregateBuilder();
    for (String sofa : sofas) {
        final AnalysisEngineDescription annotator = AnalysisEngineFactory.createPrimitiveDescription(
            YourEngine.class, paramName1, paramValue1, paramName2, paramValue2, ...);
        builder.add(annotator, "_InitialView", sofa);
    }
    final AnalysisEngine engine = builder.createAggregate();
    // you can then use your engine

The documentation on the web site (https://code.google.com/p/uimafit/) is quite good if you want more information.

Regards,
Alexandre

> Best,
> Peter
>
> On 03.04.2013 14:38, Jörn Kottmann wrote:
>> Yes, you can use the sofa mapping to map some view to the _InitialView. Have a look here:
>> http://uima.apache.org/d/uimaj-2.4.0/tutorials_and_users_guides.html#ugr.tug.mvs.sofa_name_mapping
>>
>> Jörn
>>
>> On 04/03/2013 02:19 PM, Peter Klügl wrote:
>>> Hi,
>>>
>>> sorry for this beginner question: Is there a shortcut to apply a sofa-unaware AE on a CAS view that is not the _InitialView? It seems quite cumbersome to programmatically generate an aggregate analysis engine description to wrap the sofa-unaware engine.
>>>
>>> Best,
>>> Peter

-- Alexandre Patry, Ph.D Chercheur / Researcher http://KeaText.com Transformez vos documents en outils de décision << Turn your documents into decision tools
Re: CAS Visualisation
On 2012-10-16, at 8:31 AM, Andreas Niekler wrote:
> Dear UIMA Users,
>
> I wonder what the best practice would be to render a CAS as an HTML snippet
> that could be included in a webpage. I already found the
> AnnotationViewGenerator, which produces complete HTML files; that is far
> too much, as I just want to generate snippets.
>
> Does anybody have a nice library or script to easily convert a CAS to an
> HTML-based structure?

I do not know if there is a class doing what you want from a CAS, but it is easy to extract snippets of HTML from a document using a library like jsoup (http://jsoup.org). For example, you could extract the body content in the following way:

    // retrieve complete document html
    String html = …
    // extract html under body
    String snippet = Jsoup.parse(html).select("body").html();

Hope this helps,
Alexandre

> Thank you very much
>
> --
> Andreas Niekler, Dipl. Ing. (FH)
> NLP Group | Department of Computer Science
> University of Leipzig
> Johannisgasse 26 | 04103 Leipzig
>
> mail: aniek...@informatik.uni-leipzig.deg.de
Re: Using JCasGen outside eclipse
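On Fri 12 Oct 2012 02:41:45 PM EDT, Himanshu Gahlot wrote:
> Hi,
>
> Is it possible to use the JCas generator utility in some other IDE (IntelliJ IDEA, to be specific) other than Eclipse? Something where I just need to write the XML for the new type and the corresponding Java class gets generated using a call to some UIMA class/script.

I use jcasgen along with Maven in IntelliJ IDEA. Here are the specific snippets for Maven:

    [...]
    <dependency>
      <groupId>org.uimafit</groupId>
      <artifactId>uimafit</artifactId>
      <version>1.4.0</version>
    </dependency>
    [...]
    <plugin>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>exec-maven-plugin</artifactId>
      <version>1.2.1</version>
      <executions>
        <execution>
          <id>jcasgen</id>
          <phase>generate-sources</phase>
          <goals>
            <goal>java</goal>
          </goals>
          <configuration>
            <mainClass>org.uimafit.util.JCasGenPomFriendly</mainClass>
            <arguments>
              <argument>file:${project.basedir}/src/main/resources/path/to/your/types/*.xml</argument>
              <argument>${project.build.directory}/generated-sources/uima</argument>
            </arguments>
          </configuration>
        </execution>
      </executions>
    </plugin>
    <plugin>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>build-helper-maven-plugin</artifactId>
      <version>1.7</version>
      <executions>
        <execution>
          <id>add-uima-sources</id>
          <phase>generate-sources</phase>
          <goals>
            <goal>add-source</goal>
          </goals>
          <configuration>
            <sources>
              <source>${project.build.directory}/generated-sources/uima</source>
            </sources>
          </configuration>
        </execution>
      </executions>
    </plugin>

Looking at jcasgen.sh, you could also call the class org.apache.uima.tools.jcasgen.Jg from a run configuration.

Hope this helps,
Alexandre

-- Alexandre Patry Ingénieur-Chercheur http://KeaText.com Transformez vos documents en outils de décision << Turn your documents into decision tools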