Re: run existing AE instance on different view
Hi Marshall, as far as I can tell all the mapping methods described there need to be applied *before* instantiating an AE. The problem is that while I can use CAS.getView(...) or JCas.getView(...) to access the desired view I find no way to call the process() method of an existing AE instance on it. One application we have right now is having a pretty memory-heavy pipeline loaded into memory that we need to apply to texts from different sources (typically as a web service). Depending on the source we may need to first apply translation, cleanup, etc., all of which create new views on which to operate. We are not using CPE or any other "standard" execution engine but rather create the initial JCas from the incoming text and then apply aggregate engines (using their process() method) to that JCas as needed. Best, Jens On Mon, Jul 9, 2018 at 10:46 PM, Marshall Schor wrote: > Hi, > > Is anything in > https://uima.apache.org/d/uimaj-2.10.2/tutorials_and_ > users_guides.html#ugr.tug.mvs.name_mapping_application > helpful? > > If not, could you add some details that says why not? > > -Marshall > > > On 7/5/2018 8:52 AM, Jens Grivolla wrote: > > Hi, > > > > I'm trying to run an already instantiated AE on a view other than > > _InitialView. Unfortunately, I can't just call process() on the desired > > view, as there is a call to Util.getStartingView(...) > > in PrimitiveAnalysisEngine_impl that forces it back to _InitialView. > > > > The view mapping methods I found (e.g. using and AggregateBuilder) work > on > > AE descriptions, so I would need to create additional instances (with the > > corresponding memory overhead). Is there a way to remap/rename the views > in > > a JCas before calling process() so that the desired view is seen as the > > _InitialView? It looks like CasCopier.copyCasView(..) could maybe be used > > for this, but it doesn't feel quite right. > > > > Best, > > Jens > > > >
Re: run existing AE instance on different view
Hi Eddie, unfortunately for the most part we can't (easily) change the AEs to make them SofA-aware (many of them come from DKPro). If no better solutions come up, I guess we will go with copying the view to be processed so it is always accessible the same way (either as _InitialView or with a different name that we always statically map to _InitialView). Thanks, Jens On Tue, Jul 10, 2018 at 3:58 PM, Eddie Epstein wrote: > I think the UIMA code uses the annotator context to map the _InitialView > and the context remains static for the life of the annotator. Replicating > annotators to handle different views has been used here too, but agree it > is ugly. > > If the annotator code can be changed, then one approach would be to put > some information in a fixed _IntialView that specifies which named view(s) > should be analyzed and have all down stream annotators use that to select > the view(s) to operate on. > > Also sounds possible to have a single new component use the cascopier to > create a new view that is always the one processed. > > Regards, > Eddie > > On Thu, Jul 5, 2018 at 8:52 AM, Jens Grivolla wrote: > > > Hi, > > > > I'm trying to run an already instantiated AE on a view other than > > _InitialView. Unfortunately, I can't just call process() on the desired > > view, as there is a call to Util.getStartingView(...) > > in PrimitiveAnalysisEngine_impl that forces it back to _InitialView. > > > > The view mapping methods I found (e.g. using and AggregateBuilder) work > on > > AE descriptions, so I would need to create additional instances (with the > > corresponding memory overhead). Is there a way to remap/rename the views > in > > a JCas before calling process() so that the desired view is seen as the > > _InitialView? It looks like CasCopier.copyCasView(..) could maybe be used > > for this, but it doesn't feel quite right. > > > > Best, > > Jens > > >
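The copy-based workaround discussed in this thread could look roughly like the following. This is a hedged sketch, not code from the thread: it assumes the `CasCopier.copyCasView(CAS, String, boolean)` overload available in recent UIMA 2.x releases, and `pipeline`, `sourceCas` and `sourceViewName` are illustrative names. The idea is simply to copy the named view into the _InitialView of a scratch CAS before calling process().

```java
// Sketch (assumptions labeled above): make an existing AE instance process a
// named view by copying that view into the _InitialView of a scratch CAS.
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.util.CasCopier;

public class ProcessNamedView {
    public static CAS processView(AnalysisEngine pipeline, CAS sourceCas,
                                  String sourceViewName) throws Exception {
        CAS workCas = pipeline.newCAS();  // reuse/pool this in real code
        CasCopier copier = new CasCopier(sourceCas, workCas);
        // Copy the desired view into the scratch CAS's _InitialView,
        // including the sofa (document text).
        copier.copyCasView(sourceCas.getView(sourceViewName),
                CAS.NAME_DEFAULT_SOFA, true);
        pipeline.process(workCas);
        return workCas;  // results could be copied back afterwards if needed
    }
}
```

This avoids instantiating a second copy of a memory-heavy pipeline at the cost of one CAS copy per call.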
run existing AE instance on different view
Hi, I'm trying to run an already instantiated AE on a view other than _InitialView. Unfortunately, I can't just call process() on the desired view, as there is a call to Util.getStartingView(...) in PrimitiveAnalysisEngine_impl that forces it back to _InitialView. The view mapping methods I found (e.g. using an AggregateBuilder) work on AE descriptions, so I would need to create additional instances (with the corresponding memory overhead). Is there a way to remap/rename the views in a JCas before calling process() so that the desired view is seen as the _InitialView? It looks like CasCopier.copyCasView(..) could maybe be used for this, but it doesn't feel quite right. Best, Jens
Re: Run an analysis engine after processing document collection?
Hi Ben, if I understand correctly you want to run a process once the whole collection has been analyzed. You can have an AnalysisEngine that does this by implementing http://uima.apache.org/d/uimaj-2.10.0/apidocs/org/apache/uima/analysis_engine/AnalysisEngine.html#collectionProcessComplete() You just need to make sure that you gather all the necessary information somehow. If the AE that calculates the statistics is at the end of the pipeline and you have only one instance of it it's easy to gather all the information there. Or you could just write everything you need to a centralized datastore (i.e. a database) and use that to calculate the statistics. If I didn't misunderstand you, that's really a quite common scenario. Best, Jens On Fri, Dec 22, 2017 at 6:26 PM, Benedict Holland < benedict.m.holl...@gmail.com> wrote: > Hello All, > > I find myself in a strange situation. I have a content processing engine > working. I have N threads populating N CAS objects and running my pipeline. > Each CAS object gets 1 piece of data, like say a row in a database. Each > process is entirely independent and can run concurrently. I specifically > did not configure this pipeline as an aggregate process as I don't really > care when the events trigger since the CPE maintains the order of the > engines. > > Now I want to add an analysis that will run over the aggregate output. For > example, I processed N texts using the CPE and now I want to run a TF-IDF > analysis over the entire corpora. The TF-IDF analysis should only run once > all documents are processed. > > How would I go about doing this? Does this have to do with not allowing > multiple deployments? > > Thanks, > ~Ben >
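The collectionProcessComplete() hook mentioned above can be sketched as follows. This is a hedged, uimaFIT-style illustration, not Ben's actual pipeline: the class name, the naive whitespace tokenization, and the printed output are all invented for the example.

```java
// Sketch: an AE that gathers per-document term statistics in process() and
// computes IDF once, after the whole collection, in collectionProcessComplete().
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.jcas.JCas;

public class IdfCollector extends JCasAnnotator_ImplBase {
    private final Map<String, Integer> docFreq = new HashMap<>();
    private int numDocs = 0;

    @Override
    public void process(JCas jcas) {
        numDocs++;
        Set<String> seen = new HashSet<>();
        // Naive whitespace tokenization for illustration; a real pipeline
        // would iterate the Token annotations produced upstream.
        for (String tok : jcas.getDocumentText().toLowerCase().split("\\s+")) {
            if (seen.add(tok)) {
                docFreq.merge(tok, 1, Integer::sum);
            }
        }
    }

    @Override
    public void collectionProcessComplete() {
        // Called once after the last CAS: compute and emit the statistics.
        docFreq.forEach((term, df) ->
            System.out.printf("%s\tidf=%.3f%n", term,
                Math.log((double) numDocs / df)));
    }
}
```

Note this only works with a single instance of the AE, as Jens says; with multiple deployments the counts would be split across instances.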
Re: Parameters for PEAR
Is there a specific reason to use PEARs? As far as I remember (but I could be wrong, it's been a few years), the main advantages of using them (automatic class path configuration, some degree of isolation between components) were lost when we wanted to change configuration parameters, because then we would need to use the AE descriptor instead of the PEAR descriptor (at least with CPE). If you're not going to use the PEAR descriptor then an installed PEAR is not much more than a bunch of JARs, and component descriptors with tons of hard-coded absolute file paths, so you should be able to just use and configure a component based on those descriptors (without anything PEAR-specific). We have since switched to doing everything with uimaFIT, which gives you many possibilities to adapt your workflow, configure engines programmatically, etc. For us the change has been hugely positive, both for development (and debugging) and for deployment in a wide variety of ways and environments. Best, Jens On Tue, Dec 12, 2017 at 8:39 AM, Matthias Koch wrote: > Hi, > > I want to configure a PEAR dynamically. (I install the pear and want to > produce the analysis engine with different parameters than in the xml). > Is this possible? Can I use the additionalParameters? I have seen that the > PearSpecifier has an instance variable for parameters, but no one is using > (calling) it. > > I want to produce the analysisEngine with: > UIMAFramework.produceAnalysisEngine(resourceSpecifer, > resourceManager, params); > > In this specifier there should be one or more pearSpecifiers that should > be configured. > > I have overridden the PearAnalysisEngineWrapper and built a loop that > configures the following specifier over the configurationParameterSettings. > It takes the parameters from the pear specifiers. 
> > line 257-258 > // Parse the resource specifier > ResourceSpecifier specifier = UIMAFramework.getXMLParser().p > arseResourceSpecifier(in); > > ==> added code > AnalysisEngineDescription analysisEngineDescription = > (AnalysisEngineDescription) specifier; > AnalysisEngineMetaData analysisEngineMetaData = > analysisEngineDescription.getAnalysisEngineMetaData(); > ConfigurationParameterSettings configurationParameterSettings = > analysisEngineMetaData.getConfigurationParameterSettings(); > for (Parameter parameter : Arrays.asList(pearSpec.getParameters())) { > > configurationParameterSettings.setParameterValue(parameter.getName(), > parameter.getValue()); > } > > Is it possible without overriding anything? > > UIMAJ Version: 2.10 > > Sincerely > Matthias > > -- > Matthias Koch > > Averbis GmbH > Tennenbacher Str. 11 > 79106 Freiburg > Germany > > Fon: +49 761 708 394 0 > Fax: +49 761 708 394 10 > Email:matthias.k...@averbis.com > Web:https://averbis.com > > Headquarters: Freiburg im Breisgau > Register Court: Amtsgericht Freiburg im Breisgau, HRB 701080 > Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó > >
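For a plain (non-PEAR) AE descriptor, overriding parameters before producing the engine works roughly as below; whether the same settings propagate through a PearSpecifier is exactly the open question in this thread. The descriptor path and parameter name are illustrative assumptions.

```java
// Sketch: parse an AE descriptor, override a configuration parameter on the
// parsed description, then produce the engine from the modified description.
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.util.XMLInputSource;

public class OverrideParams {
    public static AnalysisEngine create(String descriptorPath) throws Exception {
        AnalysisEngineDescription desc = UIMAFramework.getXMLParser()
            .parseAnalysisEngineDescription(new XMLInputSource(descriptorPath));
        // "caseMatch" is just an example parameter name
        desc.getAnalysisEngineMetaData().getConfigurationParameterSettings()
            .setParameterValue("caseMatch", "ignoreall");
        return UIMAFramework.produceAnalysisEngine(desc);
    }
}
```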
Re: General question about UimaFIT
And I guess you don't get JCas classes for your type system without going through JCasGen, which is another disadvantage to generating the types on the fly. It also kind of goes against the fact that the type system should be something you can rely on for communication between components, so it would tend to be static. Just out of curiosity, what's the use case for this (except maybe unit testing as Armin mentioned)? Best, Jens On Fri, Sep 9, 2016 at 4:31 PM, Richard Eckart de Castilho wrote: > On 09.09.2016, at 13:39, Asher Stern wrote: > > > > Hi Armin. > > Thanks for your quick answer! > > > > While the workaround is indeed helpful, I am still curious why there is no > > regular mechanism to define new types and create new descriptors > > programmatically, much like all other UIMA components? > > Sure you can define types programmatically... it's just that for the > case of types, defining them through XML is actually more convenient. > Mind that the type-system is implementation independent! You can think > of it as a DTD or XSD. > > If you want to programmatically create a type, you can do this: > > TypeSystemDescription tsd = new TypeSystemDescription_impl(); > TypeDescription tokenTypeDesc = tsd.addType("Token", "", > CAS.TYPE_NAME_ANNOTATION); > tokenTypeDesc.addFeature("length", "", CAS.TYPE_NAME_INTEGER); > > CAS cas = CasCreationUtils.createCas(tsd, null, null); > cas.setDocumentText("This is a test."); > > Check out [1] slides 20 following. > > Cheers, > > -- Richard > > [1] https://github.com/dkpro/dkpro-tutorials/blob/master/ > GSCL2013/tags/latest/slides/GSCL2013UIMATutorialUKP.pdf
Re: CPE memory usage
Hi Armin, glad I could help. Getting all IDs first also avoids problems with changing data which could mess with the offsets. This way you have a fixed snapshot of all existing documents (at the beginning). Best, Jens On Mon, Aug 29, 2016 at 8:12 AM, <armin.weg...@bka.bund.de> wrote: > Hi Jens, > > I just want to confirm your information. As you said, the query gets > slower the larger start is, even using filters. The best solution is to get > all ids first (may take some time), and then to get each documents by id > successively. There is a request handler (get) and a Java API method > (HttpSolrClient.getById()) to do so. > > Thanks to your help, I have a constantly fast queries, now. > > Cheers, > Armin > > -Ursprüngliche Nachricht- > Von: j...@grivolla.net [mailto:j...@grivolla.net] Im Auftrag von Jens > Grivolla > Gesendet: Dienstag, 16. August 2016 13:34 > An: user@uima.apache.org > Betreff: Re: CPE memory usage > > Solr is known not to be very good at deep paging, but rather getting the > top relevant results. Running a query asking for the millionth document is > pretty much the worst you can do as it will have to rank all documents > again, up to the millionth, and return that one. It can also be unreliable > if your document collection changes. > > We did get it to work quite well, though. I believe we used only filters > and retrieved the results in natural order, so that Solr wouldn't have to > rank the documents. We also had a version where we first retrieved all > matching document ids in one go, and then queried for the documents by id, > one by one, in getNext(). > > Deep paging has also seen some major improvements over time IIRC, so newer > Solr versions should perform much better than the ones from a few years > ago. > > Best, > Jens > > On Tue, Aug 9, 2016 at 12:20 PM, <armin.weg...@bka.bund.de> wrote: > > > Hi! > > > > Finally, it looks like that Solr causes the high memory consumption. 
The > > SolrClient isn't expected to be used like I did it. But it isn't > documented > > either. The Solr documentation is very bad. I just happened to find a > > solution on the web by accident. > > > > Thanks, > > Armin > > > > -Ursprüngliche Nachricht- > > Von: Richard Eckart de Castilho [mailto:r...@apache.org] > > Gesendet: Montag, 8. August 2016 15:33 > > An: user@uima.apache.org > > Betreff: Re: CPE memory usage > > > > Do you have code for a minimal test case? > > > > Cheers, > > > > -- Richard > > > > > On 08.08.2016, at 15:31, <armin.weg...@bka.bund.de> < > > armin.weg...@bka.bund.de> wrote: > > > > > > Hi Richard! > > > > > > I've changed the document reader to a kind of no-op-reader, that always > > sets the document text to an empty string: same behavior, but much slower > > increase in memory usage. > > > > > > Cheers, > > > Armin > > > > >
Re: CPE memory usage
Solr is known not to be very good at deep paging, but rather getting the top relevant results. Running a query asking for the millionth document is pretty much the worst you can do as it will have to rank all documents again, up to the millionth, and return that one. It can also be unreliable if your document collection changes. We did get it to work quite well, though. I believe we used only filters and retrieved the results in natural order, so that Solr wouldn't have to rank the documents. We also had a version where we first retrieved all matching document ids in one go, and then queried for the documents by id, one by one, in getNext(). Deep paging has also seen some major improvements over time IIRC, so newer Solr versions should perform much better than the ones from a few years ago. Best, Jens On Tue, Aug 9, 2016 at 12:20 PM,wrote: > Hi! > > Finally, it looks like that Solr causes the high memory consumption. The > SolrClient isn't expected to be used like I did it. But it isn't documented > either. The Solr documentation is very bad. I just happened to find a > solution on the web by accident. > > Thanks, > Armin > > -Ursprüngliche Nachricht- > Von: Richard Eckart de Castilho [mailto:r...@apache.org] > Gesendet: Montag, 8. August 2016 15:33 > An: user@uima.apache.org > Betreff: Re: CPE memory usage > > Do you have code for a minimal test case? > > Cheers, > > -- Richard > > > On 08.08.2016, at 15:31, < > armin.weg...@bka.bund.de> wrote: > > > > Hi Richard! > > > > I've changed the document reader to a kind of no-op-reader, that always > sets the document text to an empty string: same behavior, but much slower > increase in memory usage. > > > > Cheers, > > Armin > >
Re: Selecting all connected annotations by type.
Ok Richard, I'll look into it, but I don't promise anything at this point (tons of project deliverables coming up)... -- Jens On Fri, Oct 23, 2015 at 2:03 PM, Richard Eckart de Castilho <r...@apache.org> wrote: > Hi Jens, > > :) don't you want to test and apply it? My next projected time slot for > uimaFIT is in December. > > Best, > > -- Richard > > > On 23.10.2015, at 11:09, Jens Grivolla <j+...@grivolla.net> wrote: > > > > I'd really like to have that functionality also (we'll need to do > something > > like that quite soon), so I just voted on the issue... > > > > I haven't tested the patch yet. José, have you been using this over the > > last few months? > > > > -- Jens > >
Re: Selecting all connected annotations by type.
I'd really like to have that functionality also (we'll need to do something like that quite soon), so I just voted on the issue... I haven't tested the patch yet. José, have you been using this over the last few months? -- Jens On Sun, Feb 1, 2015 at 2:04 AM, José Tomás Atria wrote: > Issue created, patch submitted. > > https://issues.apache.org/jira/browse/UIMA-4212 > > On Sat Jan 31 2015 at 3:12:33 AM Richard Eckart de Castilho < > r...@apache.org> > wrote: > > > Dear José, > > > > could you please re-submit the patch via the Apache UIMA issue tracker: > > > > Thanks! > > > > -- Richard > > > > https://issues.apache.org/jira/browse/UIMA > > > > On 31.01.2015, at 05:38, José Tomás Atria wrote: > > > > > Please disregard the previous patch, apparently I managed to corrupt it > > while creating it over ssh. > > > > > > The version in this email should be correct, I hope. > > > > > > Best, > > > jta > > > > >
Re: Views or Separate CASes?
Hi Matt, As Richard said, Views are designed more for having "parallel" information, such as separate layers of audio, transcript, video, etc. referring to the same content or "document". I'm not quite sure why you want to split your document for processing (which you could do with a CAS Multiplier). Wouldn't it be much easier to just maintain and process it as one document, marking the different segments with e.g. speaker information, etc.? I don't quite understand your need for splitting; your AEs can run on all the segments (and most can be instructed not to cross segment boundaries or only work at the sentence level anyway). Of course, if what you want is to be able to search for and retrieve segments that pertain to different speakers, then you will need to index your content in something like Solr outside of UIMA, and while you could use a CAS Multiplier and then index each generated CAS as a document, it is much easier to just have a CasConsumer that knows how to deal with your segment annotations and extracts the information you want to index in an appropriate form. You may want to look at our project EUMSSI (http://eumssi.eu/) which is about doing exactly this. You can find our initial design here: http://www.aclweb.org/anthology/W14-5212 which we presented at the last UIMA workshop (http://glicom.upf.edu/OIAF4HLT/) and some more documentation on https://github.com/EUMSSI/EUMSSI-platform/wiki. The segment indexing is not in there yet, but I expect to put something on Github in the next one or two weeks. Best, Jens On Wed, Aug 26, 2015 at 4:45 PM, Matthew DeAngelis wrote: > Hello UIMA Gurus, > > I am relatively new to UIMA, so please excuse the general nature of my > question and any butchering of the terminology. > > I am attempting to write an application to process transcripts of audio > files. 
Each "raw" transcript is in its own HTML file with a section listing > biographical information for the speakers on the call followed by a number > of sections containing transcriptions of the discussion of different > topics. I would like to be able to analyze each speaker's contributions > separately by topic and then aggregate and compare these analyses between > speakers and between each speaker and the full text. I was thinking that I > would break the document into a new segment each time the speaker or the > section of the document changes (attaching relevant speaker metadata to > each section), run additional Analysis Engines on each segment (tokenizer, > etc.), and then arbitrarily recombine the results of the analysis by > speaker, etc. > > Looking through the documentation, I am considering two approaches: > > 1. Using a CAS Multiplier. Under this approach, I would follow the example > in Chapter 7 of the documentation, divide on section and speaker > demarcations, add metadata to each CAS, run additional AEs on the CASes, > and then use a multiplier to recombine the many CASes for each document > (one for the whole transcript, one for each section, one for each speaker, > etc.). The advantage of this approach is that it seems easy to incorporate > into a pipeline of AEs, since they are designed to run on each CAS. The > disadvantage is that it seems unwieldy to have to keep track of all of the > related CASes per document and aggregate statistics across the CASes. > > 2. Use CAS Views. This option is appealing because it seems like CAS Views > were designed for associating many different aspects of the same document > with one another. However, it looks to me that I would have to specify > different views both when parsing the document into sections and when > passing them through subsequent AEs, which would make it harder to drop > into an existing pipeline. I may be misunderstanding how subsequent AEs > work with Views, however. 
> > For those more experienced with UIMA, how would you approach this problem? > It's entirely possible that I am missing a third (fourth, fifth...) > approach that would work better than either of those above, so any guidance > would be much appreciated. > > > Regards and thanks, > Matt >
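The single-CAS approach suggested above (segment annotations instead of one CAS per segment) can be sketched with uimaFIT. This is a hedged illustration: `Segment` (with a `speaker` feature) and `Token` are hypothetical types standing in for whatever your own type system defines.

```java
// Sketch under stated assumptions: "Segment" and "Token" are placeholder
// annotation types from a hypothetical user type system.
import java.util.List;

import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

public class SpeakerStats {
    public static void perSpeakerTokenCounts(JCas jcas) {
        for (Segment seg : JCasUtil.select(jcas, Segment.class)) {
            // Tokens from the normal pipeline, restricted to this speaker turn
            List<Token> tokens = JCasUtil.selectCovered(jcas, Token.class, seg);
            System.out.printf("%s: %d tokens%n", seg.getSpeaker(), tokens.size());
        }
    }
}
```

Aggregating per speaker or per topic is then just grouping over the Segment annotations, with no bookkeeping across multiple CASes.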
Re: Dictionary Matching using Concept Mapper for single word entry.
Hi Khirod, could it be that your single-word document doesn't get marked as a sentence? You have SpanFeatureStructure set to com.naukri.parse.type.Sentence, so ConceptMapper only works on things that are within a Sentence annotation. Tokens that are not part of a sentence will not be seen at all. This has happened to us when working on malformed text where some sentence segmenters would leave parts of the text unmarked. Best, Jens On Sun, Jul 19, 2015 at 4:00 PM, Khirod Kant Naik kkantn...@gmail.com wrote: Hi everyone, I am unable to match text from the dictionary if the enclosing span contains only a single token. For example, I am trying to match the word "education" from my dictionary, using a sentence as the enclosing span. If the sentence contains only a single token, I am not able to match it from the dictionary. Here is what I have tried: when I have a sentence like "Education **something else**", ConceptMapper matches "education". But if I have a sentence consisting only of "Education", ConceptMapper does not pick it up from the dictionary. So my question is: does ConceptMapper require more than one TokenAnnotation within the specified SpanFeatureStructure? P.S.: This is the descriptor I am using:

<?xml version="1.0" encoding="UTF-8"?>
<taeDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>org.apache.uima.conceptMapper.ConceptMapper</annotatorImplementationName>
  <analysisEngineMetaData>
    <name>Segment Heading Annotator</name>
    <description/>
    <version>1</version>
    <vendor/>
    <configurationParameters>
      <configurationParameter>
        <name>caseMatch</name>
        <description>this parameter specifies the case folding mode: ignoreall - fold everything to lowercase for matching; insensitive - fold only tokens with initial caps to lowercase; digitfold - fold all (and only) tokens with a digit; sensitive - perform no case folding</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>Stemmer</name>
        <description>Name of stemmer class to use before matching. MUST have a zero-parameter constructor! If not specified, no stemming will be performed.</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>false</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>ResultingAnnotationName</name>
        <description>Name of the annotation type created by this TAE, must match the typeSystemDescription entry</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>ResultingEnclosingSpanName</name>
        <description>Name of the feature in the resultingAnnotation to contain the span that encloses it (i.e. its sentence)</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>false</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>AttributeList</name>
        <description>List of attribute names for XML dictionary entry record - must correspond to FeatureList</description>
        <type>String</type>
        <multiValued>true</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>FeatureList</name>
        <description>List of feature names for CAS annotation - must correspond to AttributeList</description>
        <type>String</type>
        <multiValued>true</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>TokenAnnotation</name>
        <description/>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>TokenClassFeatureName</name>
        <description>Name of feature used when doing lookups against IncludedTokenClasses and ExcludedTokenClasses</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>false</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>TokenTextFeatureName</name>
        <description/>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>false</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>SpanFeatureStructure</name>
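One way to act on the diagnosis above is to guarantee that every document, however short, carries at least one Sentence annotation before ConceptMapper runs. This is a hedged uimaFIT-style sketch, not part of the thread; the `Sentence` type stands in for whatever type the SpanFeatureStructure parameter names (here com.naukri.parse.type.Sentence).

```java
// Sketch under assumptions: "Sentence" is a placeholder for the span type
// configured via ConceptMapper's SpanFeatureStructure parameter.
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

public class FallbackSentenceAnnotator extends JCasAnnotator_ImplBase {
    @Override
    public void process(JCas jcas) {
        // If the sentence segmenter produced nothing (e.g. a single-token
        // document), cover the whole text with one Sentence annotation so
        // ConceptMapper can see the tokens inside it.
        if (JCasUtil.select(jcas, Sentence.class).isEmpty()) {
            new Sentence(jcas, 0, jcas.getDocumentText().length()).addToIndexes();
        }
    }
}
```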
Re: UIMAfit analysis descriptions appear to trim String configuration parameters
On Mon, Jun 15, 2015 at 8:43 AM, Mario Gazzo mario.ga...@gmail.com wrote: I am referring to this Github repo: https://github.com/apache/uima-uimafit Thought it was published by you as a mirror of the SVN repo or the other way around. Yes, this is the official (one-way) mirror of the SVN repository. If you want to be able to reference SVN commits you can look at the commit details on Github: https://github.com/apache/uima-uimafit/commit/e9b32e30895443b9f93fef65453593dd1533c7d0 There you see: git-svn-id: https://svn.apache.org/repos/asf/uima/uimafit/trunk@1681410 13f79535-47bb-0310-9956-ffa450edef68 Unfortunately, the link doesn't actually work with the repository browser at svn.apache.org, but at least the commit id should be correct. The correspondence between commits in SVN and git is a bit complicated because there is only one big SVN repository for all of UIMA, whereas there are separate git repositories for the subprojects. Therefore the commit you reference is the latest one in the uimaFIT git repository, but there are newer commits in the UIMA SVN. HTH, Jens
Re: Approach for keeping track of formatting associated with text views
Hi Peter, while I don't think I will be using the HtmlConverter right away, I would vote for using the length of the document annotation for annotations that relate to the whole document (such as metadata). That makes them show up nicely in the CasEditor/Viewer and you could maintain it in all segments when you split a CAS (e.g. with something based on the SimpleTextSegmenter example). -- Jens On Sat, Mar 7, 2015 at 5:33 PM, Peter Klügl pklu...@uni-wuerzburg.de wrote: Hi, there is no way yet to customize this behavior. The HtmlConverter only retains annotations of a length > 0, since annotations with length == 0 are rather problematic and should be avoided. I can add a configuration parameter for keeping these annotations if you want (best open an issue for it). What should be the offsets of the annotations for elements in the head of the html document? 0, those of the first token, or those of the document annotation? Best, Peter On 06.03.2015 at 14:00, Mario Gazzo wrote: We conducted some experiments with both the HtmlAnnotator and the HtmlConverter but we ran into an issue with the converter. It appears to only convert tag annotations that surround or are inside the body tag. Metadata elements like citations are ignored. The only way to get around this seems to be by forking and modifying the codebase, which I would like to avoid. Both modules seem otherwise very useful to us but I am looking for a better approach to solve this issue. Is there some way to customise this behaviour without code modifications? Your input is appreciated, thanks. On 18 Feb 2015, at 23:03, Mario Gazzo mario.ga...@gmail.com wrote: Thanks. Looks interesting, seems that it could fit our use case. We will have a closer look at it. On 18 Feb 2015, at 21:58, Peter Klügl pklu...@uni-wuerzburg.de wrote: Hi, you might want to take a look at two analysis engines of UIMA Ruta: HtmlAnnotator and HtmlConverter [1] The former one creates annotations for html elements and therefore also for xml tags. 
The latter one creates a new view with only the plain text and adds existing annotations while adapting their offsets to the new document. Best, Peter [1] http://uima.apache.org/d/ruta-current/tools.ruta.book.html# ugr.tools.ruta.ae.html Am 18.02.2015 um 21:46 schrieb Mario Gazzo: We are starting to use the UIMA framework for NL processing article text, which is usually stored with metadata in some XML format. We need to extract text elements to be processed by various NL analysis engines that only work with pure text but we also need to keep track of the formatting information related to the processed text. It is in general also valuable for us to be able to track every annotation back to the original XML to maintain provenance. Before embarking on this I like to validate our approach with more experienced users since this is the first application we are building with UIMA. In the first step we would annotate every important element of the XML including formatting elements in the body. We maintain some DOM-like relationships between the body text and formatting annotations so that text formatting can be reproduced later with NLP annotations in some article viewer. Next we would in another AE produce a pure text view of the text annotations in the XML view that need to be NL analysed. In this new text view we would annotate the different text elements with references back to their counterpart in the original XML view so that we can trace back positions in the original XML and the formatting relations. This of course will require mapping NLP annotation offsets in the text view back to the XML view but the information should then be there to make this possible. This approach requires somewhat more handcrafted book keeping than we initially hoped would be necessary. We haven’t been able to find any examples of how this is usually done and the UIMA docs are vague regarding managing this kind of relationships across views. 
We would therefore really like to know if there is a simpler and better approach. Any feedback is greatly appreciated. Thanks.
Re: Ruta parallel execution
Hi Silvestre, there doesn't seem to be anything Ruta-specific in your question. In principle, UIMA-AS allows parallel scaleout and merges the results (though I personally have never used it this way), but there are of course a few things to take into account. First, you will need to properly define the dependencies between your different analysis engines to ensure you always have all the necessary information available, meaning that you can only run things in parallel that are independent of one another. And then you will have to see if the overhead from distributing your CAS to several engines running in parallel and then merging the results is not greater than just having it in one colocated pipeline that can pass the information more efficiently. I guess you'll have to benchmark your specific application, but maybe somebody with more experience can give you some general directions... Best, Jens On Thu, Dec 18, 2014 at 12:26 PM, Silvestre Losada silvestre.los...@gmail.com wrote: Well let me explain. Ruta scripts are really good to work over the output of analysis engines: each analysis engine does some atomic work, and using Ruta rules you can easily work over the generated annotations, combine them, remove them... What I need is to execute several analysis engines in parallel to improve the response time. Right now the analysis engines are executed sequentially and I want to execute them in parallel, then take the output of all of them and apply some Ruta rules to the output. Would it be possible? On 17 December 2014 at 18:13, Peter Klügl pklu...@uni-wuerzburg.de wrote: Hi, I haven't used UIMA-AS (with Ruta) in a real application yet, but I tested it once for an rc. Did you face any problems? Best Peter On 17.12.2014 14:34, Silvestre Losada wrote: Hi All, Is there any way to execute Ruta scripts in parallel, using the UIMA-AS approach? If so, could you provide me an example? Kind regards.
Re: CFP: Workshop on Open Infrastructures and Analysis Frameworks for HLT
The workshop program, along with links to the full papers, is now available: http://glicom.upf.edu/OIAF4HLT/Program.html I'm looking forward to seeing many of you there. I'll be staying at DCU (College Park). -- Jens On Tue, Jul 1, 2014 at 6:52 PM, Jens Grivolla j+...@grivolla.net wrote: The list of accepted papers is now available: http://glicom.upf.edu/OIAF4HLT/Papers.html For anybody interested in attending the workshop and COLING, please remember that the early registration deadline is tomorrow, July 2nd. Looking forward to seeing many of you there... -- Jens On Wed, Mar 26, 2014 at 2:34 PM, Jens Grivolla j+...@grivolla.net wrote: Workshop on Open Infrastructures and Analysis Frameworks for HLT http://glicom.upf.edu/OIAF4HLT/ At the 25th International Conference on Computational Linguistics (COLING 2014) Helix Conference Centre at Dublin City University (DCU) 23-29 August 2014 Description --- Recent advances in digital storage and networking, coupled with the extension of human language technologies (HLT) into ever broader areas and the persistence of difficulties in software portability, have led to an increased focus on development and deployment of web-based infrastructures that allow users to access tools and other resources and combine them to create novel solutions that can be efficiently composed, tuned, evaluated, disseminated and consumed. This in turn engenders collaborative development and deployment among individuals and teams across the globe. It also increases the need for robust, widely available evaluation methods and tools, means to achieve interoperability of software and data from diverse sources, means to handle licensing for limited access resources distributed over the web, and, perhaps crucially, the need to develop strategies for multi-site collaborative work. 
For many decades, NLP has suffered from low software engineering standards causing a limited degree of re-usability of code and interoperability of different modules within larger NLP systems. While this did not really hamper success in limited task areas (such as implementing a parser), it caused serious problems for building complex integrated software systems, e.g., for information extraction or machine translation. This lack of integration has led to duplicated software development, work-arounds for programs written in different (versions of) programming languages, and ad-hoc tweaking of interfaces between modules developed at different sites. In recent years, two main frameworks, UIMA and GATE, have emerged that aim to allow the easy integration of varied tools through common type systems and standardized communication methods for components analysing unstructured textual information, such as natural language. Both frameworks offer a solid processing infrastructure that allows developers to concentrate on the implementation of the actual analytics components. An increasing number of members of the NLP community have adopted one of these frameworks as a platform for facilitating the creation of reusable NLP components that can be assembled to address different NLP tasks depending on their order, combination and configuration. Analysis frameworks also reduce the problem of reproducibility of NLP results by formalising solution composition and making language processing tools shareable. Very recently, several efforts have been devoted to the development of web service platforms for NLP. These platforms exploit the growing number of web-based tools and services available for tasks related to HLT, including corpus annotation, configuration and execution of NLP pipelines, and evaluation of results and automatic parameter tuning. 
These platforms can also integrate modules and pipelines from existing frameworks such as UIMA and GATE, in order to achieve interoperability with a wide variety of modules from different sources. Many of the issues and challenges surrounding these developments have been addressed individually in particular projects and workshops, but there are ramifications that cut across all of them. We therefore feel that this is the moment to bring together participants representing the range of interests that comprise the comprehensive picture for community-driven, distributed, collaborative, web-based development and use for language processing software and resources. This includes those engaged in development of infrastructures for HLT as well as those who will use these services and infrastructures, especially for multi-site collaborative work. ### Workshop Objectives The overall goal of this workshop is to provide a forum for discussion of the requirements for an envisaged open “global laboratory” for HLT research and development and establish the basis of a community effort to develop and support it. To this end, the workshop will include
Re: CFP: Workshop on Open Infrastructures and Analysis Frameworks for HLT
The list of accepted papers is now available: http://glicom.upf.edu/OIAF4HLT/Papers.html For anybody interested in attending the workshop and COLING, please remember that the early registration deadline is tomorrow, July 2nd. Looking forward to seeing many of you there... -- Jens
Last chance: Workshop on Open Infrastructures and Analysis Frameworks for HLT
Hello all, on request of several people who are just now getting back from LREC, we have again extended the deadline for the Workshop on Open Infrastructures and Analysis Frameworks for HLT. The new paper submission deadline is June 10th, 2014. This is looking to be a very nice workshop, with a strong UIMA presence as well as a chance to see how other frameworks deal with many of the same issues that we encounter. I hope to see many of you there. And thanks to those who have already submitted their paper to the workshop. :-) -- Jens
Re: CFP: Workshop on Open Infrastructures and Analysis Frameworks for HLT
The submission deadline for the workshop was just extended significantly to align with some of the other COLING 2014 workshops. The new dates are:
Paper Submission Deadline: 1st June 2014
Author Notification Deadline: 30th June 2014
Camera-Ready Paper Deadline: 10th July 2014
Workshop: 23rd August 2014
You can find the workshop description and CFP at http://glicom.upf.edu/OIAF4HLT/ I hope to see you there and look forward to your contributions. -- Jens
Re: next UIMA workshop?
On Mon, Mar 31, 2014 at 10:12 PM, Marshall Schor m...@schor.com wrote: On 3/26/2014 9:44 AM, Jens Grivolla wrote: Finally, despite the fact that UIMA does not appear in the title anymore, would it be possible to have an announcement on the UIMA web page? I think so (unless others disagree). Can you draft something? I tried to prepare a draft for svnpubsub to see how it fits with the UIMA site (without linking to it at first, of course), and created uima-website/xdocs/coling14.xml It then seems that I need to rebuild the site on my machine with ANT and push the resulting changes in docs/, which I did. The resulting page can be seen at http://uima.apache.org/coling14.html and looks more or less ok. I hope I didn't do anything wrong by committing directly to the site, but I didn't find a good way to try it in the actual page layout and show the results otherwise. In any case it's not linked from anywhere and shouldn't affect any other parts of the site. -- Jens
Re: next UIMA workshop?
Hi all, I have just posted the (more or less) final CFP on uima-user and uima-dev. Feel free to distribute the CFP to anybody you think would be interested. While this has been merged with a different workshop and thus has a somewhat wider scope than just UIMA, I still view this as a follow-up to the UIMA workshop at GSCL and would hope to have similarly interesting contributions from the UIMA community. If you are a PC member, or willing to be one, please contact me off-list with the email address and affiliation that you would like me to use for this purpose. Finally, despite the fact that UIMA does not appear in the title anymore, would it be possible to have an announcement on the UIMA web page? -- Jens On 05/02/14 11:46, Jens Grivolla wrote: We have been asked to merge our workshop with a similar one focusing on open infrastructures. The result is a Workshop on Open Infrastructures and Analysis Frameworks for HLT. We will now start to build a common CFP from the two proposals. All contributions are welcome: https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/cfp.md -- Jens On 19/01/14 15:40, Jens Grivolla wrote: I have sent the proposal, we'll see what they say... -- Jens On 17/01/14 15:02, Jens Grivolla wrote: On 15/01/14 20:51, Richard Eckart de Castilho wrote: On 15.01.2014, at 15:10, Jens Grivolla j+...@grivolla.net wrote: The CFP itself must still be rewritten to be less UIMA-centric, other than that this is starting to look quite good. GATE developer Mark A. Greenwood did the rewrite and sent me a pull request on Github. For example, the topic experience reports combining UIMA-based components from different sources, as well as solutions to interoperability issues could be reworded as: 1) experience reports combining language analysis components from different sources, as well as solutions to interoperability issues 2) experience reports combining different frameworks (e.g.
GATE/UIMA/WebLicht/etc.), as well as solutions to interoperability issues I put both in there as separate points. I think both aspects would be interesting. I'm a little afraid that 1) might end up reiterating the existence of frameworks like UIMA, while 2) might end up being mostly about web services or semantic web approaches to interoperability - which may not be very interesting. I'd be more interested in what issues and solutions exist beyond this, e.g. with regard to the interchangeability of components. What problems exist when e.g. one parser component in a workflow is replaced with a different one? How can these be solved? (Cf. Noh and Padó, 2013 [1]). Agree. Subtle semantic differences between alternative components can be more challenging than the technical integration. I'm not sure how to put that in the CFP without it getting very verbose, though. I think one more topic could be added: - combining annotation type systems in processing frameworks (GATE, UIMA, etc.) with standardization efforts, such as done in the ISO TC37/SC4 or TEI contexts. Done. Thanks for your input. As always, the current state of the proposal can be seen on Github: https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/proposal.md I think the current version is pretty close to final. If there are any more suggestions, hurry up, the deadline is approaching. -- Jens
Re: next UIMA workshop?
We have been asked to merge our workshop with a similar one focusing on open infrastructures. The result is a Workshop on Open Infrastructures and Analysis Frameworks for HLT. We will now start to build a common CFP from the two proposals. All contributions are welcome: https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/cfp.md -- Jens
Re: next UIMA workshop?
I have sent the proposal, we'll see what they say... -- Jens
Re: next UIMA workshop?
Thanks, fixed. On 14/01/14 19:04, Peter Klügl wrote: Hi, Just a small correction: The last workshop had nine paper presentations and one invited talk. Best, Peter Am 14.01.2014 18:11, schrieb Jens Grivolla: Hello, there's only 5 days remaining to submit the workshop proposal. Please anybody interested get in touch. I sent a mail to the GATE user list to get some input from them. The proposal draft is here: https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/proposal.md -- Jens On 19/12/13 13:29, Jens Grivolla wrote: On 19/12/13 13:08, Peter Klügl wrote: Am 19.12.2013 12:31, schrieb Jens Grivolla: Ok, it's time to seriously get started on this. I guess we can start with the GSCL workshop description, and maybe make it more inclusive for other frameworks (GATE, etc.) We need a couple of organizers (me, Renaud, ...?) and a potential PC (again, start with the one from GSCL) preferably with a few already confirmed PC members (Richard, ...) If the workshop is more inclusive for other frameworks, maybe it's reasonable to ask one of the GATE people whether they want to co-organize the workshop. Yes, we definitely would need to reach out to them. First we need to decide: do we want a more focused workshop (just UIMA), or are the problems faced by GATE users (and others) sufficiently similar that we can learn from each other? If we want to get the GATE people in there: does anybody have contacts in that community? I won't be able to help with the organization, but maybe as a part of the PC. I take that as having you as a confirmed PC member ;-) I can also not promise that I will submit something, but I will motivate our working group. Ok, that's great. I started the draft proposal here: https://github.com/jgrivolla/coling2014-nlp-framework-workshop Thanks, Jens
Re: COLING 2014 - some information
Dear Luca and Sylvain, as you can see the workshop is still in the proposal phase. If it is accepted by the COLING organizers, pricing etc. will be set by them. It will of course be possible to attend without presenting a paper, and on the other hand we are open to all kinds of contributions, in particular those related to industry use of UIMA. Best regards, Jens On 15/01/14 11:36, Sylvain Surcin wrote: Hello, I am also interested in joining this workshop about UIMA. We have been running a full UIMA-driven processing chain in my company for years and are in the process of releasing some components as open source together with University of Marne-la-Vallée (France). It could be interesting to disseminate some info about that. Best regards, -- Sylvain SURCIN, Ph.D. *KWAGA* Senior Software Architect 15, rue Jean-Baptiste Berlier 75013 Paris France Tél.: +33 (0)1.55.43.79.20 On Wed, Jan 15, 2014 at 11:16 AM, Luca Foppiano l...@foppiano.org wrote: Dear all, I'm a new member of this mailing list and a new user of Apache UIMA. Starting to use UIMA, I'm facing exactly the problems listed in the introduction of the web page ( https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/proposal.md ). :) I'm very interested in joining this conference/workshop, and I would like to know if it is possible to join it as an attendee. I'm not affiliated with any university or research center. My plan is to participate in SemEval, and since COLING shares the location, to join it as well. Is there any limitation or price for it? Thanks in advance -- Luca Foppiano Software Engineer +31615253280 l...@foppiano.org www.foppiano.org
Re: next UIMA workshop?
Just a quick update: the proposal is progressing nicely, with very positive response from the GATE people. In fact, it will be co-organised by a GATE core team member and several core developers are on the PC. The CFP itself must still be rewritten to be less UIMA-centric, other than that this is starting to look quite good. Any input is welcome, so if you have any suggestions hurry up... -- Jens On 15/01/14 10:41, Jens Grivolla wrote: Thanks, fixed. On 14/01/14 19:04, Peter Klügl wrote: Hi, Just a small correction: The last workshop had nine paper presentations and one invited talk. Best, Peter Am 14.01.2014 18:11, schrieb Jens Grivolla: Hello, there's only 5 days remaining to submit the workshop proposal. Please anybody interested get in touch. I sent a mail to the GATE user list to get some input from them. The proposal draft is here: https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/proposal.md -- Jens On 19/12/13 13:29, Jens Grivolla wrote: On 19/12/13 13:08, Peter Klügl wrote: Am 19.12.2013 12:31, schrieb Jens Grivolla: Ok, it's time to seriously get started on this. I guess we can start with the GSCL workshop description, and maybe make it more inclusive for other frameworks (GATE, etc.) We need a couple of organizers (me, Renaud, ...?) and a potential PC (again, start with the one from GSCL) preferably with a few already confirmed PC members (Richard, ...) If the workshop is more inclusive for other frameworks, maybe it's reasonable to ask one of the GATE people whether they want to co-organize the workshop. Yes, we definitely would need to reach out to them. First we need to decide: do we want a more focused workshop (just UIMA), or are the problems faced by GATE users (and others) sufficiently similar that we can learn from each other? If we want to get the GATE people in there: does anybody have contacts in that community? I won't be able to help with the organization, but maybe as a part of the PC. 
I take that as having you as a confirmed PC member ;-) I can also not promise that I will submit something, but I will motivate our working group. Ok, that's great. I started the draft proposal here: https://github.com/jgrivolla/coling2014-nlp-framework-workshop Thanks, Jens
Re: next UIMA workshop?
Hello, there's only 5 days remaining to submit the workshop proposal. Please anybody interested get in touch. I sent a mail to the GATE user list to get some input from them. The proposal draft is here: https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/proposal.md -- Jens On 19/12/13 13:29, Jens Grivolla wrote: On 19/12/13 13:08, Peter Klügl wrote: Am 19.12.2013 12:31, schrieb Jens Grivolla: Ok, it's time to seriously get started on this. I guess we can start with the GSCL workshop description, and maybe make it more inclusive for other frameworks (GATE, etc.) We need a couple of organizers (me, Renaud, ...?) and a potential PC (again, start with the one from GSCL) preferably with a few already confirmed PC members (Richard, ...) If the workshop is more inclusive for other frameworks, maybe it's reasonable to ask one of the GATE people whether they want to co-organize the workshop. Yes, we definitely would need to reach out to them. First we need to decide: do we want a more focused workshop (just UIMA), or are the problems faced by GATE users (and others) sufficiently similar that we can learn from each other? If we want to get the GATE people in there: does anybody have contacts in that community? I won't be able to help with the organization, but maybe as a part of the PC. I take that as having you as a confirmed PC member ;-) I can also not promise that I will submit something, but I will motivate our working group. Ok, that's great. I started the draft proposal here: https://github.com/jgrivolla/coling2014-nlp-framework-workshop Thanks, Jens
Re: next UIMA workshop?
As I understand it, poster presentations are only used as a way to offload submissions that didn't make it as a full paper. I don't think that such a distinction is useful for this workshop and would prefer to have oral presentations for all interesting contributions. If we expected to have significantly more contributions that can fit into the schedule then concentrating some of them into a poster session might make sense, but I don't think this is the case. If on the other hand posters were used to get additional visibility outside of the workshop then this could be interesting... -- Jens On 14/01/14 18:36, Michael Tanenblatt wrote: I’ll certainly be on the Program Committee, and am willing to help in any ways that I am able. Regarding the proposal, overall it looks pretty reasonable, but what is the reason for limiting to oral presentations and omitting posters? ..m On Jan 14, 2014, at 12:11 PM, Jens Grivolla j+...@grivolla.net wrote: Hello, there's only 5 days remaining to submit the workshop proposal. Please anybody interested get in touch. I sent a mail to the GATE user list to get some input from them. The proposal draft is here: https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/proposal.md -- Jens On 19/12/13 13:29, Jens Grivolla wrote: On 19/12/13 13:08, Peter Klügl wrote: Am 19.12.2013 12:31, schrieb Jens Grivolla: Ok, it's time to seriously get started on this. I guess we can start with the GSCL workshop description, and maybe make it more inclusive for other frameworks (GATE, etc.) We need a couple of organizers (me, Renaud, ...?) and a potential PC (again, start with the one from GSCL) preferably with a few already confirmed PC members (Richard, ...) If the workshop is more inclusive for other frameworks, maybe it's reasonable to ask one of the GATE people whether they want to co-organize the workshop. Yes, we definitely would need to reach out to them. 
First we need to decide: do we want a more focused workshop (just UIMA), or are the problems faced by GATE users (and others) sufficiently similar that we can learn from each other? If we want to get the GATE people in there: does anybody have contacts in that community? I won't be able to help with the organization, but maybe as a part of the PC. I take that as having you as a confirmed PC member ;-) I can also not promise that I will submit something, but I will motivate our working group. Ok, that's great. I started the draft proposal here: https://github.com/jgrivolla/coling2014-nlp-framework-workshop Thanks, Jens
Re: next UIMA workshop?
Ok, it's time to seriously get started on this. I guess we can start with the GSCL workshop description, and maybe make it more inclusive for other frameworks (GATE, etc.) We need a couple of organizers (me, Renaud, ...?) and a potential PC (again, start with the one from GSCL) preferably with a few already confirmed PC members (Richard, ...) I'll get started with a first draft. Any input is welcome. Please also indicate if you plan to submit an article, in order to have a first idea of what to expect... Thanks, Jens On 21/10/13 11:44, Jens Grivolla wrote: Hi, at GSCL 2013 we talked a bit about options for the next UIMA workshop. How about trying to have it at COLING 2014? WORKSHOP TIMELINE • 19th January 2014: Workshop proposals due • 26th January 2014: Notification of workshop acceptances • 18th July 2014: Camera-ready deadline for workshop proceedings • 23rd and 24th August 2014: COLING Workshops http://www.coling-2014.org/workshop-call.php So that would be approximately one year after the GSCL workshop which would probably give enough time for people to have new things to present, and there are still 3 months before submitting the workshop proposal. COLING is going to be in Dublin, which makes it relatively easy to attend for the European UIMA community. What do you think? Bye, Jens
Re: big offsets efficiency, and multiple offsets
I agree that it might make more sense to model our needs more directly instead of trying to squeeze it into the schema we normally use for text processing. But at the same time I would of course like to avoid having to reimplement many of the things that are already available when using AnnotationBase. For the cross-view indexing issue I was thinking of creating individual views for each modality and then a merged view that just contains a subset of annotations of each view, and on which we would do the cross-modal reasoning. I just looked again at the GaleMultiModalExample (not much there, unfortunately) and saw that e.g. AudioSpan derives from AnnotationBase but still has float values for begin/end. I would be really interested in learning more about what was done in GALE, but it's hard to find any relevant information... Thanks, Jens On 04/12/13 20:16, Marshall Schor wrote: Echoing Richard, 1) It would perhaps make more sense to be more direct about each of the different types of data. UIMA built-in only the most popular things - and Annotation was one of them. Annotation derives from Annotation-base, which just defines an associated Sofa / view. So it would make more sense to define different kinds of highest-level abstractions for your project, related to the different kinds of views/sofas. Audio might entail a begin / end style of offsets; Images might entail a pair x-y coordinates, to describe a (square) subset of an image. Video might do something like audio, or something more complex... 
UIMA's use of the AnnotationBase includes ensuring that when you add-to-indexes (an operation that implicitly takes a view - and adds a FS to that view), that if the FS is a subtype of AnnotationBase, then the FS must be indexed in the associated view to which that FS belongs; if you try to add-to-index in a view other than the one the FS was created in, you get this kind of error: Error - the Annotation {0} is over view {1} and cannot be added to indexes associated with the different view {2}. The logic behind this restriction is: an Annotation (or, more generally, an object having a supertype of AnnotationBase) is (by definition) associated with a particular Sofa/View, and it is more likely that it is an error if that annotation is indexed with a sofa it doesn't belong with. Of course, Feature Structures which are not Annotations (or more generally, not derived from AnnotationBase), can be indexed in multiple views. 2) By keeping separate notions for pointers-into-the-Sofa, you can define algorithmic mappings for these that make the best sense for your project, including notions of fuzziness, time-shift (imagine the audio is out-of-sync with the video, like lots of YouTube things seem to be), etc. -Marshall On 12/4/2013 9:31 AM, Jens Grivolla wrote: Hi, we're now starting the EUMSSI project, which deals with integrating annotation layers coming from audio, video and text analysis. We're thinking to base it all on UIMA, having different views with separate audio, video, transcribed text, etc. sofas. In order to align the different views we need to have a common offset specification that allows us to map e.g. character offsets to the corresponding timestamps. In order to avoid float timestamps (which would mean we can't derive from Annotation) I was thinking of using audio/video frames with e.g. 100 or 1000 frames/second. 
Annotation has begin and end defined as signed 32 bit ints, leaving sufficient room for very long documents even at 1000 fps, so I don't think we're going to run into any limits there. Is there anything that could become problematic when working with offsets that are probably quite a bit larger than what is typically found with character offsets? Also, can I have several indexes on the same annotations in order to work with character offsets for text analysis, but then efficiently query for overlapping annotations from other views based on frame offsets? Btw, if you're interested in the project we have a writeup (condensed from the project proposal) here: https://dl.dropboxusercontent.com/u/4169273/UIMA_EUMSSI.pdf and there will hopefully soon be some content on http://eumssi.eu/ Thanks, Jens
Re: big offsets efficiency, and multiple offsets
I forgot to say that the text analysis view(s) will necessarily have to use character offsets so that we can obtain the coveredText, which means that all resulting annotations will also use character offsets. The merged view will need to use time-based offsets which means that we have to recreate the annotations there with mapped offsets rather than just index the same annotations in a different view. I think that basically means that we won't do much cross-view querying but rather have one component (AE) that reads from all views and creates a new one with new independent annotations after mapping the offsets. -- Jens On 05/12/13 10:04, Jens Grivolla wrote: I agree that it might make more sense to model our needs more directly instead of trying to squeeze it into the schema we normally use for text processing. But at the same time I would of course like to avoid having to reimplement many of the things that are already available when using AnnotationBase. For the cross-view indexing issue I was thinking of creating individual views for each modality and then a merged view that just contains a subset of annotations of each view, and on which we would do the cross-modal reasoning. I just looked again at the GaleMultiModalExample (not much there, unfortunately) and saw that e.g. AudioSpan derives from AnnotationBase but still has float values for begin/end. I would be really interested in learning more about what was done in GALE, but it's hard to find any relevant information... Thanks, Jens On 04/12/13 20:16, Marshall Schor wrote: Echoing Richard, 1) It would perhaps make more sense to be more direct about each of the different types of data. UIMA built-in only the most popular things - and Annotation was one of them. Annotation derives from Annotation-base, which just defines an associated Sofa / view. So it would make more sense to define different kinds of highest-level abstractions for your project, related to the different kinds of views/sofas. 
Audio might entail a begin / end style of offsets; Images might entail a pair x-y coordinates, to describe a (square) subset of an image. Video might do something like audio, or something more complex... UIMA's use of the AnnotationBase includes insuring that when you add-to-indexes (an operation that implicitly takes a view - and adds a FS to that view), that if the FS is a subtype of AnnotationBase, then the FS must be indexed in the associated view to which that FS belongs; if you try to add-to-index in a view other than the one the FS was created in, you get this kind of error: Error - the Annotation {0} is over view {1} and cannot be added to indexes associated with the different view {2}. The logic behind this restriction is: an Annotation (or, more generally, an object having a supertype of AnnotationBase) is (by definition) associated with a particular Sofa/View, and it is more likely that it is an error if that annotation is indexed with a sofa it doesn't belong with. Of course, Feature Structures which are not Annotations (or more generally, not derived from AnnotationBase), can be indexed in multiple views. 2) By keeping separate notions for pointers-into-the-Sofa, you can define algorithmic mappings for these that make the best sense for your project, including notions of fuzzyness, time-shift (imagine the audio is out-of-sync with the video, like lots of u-tube things seem to be), etc. -Marshall On 12/4/2013 9:31 AM, Jens Grivolla wrote: Hi, we're now starting the EUMSSI project, which deals with integrating annotation layers coming from audio, video and text analysis. We're thinking to base it all on UIMA, having different views with separate audio, video, transcribed text, etc. sofas. In order to align the different views we need to have a common offset specification that allows us to map e.g. character offsets to the corresponding timestamps. 
In order to avoid float timestamps (which would mean we can't derive from Annotation) I was thinking of using audio/video frames with e.g. 100 or 1000 frames/second. Annotation has begin and end defined as signed 32 bit ints, leaving sufficient room for very long documents even at 1000 fps, so I don't think we're going to run into any limits there. Is there anything that could become problematic when working with offsets that are probably quite a bit larger than what is typically found with character offsets? Also, can I have several indexes on the same annotations in order to work with character offsets for text analysis, but then efficiently query for overlapping annotations from other views based on frame offsets? Btw, if you're interested in the project we have a writeup (condensed from the project proposal) here: https://dl.dropboxusercontent.com/u/4169273/UIMA_EUMSSI.pdf and there will hopefully soon be some content on http://eumssi.eu/ Thanks, Jens
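The offset remapping described above can be sketched without any UIMA types: keep a sorted table of known alignment points (character offset in the transcript mapped to a frame offset in the media, e.g. from word-level recognizer timings) and look each annotation boundary up in it. The helper below is purely illustrative, not part of the UIMA API:

```java
import java.util.TreeMap;

// Illustrative sketch: map character offsets in a transcript view to
// frame offsets in a time-based merged view, using known alignment
// points. Assumes at least one alignment point at or before any
// queried offset (otherwise floorEntry() returns null).
public class OffsetMapper {
    private final TreeMap<Integer, Integer> charToFrame = new TreeMap<>();

    public void addAlignmentPoint(int charOffset, int frameOffset) {
        charToFrame.put(charOffset, frameOffset);
    }

    /** Frame offset of the nearest alignment point at or before charOffset. */
    public int toFrame(int charOffset) {
        return charToFrame.floorEntry(charOffset).getValue();
    }
}
```

A component reading from all views could use something like this to recreate the annotations with mapped offsets in the merged view, rather than indexing the same feature structures twice.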
big offsets efficiency, and multiple offsets
Hi, we're now starting the EUMSSI project, which deals with integrating annotation layers coming from audio, video and text analysis. We're thinking to base it all on UIMA, having different views with separate audio, video, transcribed text, etc. sofas. In order to align the different views we need to have a common offset specification that allows us to map e.g. character offsets to the corresponding timestamps. In order to avoid float timestamps (which would mean we can't derive from Annotation) I was thinking of using audio/video frames with e.g. 100 or 1000 frames/second. Annotation has begin and end defined as signed 32 bit ints, leaving sufficient room for very long documents even at 1000 fps, so I don't think we're going to run into any limits there. Is there anything that could become problematic when working with offsets that are probably quite a bit larger than what is typically found with character offsets? Also, can I have several indexes on the same annotations in order to work with character offsets for text analysis, but then efficiently query for overlapping annotations from other views based on frame offsets? Btw, if you're interested in the project we have a writeup (condensed from the project proposal) here: https://dl.dropboxusercontent.com/u/4169273/UIMA_EUMSSI.pdf and there will hopefully soon be some content on http://eumssi.eu/ Thanks, Jens
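To put numbers on the headroom: at 1000 frames per second, a signed 32-bit begin/end offset still covers roughly 24 days of media, far beyond any single document. A quick back-of-the-envelope check in plain Java (no UIMA dependency):

```java
public class FrameOffsetCapacity {
    public static void main(String[] args) {
        final int framesPerSecond = 1000;
        int maxSeconds = Integer.MAX_VALUE / framesPerSecond; // 2_147_483 s
        int maxHours = maxSeconds / 3600;                     // 596 h
        int maxDays = maxHours / 24;                          // 24 d
        System.out.println(maxDays + " days of media fit in a signed int at "
                + framesPerSecond + " fps");
    }
}
```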
Re: big offsets efficiency, and multiple offsets
True, but don't things like selectCovered() etc. expect Annotations (to match on begin/end)? So using Annotation might make it easier in some cases to select the annotations we're interested in. -- Jens On 04/12/13 15:35, Richard Eckart de Castilho wrote: Why is it bad if you cannot inherit from Annotation? The getCoveredText() will not work anyway if you are working with audio/video data. -- Richard On 04.12.2013, at 12:31, Jens Grivolla j+...@grivolla.net wrote: Hi, we're now starting the EUMSSI project, which deals with integrating annotation layers coming from audio, video and text analysis. We're thinking to base it all on UIMA, having different views with separate audio, video, transcribed text, etc. sofas. In order to align the different views we need to have a common offset specification that allows us to map e.g. character offsets to the corresponding timestamps. In order to avoid float timestamps (which would mean we can't derive from Annotation) I was thinking of using audio/video frames with e.g. 100 or 1000 frames/second. Annotation has begin and end defined as signed 32 bit ints, leaving sufficient room for very long documents even at 1000 fps, so I don't think we're going to run into any limits there. Is there anything that could become problematic when working with offsets that are probably quite a bit larger than what is typically found with character offsets? Also, can I have several indexes on the same annotations in order to work with character offsets for text analysis, but then efficiently query for overlapping annotations from other views based on frame offsets? Btw, if you're interested in the project we have a writeup (condensed from the project proposal) here: https://dl.dropboxusercontent.com/u/4169273/UIMA_EUMSSI.pdf and there will hopefully soon be some content on http://eumssi.eu/ Thanks, Jens
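Under the hood, queries like selectCovered() reduce to integer comparisons on begin/end, so they behave the same whether the offsets count characters or frames. A minimal sketch of the two relevant predicates on half-open int spans:

```java
// Minimal span predicates on int begin/end offsets, half-open [begin, end).
// They apply identically to character offsets and frame offsets.
public class Spans {
    /** True if [b2, e2) lies within [b1, e1) — the "covered" relation. */
    public static boolean covers(int b1, int e1, int b2, int e2) {
        return b1 <= b2 && e2 <= e1;
    }

    /** True if the two spans share at least one position. */
    public static boolean overlaps(int b1, int e1, int b2, int e2) {
        return b1 < e2 && b2 < e1;
    }
}
```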
Re: uimaFIT: managing component configurations
Hi, basically I'm looking for a way to manage engine descriptions. So far I'm using createEngineDescription(...) when building a pipeline. If my component, i.e. in the case of uimaFIT the class that implements the AE, defines good default values that is very easy and concise. However, I also have many cases where I have different descriptors/descriptions that use the same class, and sometimes I then again override some parameters. I would like to manage those descriptions separately from the pipeline where they are used, e.g. with Maven. I want to avoid copy and pasting common parameter configurations from one pipeline to the other. So far I was using different XML descriptors, each packaged in a separate PEAR as an independent component, and would build my pipeline based on those. With uimaFIT I haven't found a good way to do this. Basically I would like to have inheritance at the description level. This could either be through Java inheritance (i.e. CountryMapper.class extends ConceptMapper.class and overrides default parameter values) or through a way to store EngineDescriptions and reuse them, without having to resort to XML files. So I would in some place define countryMapper = createEngineDescription(ConceptMapper.class, parameters) and package that as a Maven artifact, and somewhere else use it to build a pipeline using createEngineDescription(countryMapper, additional_parameters). My problem is that I don't think I can override the default values with Java inheritance, and don't have a good way to package EngineDescriptions. I guess I could have a class with a static method that returns the engine description and package that, but it would be nice to have something more standard and elegant. Thanks, Jens On 11/28/2013 12:37 AM, Richard Eckart de Castilho wrote: Hi, I'm not sure that I understand what you want to do. When you create a descriptor for a component e.g. 
using createEngineDescription(…), this descriptor is configured with the default values (unless you override them in the call to createEngineDescription). You can change parameters on such a descriptor using ResourceCreationSpecifierFactory.setConfigurationParameters(…) Does that help? Can you make a more vivid example of what you are trying to accomplish, maybe with a bit of pseudo-code marking those places that remain unclear how to handle them? Cheers, -- Richard On 27.11.2013, at 07:47, Jens Grivolla j+...@grivolla.net wrote: Hi, so far we were using PEARs to manage different configurations of components, e.g. having a CountryMapper, CityMapper, PersonMapper, etc., all based on ConceptMapper but with different settings/models. How would I do that in uimaFIT? Basically I would like to create components that just override the default values for parameters/resources. In some cases, parameters are additionally overridden at the pipeline level (CPE/uimaFIT), e.g. when using a database CasConsumer where we would have several base configurations (e.g. annotation to DB column mappings), but then override the DB connection settings in the pipeline. Having the full configuration at the pipeline level makes it much more difficult to manage configurations, so I would like to be able to point to a given component and automatically get the correct default settings. Thanks, Jens
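The semantics under discussion (a component's declared defaults, selectively overridden when the descriptor is created) can be illustrated without the uimaFIT API as a two-level parameter merge; all names below are made up for the example:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of default-plus-override parameter resolution —
// conceptually what happens when explicit values are passed on top of
// a component's declared defaults.
public class ParamMerge {
    public static Map<String, Object> resolve(Map<String, Object> defaults,
                                               Map<String, Object> overrides) {
        Map<String, Object> resolved = new HashMap<>(defaults);
        resolved.putAll(overrides); // explicit values win over defaults
        return resolved;
    }
}
```

A "CountryMapper" would then just be a base "ConceptMapper" defaults map plus a small override map (dictionary, case matching), stored and versioned separately from the pipelines that use it.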
uimaFIT: managing component configurations
Hi, so far we were using PEARs to manage different configurations of components, e.g. having a CountryMapper, CityMapper, PersonMapper, etc., all based on ConceptMapper but with different settings/models. How would I do that in uimaFIT? Basically I would like to create components that just override the default values for parameters/resources. In some cases, parameters are additionally overridden at the pipeline level (CPE/uimaFIT), e.g. when using a database CasConsumer where we would have several base configurations (e.g. annotation to DB column mappings), but then override the DB connection settings in the pipeline. Having the full configuration at the pipeline level makes it much more difficult to manage configurations, so I would like to be able to point to a given component and automatically get the correct default settings. Thanks, Jens
Re: uimaFIT: external resource bindings?
Ok, I guess I don't actually need to do that, ConceptMapper only looks for the key and doesn't seem to know about the indirect binding, right? And in uimaFIT if I want to bind the same resource to several AEs I use createExternalResourceDescription() and then just pass it like any other parameter to createEngineDescription()? Bye, Jens On 10/24/2013 05:28 PM, Jens Grivolla wrote: Hi, I'm trying to run ConceptMapper from uimaFIT, but createDependencyAndBind doesn't seem to allow to separate declaring the external resource (with a name) and binding that name to a key. I looked through ExternalResourceFactory but didn't find any method that seems to obviously do what I need. What should I do? Btw, I updated ConceptMapper to be based on JCasAnnotator_ImplBase instead of Annotator_ImplBase and TextAnnotator (both deprecated). Bye, Jens
next UIMA workshop?
Hi, at GSCL 2013 we talked a bit about options for the next UIMA workshop. How about trying to have it at COLING 2014? WORKSHOP TIMELINE • 19th January 2014: Workshop proposals due • 26th January 2014: Notification of workshop acceptances • 18th July 2014: Camera-ready deadline for workshop proceedings • 23rd and 24th August 2014: COLING Workshops http://www.coling-2014.org/workshop-call.php So that would be approximately one year after the GSCL workshop which would probably give enough time for people to have new things to present, and there are still 3 months before submitting the workshop proposal. COLING is going to be in Dublin, which makes it relatively easy to attend for the European UIMA community. What do you think? Bye, Jens
Re: Working with very large text documents
On 10/18/2013 10:06 AM, Armin Wegner wrote: What are you doing with very large text documents in a UIMA pipeline, for example 9 GB in size? Just out of curiosity, how can you possibly have 9 GB of text that represents one document? From a quick look at Project Gutenberg it seems that a full book with HTML markup is about 500kB to 1MB, so that's about a complete public library full of books. Bye, Jens
Re: AW: Working with very large text documents
Ok, but then log files are usually very easy to split since they normally consist of independent lines. So you could just have one document per day or whatever gets it down to a reasonable size, without the risk of breaking grammatical or semantic relationships. On 10/18/2013 12:25 PM, Armin Wegner wrote: Hi Jens, It's a log file. Cheers, Armin -Ursprüngliche Nachricht- Von: Jens Grivolla [mailto:j+...@grivolla.net] Gesendet: Freitag, 18. Oktober 2013 11:05 An: user@uima.apache.org Betreff: Re: Working with very large text documents On 10/18/2013 10:06 AM, Armin Wegner wrote: What are you doing with very large text documents in an UIMA Pipeline, for example 9 GB in size. Just out of curiosity, how can you possibly have 9GB of text that represent one document? From a quick look at project gutenberg it seems that a full book with HTML markup is about 500kB to 1MB, so that's about a complete public library full of books. Bye, Jens
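A day-based splitter along those lines is straightforward; this sketch assumes each log line starts with an ISO date prefix, which of course depends on the actual log format:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: group independent log lines into one "document" per day,
// assuming lines start with an ISO date, e.g. "2013-10-18 10:06:01 ...".
public class LogSplitter {
    public static Map<String, StringBuilder> splitByDay(List<String> lines) {
        Map<String, StringBuilder> docs = new LinkedHashMap<>();
        for (String line : lines) {
            String day = line.substring(0, 10); // "YYYY-MM-DD" prefix
            docs.computeIfAbsent(day, k -> new StringBuilder())
                .append(line).append('\n');
        }
        return docs;
    }
}
```

Each resulting per-day document can then be fed into the pipeline as a separate CAS, keeping memory use bounded without breaking any within-line relationships.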
Re: uimafit maven plugin: type system imports?
I gave up on integrating uimaFIT-based builds with PEAR packaging, there are fundamental differences that I don't know how to resolve cleanly, in particular: uimaFIT: 1 maven artifact = N analysis engines = N generated descriptors PEAR packaging maven plugin: 1 mvn artifact = 1 AE = 1 descriptor = 1 generated PEAR I don't think it's worth it to extend the PEAR packaging maven plugin to generate multiple PEARs, so we'll just stick with having PEAR packaging as something separate. I'm actually thinking of separating components packaged as PEARs (as described by the XML descriptors) from analysis engines (the actual code) packaged as JARs, with separate namespaces. That's pretty much the separation we have right now, but without the separate namespaces. In that case it would be clear that a component is basically a packaged engine (with parameter settings, etc.). I created UIMA-3346 (https://issues.apache.org/jira/browse/UIMA-3346) as for other descriptor based workflows it would still be very useful to have automatically generated descriptors that are ready to use with type system imports. Bye, Jens On 10/08/2013 12:04 PM, Jens Grivolla wrote: Hi, I'm still having some other problems in getting it to work well with the pear packaging plugin (naming conventions, descriptor locations, etc.), so I'm not sure if I can create a fully automated build. It would still be nice to not have to edit the descriptor manually, but since I have to do some manual steps anyway it's not as important to get it fixed right now. I'll create the feature request anyway, as it would be quite useful for people using CPE, UIMA-AS or other descriptor-based deployments... Bye, Jens On 10/04/2013 01:36 PM, Richard Eckart de Castilho wrote: It is a known gap. I deliberately left this out of the current version because the auto-detect mechanism (types.txt) may detect much more than the component needs. 
Input/output capabilities are also not a reliable source of information, in particular for components in which types are configured via parameters. I don't think it would be difficult to add. Please open a feature request if you need this, along with a motivation. If you can spare the time, patches are surely welcome. It would probably be good to have this enabled by default, but allow to disable it. Cheers, -- Richard On 04.10.2013, at 12:41, Jens Grivolla j+...@grivolla.net wrote: Hi, I tried using the uimafit maven plugin, in particular the generate goal (trying to make it play nice with the pear packaging plugin). However, the generated descriptor does not include the type system imports, even though they are specified through types.txt. Is there some way to get those imports in the descriptor? Thanks, Jens
Re: uimafit maven plugin: type system imports?
I don't think having more than one AE per PEAR would work, so the only solution would be to generate several PEARs from one project / Maven module. This would introduce considerable additional complexity (it would have to discover all available components, etc.), and at least for us it's not worth it. Don't worry about it; having PEAR packaging as something separate (possibly with modifications to the descriptor, etc.) and needing some manual steps to do it is no big deal. We might even move away from using PEARs and instead use uimaFIT-based pipeline assembly for most of our work... Thanks for your great work, Jens On 10/14/2013 11:55 AM, Richard Eckart de Castilho wrote: It would be possible to have just one AE per Maven module so uimaFIT generates only one descriptor. How do you imagine handling it if a PEAR module contains more than one AE? How would the PEAR work? -- Richard On 14.10.2013, at 11:30, Jens Grivolla j+...@grivolla.net wrote: I gave up on integrating uimaFIT-based builds with PEAR packaging; there are fundamental differences that I don't know how to resolve cleanly, in particular:

uimaFIT: 1 Maven artifact = N analysis engines = N generated descriptors
PEAR packaging Maven plugin: 1 Maven artifact = 1 AE = 1 descriptor = 1 generated PEAR

I don't think it's worth it to extend the PEAR packaging Maven plugin to generate multiple PEARs, so we'll just stick with having PEAR packaging as something separate. I'm actually thinking of separating components packaged as PEARs (as described by the XML descriptors) from analysis engines (the actual code) packaged as JARs, with separate namespaces. That's pretty much the separation we have right now, but without the separate namespaces. In that case it would be clear that a component is basically a packaged engine (with parameter settings, etc.).
I created UIMA-3346 (https://issues.apache.org/jira/browse/UIMA-3346) as for other descriptor based workflows it would still be very useful to have automatically generated descriptors that are ready to use with type system imports. Bye, Jens On 10/08/2013 12:04 PM, Jens Grivolla wrote: Hi, I'm still having some other problems in getting it to work well with the pear packaging plugin (naming conventions, descriptor locations, etc.), so I'm not sure if I can create a fully automated build. It would still be nice to not have to edit the descriptor manually, but since I have to do some manual steps anyway it's not as important to get it fixed right now. I'll create the feature request anyway, as it would be quite useful for people using CPE, UIMA-AS or other descriptor-based deployments... Bye, Jens On 10/04/2013 01:36 PM, Richard Eckart de Castilho wrote: It is a known gap. I deliberately left this out of the current version because the auto-detect mechanism (types.txt) may detect much more than the component needs. Input/output capabilities are also not a reliable source of information, in particular for components in which types are configured via parameters. I don't think it would be difficult to add. Please open a feature request if you need this, along with a motivation. If you can spare the time, patches are surely welcome. It would probably be good to have this enabled by default, but allow to disable it. Cheers, -- Richard On 04.10.2013, at 12:41, Jens Grivolla j+...@grivolla.net wrote: Hi, I tried using the uimafit maven plugin, in particular the generate goal (trying to make it play nice with the pear packaging plugin). However, the generated descriptor does not include the type system imports, even though they are specified through types.txt. Is there some way to get those imports in the descriptor? Thanks, Jens
Re: Designing collection readers: Reading multiple XML files containing multiple CASes
It sounds to me like it would be much easier to just have a custom collection reader that outputs one CAS per document (i.e. multiple CASes per input file), rather than having a CR that outputs one CAS per file (with just metadata) plus an additional AE to generate the real CASes from there. Do you have a specific reason for not simply writing a Collection Reader that does what you want? Bye, Jens On 10/07/2013 03:19 AM, swirl wrote: Hi, I am wondering if anyone has a better idea.

Requirements:
a. I have a pipeline that needs to process a bunch of XML files.
b. The XML files could be on disk, or at a remote location (available via an HTTP GET call, e.g. http://example.com/inputFiles/001.xml).
c. Each XML file contains multiple sections; each section's content should be parsed to produce a separate CAS.
d. I need to be able to parse XML of different schemas, although the assumption is that each pipeline run only handles one specific XML schema. That is, I do not need to handle different XML schemas within a single pipeline run.
e. With the above, I need to be able to construct a new collection reader and parser based on the specific needs of each application.
f. For example, I can specify that the XML files are in a disk folder, and that parser A should be used to decode the specific schema of the XML files. In another pipeline, I can give the collection reader a list of URLs to retrieve remote XML files and parse them using parser B.

Here is what I have so far:
a. I am using ClearTK's UriCollectionReader to insert URIs of files into the CAS from local disk folders and remote URIs. So far so good.
b. I created an AE, UriToDocumentAnnotatorA, that reads the URI in the CAS and parses the file according to XML schema A.
c. But the above only produces 1 CAS per XML file. Requirement c. is not fulfilled. I need to produce multiple CASes from a single XML file. How do I do this? Thanks in advance.
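For what it's worth, the per-section splitting such a custom reader would need is straightforward. A rough, self-contained sketch (the `<section>` element name is purely hypothetical, standing in for whatever the real schema uses); a CollectionReader's getNext() could emit one CAS per returned entry:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SectionSplitter {
    // Extract the text content of each <section> element; a custom
    // CollectionReader would iterate over this list in getNext(),
    // producing one CAS per section.
    public static List<String> split(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList sections = doc.getElementsByTagName("section");
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < sections.getLength(); i++) {
            texts.add(sections.item(i).getTextContent().trim());
        }
        return texts;
    }
}
```

The parser for schema B would then just be a different implementation of the same splitting step, selected via a configuration parameter.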
Re: uimafit maven plugin: type system imports?
Hi, I'm still having some other problems in getting it to work well with the pear packaging plugin (naming conventions, descriptor locations, etc.), so I'm not sure if I can create a fully automated build. It would still be nice to not have to edit the descriptor manually, but since I have to do some manual steps anyway it's not as important to get it fixed right now. I'll create the feature request anyway, as it would be quite useful for people using CPE, UIMA-AS or other descriptor-based deployments... Bye, Jens On 10/04/2013 01:36 PM, Richard Eckart de Castilho wrote: It is a known gap. I deliberately left this out of the current version because the auto-detect mechanism (types.txt) may detect much more than the component needs. Input/output capabilities are also not a reliable source of information, in particular for components in which types are configured via parameters. I don't think it would be difficult to add. Please open a feature request if you need this, along with a motivation. If you can spare the time, patches are surely welcome. It would probably be good to have this enabled by default, but allow to disable it. Cheers, -- Richard On 04.10.2013, at 12:41, Jens Grivolla j+...@grivolla.net wrote: Hi, I tried using the uimafit maven plugin, in particular the generate goal (trying to make it play nice with the pear packaging plugin). However, the generated descriptor does not include the type system imports, even though they are specified through types.txt. Is there some way to get those imports in the descriptor? Thanks, Jens
uimafit maven plugin: type system imports?
Hi, I tried using the uimafit maven plugin, in particular the generate goal (trying to make it play nice with the pear packaging plugin). However, the generated descriptor does not include the type system imports, even though they are specified through types.txt. Is there some way to get those imports in the descriptor? Thanks, Jens
Re: AW: Java level prerequisite upgrade?
Same here, our own stuff relies on higher versions of Java anyway. Jens On 07/29/2013 07:55 AM, armin.weg...@bka.bund.de wrote: No, not for me. You can even switch to Java 7. Armin -Original Message- From: Marshall Schor [mailto:m...@schor.com] Sent: Sunday, July 28, 2013 16:05 To: uima-user Subject: Java level prerequisite upgrade? Dear Users, The UIMA developers would like to be able to start using Java 6 language features; of course this would require users to be running this level or later. Currently, we require only Java 5 or later. Java 5 from various vendors is either past end-of-life or approaching it (meaning no updates, unless you have some special contracts). See http://www.oracle.com/technetwork/java/eol-135779.html or http://www.ibm.com/software/support/lifecycle/ If we started requiring Java 6 or later, would this be an issue for you? -Marshall Schor
Building UIMA AEs with Gradle?
Hi, we are sometimes running into problems with Maven when we want to define tasks to move resources into specific locations, etc. This often seems to lead to having to use quite a few Maven plugins and makes the POM hard to manage. Would Gradle be a better option, in order to have the dependency management from Maven while being able to more easily define custom manipulations of resources to help with packaging? Is it possible to generate PEAR packages from Gradle? AFAIK there are plugins for Maven and Ant, so would we then reference an Ant task from Gradle? Thanks, Jens
AE project structure
Hi, we currently (almost always) use the CPE to run our AEs (packaged as PEARs and then installed). However, we would like to start packaging our AEs differently to make it easier to also use them programmatically, or e.g. include them in Solr using SolrUima. To do so we have started to modify some of our annotators so they load their resources from the classpath instead of using a file path, and we are getting closer to being able to package everything in JAR files. However, the standard UIMA project structure puts things quite differently from a typical Maven layout, meaning that there's quite a bit of tweaking to make things both resolvable from the classpath and close to the UIMA structure. Should we just forget about uima.datapath and the /resource and /desc folders and put it all in /src/main/resources etc.? How compatible would that be with the PearPackagingMavenPlugin? I think we will move to using uimaFIT once it is released, but for some of the people here being able to have readily packaged PEAR files with descriptors that can be distributed is a big advantage that we don't want to give up. Thanks, Jens
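As an illustration of the classpath-loading change described above, a minimal sketch of a classpath-first loader with a file-path fallback (the resource name in the comment is made up; nothing here is UIMA API):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ResourceLoader {
    // Try the classpath first (works when resources are packaged into the
    // JAR, e.g. from src/main/resources), then fall back to a plain file
    // path (the old uima.datapath-style layout). A name like
    // "models/my-model.bin" would resolve either way.
    public static InputStream open(String name) throws IOException {
        InputStream in = Thread.currentThread()
                .getContextClassLoader()
                .getResourceAsStream(name);
        if (in != null) {
            return in;
        }
        File f = new File(name);
        if (f.isFile()) {
            return new FileInputStream(f);
        }
        throw new IOException("Resource not found on classpath or disk: " + name);
    }
}
```

Annotators written this way keep working from an installed PEAR (file path) while also being usable from a plain JAR dependency (classpath).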
managing resources for UIMA?
Hi, while not strictly a UIMA issue, we have a problem that seems very relevant in the context of UIMA analysis engines: how to manage large binary resources such as trained models used by an AE, etc. So far, we have managed to achieve a good separation between code development and the actual AEs, using Maven (and git for version control). An AE thus consists only of a POM referencing the code, the AE descriptor, and the resources used for the AE. The AE POMs are configured to generate PEAR archives that include all dependencies and resources. At this point we have the code in git, along with the AEs' POMs and descriptors, while we manually copy the resources to the directory before running `mvn package` (and exclude those resources from git). We're missing a way to manage those resources, including versioning etc. I'm guessing that this is a rather typical problem, so what solutions do you use? We're thinking of having all resources also in Maven (e.g. Artifactory) so we can reference them with a unique identifier and version. This would also help us when moving to more complex pipeline assemblies using uimaFIT instead of generating individual PEARs for each component in order to create complete packages. Btw, we are just very few core developers, with most of the team made up of linguists, so we want to make it easy for them to save versions of resources they create and assemble AEs by just referencing the algorithm and resource (e.g. create a new OpenNLP POStagger using spanish-pos-model.bin, version 1.2.3). Thanks for sharing your experiences with this... Jens
Re: Does the UIMA pipeline support analysis components written as mahout map-reduce jobs
What do you want to do? Map-reduce is batch processing, whereas a UIMA AE works online, so this doesn't really fit. In Mahout map-reduce is usually used for training, not e.g. for applying a trained classifier. So you would train whichever way you want (e.g. using map-reduce, etc.), but your UIMA AE would actually be a wrapper for an online classifier, not a map-reduce task. Best, Jens On 02/13/2013 11:47 PM, Som Satpathy wrote: Hi all, I have been toying around with UIMA pipelines for some time now. I was wondering if UIMA can support analysis components written as mahout map-reduce jobs as part of a UIMA pipeline ? I would appreciate any help/hints/pointers. Thanks, Som
graphical flow configuration in UIMA-HPC?
Hi, the UIMA-HPC page contains a nice screenshot of what looks like a graphical tool for configuring UIMA flows. Is it (or anything like it) available to the public? Thanks, Jens
UIMA for multimodal annotation?
Hi, we're thinking of using UIMA for multimodal multimedia annotation (text, video, audio, ...), but have found little information about people actually doing that. I did find an old post by Burn Lewis about donating the GALE type system (Donation of a widely used type system for multi-modal text analysis) but not much more. Thanks, Jens
Re: Parallel CAS consumer
Hi all, from what I understand this does not involve CAS multipliers at all, but simply a flow where all CAS consumers are done in one parallel step. Apparently this can't be done in a CPE, so you would need an aggregate of all the CAS consumers, and have a parallel flow controller for that aggregate. However, that wouldn't really do any good according to the documentation: "ParallelStep, which specifies that multiple Analysis Engines should receive the CAS next, and that the relative order in which these Analysis Engines execute does not matter. Logically, they can run in parallel. The runtime is not obligated to actually execute them in parallel, however, and the current implementation will execute them serially in an arbitrary order." Best, Jens On 10/10/2012 12:39 PM, Richard Eckart de Castilho wrote: Hi, I see. I think this is not possible. To my knowledge CPE (which you probably use) does not support CAS multipliers. I'm not too familiar with UIMA-AS; are you sure that it supports such a scenario? If you manage to realize the scenario as you described, it would be great to hear how you did it. Best, -- Richard On 10.10.2012 at 12:15, Timo Boehme timo.boe...@ontochem.com wrote: Hi, On 10.10.2012 12:05, Richard Eckart de Castilho wrote: the main difference between CAS consumers and analysis engines is that the former by default run only a single instance and the latter can be multiplied. If your consumer code can be run in parallel, just try inheriting from AnalysisEngine_ImplBase (or something like that) instead. Thanks for your answer. However, each single consumer must run as a single instance (e.g. one database consumer, one consumer writing to a file; each of them needs to run as a single instance). Thus I would like to have a single instance per consumer, but the different consumers running in parallel.
Kind regards, Timo On 10.10.2012 at 12:00, Timo Boehme timo.boe...@ontochem.com wrote: Hi, is there any possibility without using UIMA-AS to run different CAS consumer components of a pipeline in parallel? The standard behavior is that the consumers are called in sequence, but since in my case they don't depend on each other it would be more efficient to have them run in parallel. Can I use a CAS multiplier + flow control to achieve this?
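For completeness: outside of CPE/UIMA-AS, fanning one item out to several single-instance consumers is plain Java concurrency. A hedged sketch (the `Consumer` interface stands in for the actual CAS consumers; this is not a UIMA flow controller):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

public class ParallelFanout<T> {
    private final List<Consumer<T>> consumers;
    private final ExecutorService pool;

    // One thread per consumer: each consumer remains a single instance
    // (as Timo requires), but the consumers run concurrently with each
    // other for a given item.
    public ParallelFanout(List<Consumer<T>> consumers) {
        this.consumers = consumers;
        this.pool = Executors.newFixedThreadPool(consumers.size());
    }

    // Hand one item to every consumer and block until all are done,
    // propagating the first failure if any consumer throws.
    public void process(T item) throws Exception {
        List<Callable<Void>> tasks = new ArrayList<>();
        for (Consumer<T> c : consumers) {
            tasks.add(() -> { c.accept(item); return null; });
        }
        for (Future<Void> f : pool.invokeAll(tasks)) {
            f.get();
        }
    }

    public void shutdown() {
        pool.shutdown();
    }
}
```

Each consumer would still have to tolerate being called concurrently with the *other* consumers sharing the same CAS (read-only access), which is the assumption this sketch makes.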
Re: Clustering, Collapsing
This sounds like you are actually looking for the project next door: Mahout. UIMA really doesn't have a lot to do with clustering (although you could do some things). We do use UIMA for information extraction *before* clustering and sending it to Solr, though, as a sort of preprocessing to get relevant features from unstructured text. But it doesn't sound like that's what you're trying to do. HTH, Jens On 06/08/2012 05:44 PM, Deejay wrote: Hi all, I recently discovered Apache UIMA, and it looks like a very large project! I was hoping that someone more experienced with it than I could comment on whether there are parts of the project that could help with my problem. I need to go over many millions of objects (Protocol Buffers in HBase, as it happens), and cluster them according to their similarity. Once each cluster is formed, I need to 'collapse' each property of the objects to find the most prevalent value. After this, the collapsed object will be added to a Solr index. Would any part of Apache UIMA be useful for the clustering or collapsing, or have I misunderstood the nature of the project?
Re: Repackaging an unpackaged pear file
We actually do that all the time, it works perfectly. Some archive managers even let you edit the file without unpacking it. You may need to rename it from .pear to .zip and back to .pear when you're done. Jens On 04/26/2012 06:10 PM, Marshall Schor wrote: Thanks Thilo. Could you unzip the pear with an unzipper, and do the change to fix the file path and then zip it back up again? That way the variable replacement stuff wouldn't run. -Marshall On 4/26/2012 5:07 AM, Thilo Goetz wrote: On 25/04/12 23:20, Marshall Schor wrote: I hope it's trivial :-) (But I haven't tried it...). It's not trivial, because the pear installer destructively replaces variables with local paths on installation. If you don't know what you're doing, it will be much easier to ask the other team to get you the original pear file. There is no supported way to repackage an installed pear file. --Thilo -Marshall On 4/25/2012 1:15 PM, Mike O'Leary wrote: I received a copy of an application that works with UIMA a few weeks ago from some colleagues at another location. When I followed the instructions to install it, I got an error message while unpacking a pear file, and it looks like an XML file within it contains some hard-coded pathnames to a machine at the organization that sent our colleagues the application originally. I could ask them to get in touch with the organization and ask them to recreate the pear file with relative pathnames so it can be installed on machines on other networks, and I probably will do that. But I was wondering how hard it would be just to correct the pathnames, re-package the pear file, and reinstall that one. I have never worked with UIMA before, so I am learning the basics as I go. How complicated would it be to create an Eclipse project using the directory structure that the pear file expanded to, or to run a command line application that creates a pear file from that directory structure? Thanks, Mike
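For reference, the unzip-edit-rezip roundtrip Marshall describes can also be scripted, since a PEAR is just a zip. A rough self-contained sketch (class and method names are mine, not part of any UIMA tooling) that copies the archive while rewriting a hard-coded path in the XML descriptors:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class PearRepack {
    // Copy a PEAR (zip) to a new file, rewriting a hard-coded path in any
    // .xml entries on the way; all other entries are copied verbatim.
    public static void fixPaths(Path in, Path out, String from, String to)
            throws IOException {
        try (ZipFile zip = new ZipFile(in.toFile());
             ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(out))) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry e = entries.nextElement();
                zos.putNextEntry(new ZipEntry(e.getName()));
                try (InputStream is = zip.getInputStream(e)) {
                    byte[] data = is.readAllBytes();
                    if (e.getName().endsWith(".xml")) {
                        String s = new String(data, StandardCharsets.UTF_8);
                        data = s.replace(from, to).getBytes(StandardCharsets.UTF_8);
                    }
                    zos.write(data);
                }
                zos.closeEntry();
            }
        }
    }
}
```

As Thilo notes, this only makes sense if you know which variables the installer already replaced; it sidesteps the installer rather than undoing it.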
Re: Unusable Document Analyzer because of too small font sizes
On 03/25/2012 03:35 PM, Eric Buist wrote: [UIMA chooses bad look-and-feel on some platforms] Fortunately, I found a workaround: pass -Dswing.systemlaf=com.sun.java.swing.plaf.gtk.GTKLookAndFeel in the JVM arguments. That overrides the JVM's bad guess and the fallback to Metal. Note that the usual property name would be swing.defaultlaf, but I had to use swing.systemlaf because of the DocAnalyzer activating the system look and feel (this would have normally worked, though). If I could find a way to have this passed all the time, without having to change the launch configuration of each UIMA tool, I would be happier, but this is already a very good step forward. At least I am not blocked anymore by this, and can continue exploring under Linux. Pretty much all UIMA tools that you run from the command line are run through runUimaClass.sh, so whatever settings you make there should work pretty much universally. HTH, Jens
InlineXMLCasConsumer fails depending on locale
Hi, it appears that InlineXMLCasConsumer depends on the system locale for some internal transformations. The output appears to be written in UTF-8 (outStream.write(xmlAnnotations.getBytes("UTF-8"));) but when used on a machine with a locale of ASCII all accented characters get broken. I suspect that it has to do with the XMLSerializer working on a ByteArrayOutputStream, but I haven't been able to track it down yet. Any ideas? Bye, Jens
Re: InlineXMLCasConsumer fails depending on locale
On 02/21/2012 04:08 PM, Thilo Goetz wrote: On 21/02/12 15:59, Jens Grivolla wrote: it appears that InlineXMLCasConsumer depends on the system locale for some internal transformations. The output appears to be written in UTF-8 (outStream.write(xmlAnnotations.getBytes("UTF-8"));) but when used on a machine with a locale of ASCII all accented characters get broken. I suspect that it has to do with the XMLSerializer working on a ByteArrayOutputStream, but haven't been able to track it down yet. Have you checked that it's really the writing end where things get corrupted, and not the reading end? Just a thought... Yes, we have an XmiWriterCasConsumer in parallel that works fine. Jens
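For reference, this class of bug almost always comes from one char-to-byte boundary that silently uses the platform default charset. The safe pattern is to name the charset explicitly at every such boundary; a minimal self-contained sketch:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8Out {
    // Serialize text to bytes with an explicit charset. Calling
    // text.getBytes() or new OutputStreamWriter(stream) without a charset
    // argument silently uses the platform default, which is where
    // locale-dependent corruption of accented characters creeps in.
    public static byte[] toUtf8(String text) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(buf, StandardCharsets.UTF_8)) {
            w.write(text);
        }
        return buf.toByteArray();
    }
}
```

Running the suspect consumer with -Dfile.encoding varied is a quick way to confirm that a default-charset call is the culprit.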
Re: UIMA Python integration?
Hi Nicolas, we haven't really made any progress. Right now we're using only Java within the UIMA pipeline (and one C++ annotator). We then generate XMIs (or in some cases inline XML to get annotations aligned automatically) and work on that in Python, without a library and probably not even dealing with the XMI format entirely correctly. :-( Anyway, things are not pretty, but we just don't have time to actually develop a better solution. Bye, Jens On 12/16/2011 12:13 AM, Nicolas Hernandez wrote: Hi 6 months later. Jens what experience have you learn about UIMA and Python ? Is Pythonnator still the simplest solution for working with XMI ? No other alternative ? Best /Nicolas On Wed, May 4, 2011 at 11:48 PM, Eddie Epsteineaepst...@gmail.com wrote: The last update with uimapy on Apache UIMA was that it had problems deserializing somewhat complex XmiCas examples. The previous problem with jython was that it was backlevel relative to the needs of some python analytic code. Jython seems like the simplest integration, assuming it works. The Pythonnator requires a uimacpp runtime. More complicated, but perhaps a much faster python execution environment? Uimacpp fully supports XmiCas serialization methods. Eddie On Wed, May 4, 2011 at 4:42 AM, Jens Grivollaj+...@grivolla.net wrote: Hi, what's the current status on combining UIMA and Python? I know that it should be possible to write AEs in Python using either the BSF Annotator (and jython) or Pythonnator (using SWIG). I haven't tried either one yet, so I'm open to recommendations on which to use. I would also very much like to write UIMA (and especially UIMA AS) clients in Python. Is it possible at all to use an annotation pipeline from a language other than Java? We are currently using the simple REST server for this, but it has serious limitations. Lastly, and probably more simply, I would like to be able to work with XMI files using Python. 
There used to be uimapy by Ed Loper, but I can't find a copy anywhere and the SourceForge repository is empty. I found no mention on the mailing list of what happened to the project, and the discussion about it seems to have just ended quite abruptly. Thanks for any suggestions or hints, Jens
Re: Setting up third party libraries for UIMA AS application
Hi, that's basically what we are doing, too. If the PEAR is configured correctly, the CLASSPATH and uima.datapath should appear in install.xml and setenv.txt, and you could use those to set your classpath in your executor.bat. You would then avoid having to define path_to_my_third_party_libraries separately. Unfortunately, it is not always possible (in Linux) to just `source setenv.txt` to set the environment variables because it fails on the uima.datapath assignment (I believe it would work with UIMA_DATAPATH). So it is often still necessary to adapt the launch script depending on the component you are working with. I have no idea how it is in Windows. If there are any better suggestions, I'd be interested also. Bye, Jens On 11/16/2011 01:50 PM, Spico Florin wrote: Hello, Jens! In order to solve the problem, I've created a batch script named executor.bat where I've added the following lines: @set UIMA_CLASSPATH=%UIMA_CLASSPATH%;path_to_my_third_party_libraries deployAsyncService.cmd my_deployment_descriptor_for_as.xml I've put this script in a folder my_project/deploy/as/ and then I've set up the path_to_my_third_party_libraries with the relative paths to the lib and bin folders of the project, i.e. @set UIMA_CLASSPATH=%UIMA_CLASSPATH%;../../lib;../../bin The structure of the project after installing the PEAR file will look like this:

installed
|- uima-pipeline
   |- bin
   |- deploy
   |  |- as
   |     |- executor.bat (from here we will execute the script)
   |- lib
   |  |- third_party_library.jar
   |- descriptors
   |- metadata
   |- resources

I don't know if the above is a solution, but it worked for me. Therefore, I have the following question: How can I use the variables set up in install.xml and setenv.txt in my executor.bat script? I look forward to your answer. Thank you. Regards, Florin On Tue, Nov 15, 2011 at 4:21 PM, Jens Grivolla j+...@grivolla.net wrote: On 11/15/2011 02:55 PM, Spico Florin wrote: Hello! I have an UIMA AS application that is using third-party libraries.
I would like to know the following: 1. Where (location) can we add these third-party libraries so that the deployed application is aware of them and does not throw a ClassNotFoundException? A brute-force solution for me was to add them directly to the UIMA AS lib/ folder, but that was just for testing and is not acceptable in production. 2. How can these third-party libraries be set up when generating the PEAR file in such a way that deploying the application will take them into account, so it won't be necessary to manually add them to the classpath? UIMA AS doesn't directly support PEAR files. You will have to install the PEAR and set the classpath when you deploy it to UIMA AS. Where to put libraries so they will be correctly referenced in the PEAR (i.e. they are included in install.xml and setenv.txt) depends on how you build the PEAR. You may need to include the libraries in your Eclipse build path, or put them in a directory that your Maven configuration includes when building the PEAR. HTH, Jens
Re: Setting up third party libraries for UIMA AS application
On 11/15/2011 02:55 PM, Spico Florin wrote: Hello! I have an UIMA AS application that is using third-party libraries. I would like to know the following: 1. Where (location) can we add these third-party libraries so that the deployed application is aware of them and does not throw a ClassNotFoundException? A brute-force solution for me was to add them directly to the UIMA AS lib/ folder, but that was just for testing and is not acceptable in production. 2. How can these third-party libraries be set up when generating the PEAR file in such a way that deploying the application will take them into account, so it won't be necessary to manually add them to the classpath? UIMA AS doesn't directly support PEAR files. You will have to install the PEAR and set the classpath when you deploy it to UIMA AS. Where to put libraries so they will be correctly referenced in the PEAR (i.e. they are included in install.xml and setenv.txt) depends on how you build the PEAR. You may need to include the libraries in your Eclipse build path, or put them in a directory that your Maven configuration includes when building the PEAR. HTH, Jens
Re: Running Collection Processor Engine as Rest Web Service
I'm not sure how you would want to expose that functionality. Since input and output would be done through the API, those are basically your Reader and your Consumer. How would you expose other CollectionReaders and CasConsumers as a web service? AAEs are obviously no problem, since they are nothing but a specific type of Analysis Engine. Bye, Jens On 11/02/2011 04:27 PM, Spico Florin wrote: Hello! I would like to know if it is possible in UIMA to run a CPE as a REST Web Service? I've read that you can expose the Analysis Engine (AE). I'm not sure if CASReader, CASConsumer, Aggregate Analysis Engine, or CPE can also be exposed as a REST Web Service. Can you please provide some example of how to do this? I look forward to your answers. Thank you. Regards, Florin
Re: PEAR packaging and maven
On 05/26/2011 08:37 PM, Greg Holmberg wrote: [...] What I want may simply be outside the design target of PEAR files. My expectations of PEAR files were based on how other archive formats in Java work. JAR files, WAR files, etc. These can all be used in-place, without any re-writing of their contents. You can just refer to them, and the system can locate the things in them at run-time through relative paths, regardless of what directory they've been dropped into. In other words, there's no installation process for JAR files or WAR files. [...] At least in Tomcat, WAR files actually get unzipped before use, and the UIMA SimpleRestServer also installs PEAR files on the fly. Using SimpleRestServer you actually have one WAR file which contains a PEAR file, and when you deploy it in Tomcat, it automatically unzips the WAR and then (I believe on first call to the service) installs the PEAR. The descriptor used in the WAR references the PEAR file directly. I believe that WARs only get installed when they haven't been installed yet, and I would hope the same is true for the PEAR installation, so there's no overhead from installing on each run of the pipeline. It still rewrites $main_root, etc., but it gets pretty close to a transparent use of PEARs. At least for UIMA-AS a similar deployment scheme would make sense, and one could substitute "install PEAR, adjust classpath, deploy pointing to the PEAR descriptor" with simply "deploy pointing to the PEAR file", which would be much nicer, as we could skip all the launch scripts we are currently creating. For other uses one might have to think more about automatically cleaning up the unzipped directory, etc. Bye, Jens
Re: Cas Editor: group annotation types by namespace?
On 05/10/2011 10:13 AM, Richard Eckart de Castilho wrote: [package names vs. type hierarchy] For a technically-oriented user, the package names are probably better. But for a linguist or knowledge-engineer, I am pretty sure that the inheritance hierarchy is more interesting. One dives down to the particular level at which one can still make a distinction and then stops. Yes, I guess that depends how you use the inheritance. In our case we go from the general UIMA Annotation to our own generic Annotation type that adds a few features, then the generic manual Annotation with features specific to human annotations, etc. So the inheritance is purely technical and implies no semantic hierarchy, and it makes no sense at all to a human annotator to go through all those levels that are completely meaningless to them. I think it would be good to offer both approaches, maybe on different key-bindings and/or different sub menus reachable from the context menu. I agree that maintaining the old behaviour for your use case makes sense, so we would need either two menus or a project-wide preference. Jens
Cas Editor: group annotation types by namespace?
Hi, I was wondering if it wouldn't be more useful to group annotation types in the Mode and similar menus by namespace rather than by inheritance. I don't think most users care much about supertypes, and mostly don't know about them, whereas the namespace seems to me to be a more natural way to organize the menus. I think a flat top level with all used namespaces would work quite well, and the submenu with the annotation type names would not need to include the prefix. What do you think? Jens
Re: Cas Editor: selecting annotation type
On 05/05/2011 09:30 PM, Jörn Kottmann wrote: On 5/5/11 6:09 PM, Jens Grivolla wrote: On 05/05/2011 03:04 PM, Jörn Kottmann wrote: On 5/5/11 2:41 PM, Jens Grivolla wrote: At least on my system (Eclipse Helios on Ubuntu 10.10) the Shift+Enter shortcut does not work, and will be treated as an unmodified Enter, i.e. no selection list appears. I haven't tried yet on other systems because I need to install the updated plugins first. Ok, I will investigate that. But then this was not the system where you experienced the hang issue in the 2.3.1 version? As you said, the freeze was due to the shortcut creation when the type system is too big, and it occurred on all machines. I sometimes have to press return twice to get a quick annotation, too, and on a different machine (Eclipse Helios on Windows XP) it worked even less, to the point that I had to use the context menu. I opened a JIRA for the shortcut issue and fixed it; it would be nice if you could test. I believe the issue is related to a recently defined command and key binding in the plugin.xml. I also now did this for the quick type selection dialog shortcut. Here is the JIRA: https://issues.apache.org/jira/browse/UIMA-2139 Shift+Enter now seems to work reliably. Plain Enter works when I select a word via double click, but has problems when I select a text span (on my Linux machine). Shift+Enter works in that case, and plain Enter works after pressing Shift+Enter or just pressing any other key, e.g. Shift or Ctrl. On that machine the edit view was having problems, too, and I usually had to click on the feature name before being able to activate the feature value field. I haven't tried Shift+Enter on that machine. Did you run the current trunk on that machine? If so, it would be nice if you could give me further details about the edit view issues. What type had the feature you clicked on? Are there exceptions in the error log? Yes, that was running trunk with yesterday morning's fixes.
Unfortunately, I don't have access to that machine anymore and can't give you any more details at this point. We do have some other Windows machines though, and I will look if I find anything in the error logs both on Linux and Windows. [..] Which brings me to another thing that would be interesting for us: having preset feature values filled in automatically. We would be using that to automatically fill in the annotator's name on all annotations created by them. This you can easily do when you pre-process the files you pass to the annotator, or post-process when he gives them back. I've been thinking about that option. It would be quite easy at the document level, but becomes a bit more complicated when each annotation can come from a different annotator and files get passed from one annotator to the next. For one project I created a small plugin which just defined a view for something similar. It's actually not difficult to access the CAS and update it through the Annotation Editor. We're currently thinking of just post-processing the XMIs and adding the annotator name to all annotations (of the types of interest) that don't have a name set yet. We'll look into doing something more sophisticated for the next round of annotations. Thanks a lot for your help, Jens
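The post-processing idea described above can be sketched with nothing more than the Python standard library: walk an XMI file and fill in an annotator feature on every annotation of a given type that does not have one set yet. The namespace URI, type name, and `annotatorName` feature below are hypothetical placeholders for whatever the actual type system defines.

```python
# Sketch (under assumed type/feature names): add a default annotator
# value to annotations that don't carry one yet in an XMI file.
import xml.etree.ElementTree as ET

# Hypothetical namespace and type of the annotations to patch.
NS = "http:///com/example/types.ecore"
TYPE_TAG = "{%s}MyAnnotation" % NS

def add_default_annotator(xmi_in, xmi_out, annotator):
    # Keep a stable prefix for our namespace when serializing again.
    ET.register_namespace("example", NS)
    tree = ET.parse(xmi_in)
    for elem in tree.getroot().iter(TYPE_TAG):
        # Only fill in the feature when it is not set yet, so existing
        # annotator names survive the round trip.
        if "annotatorName" not in elem.attrib:
            elem.set("annotatorName", annotator)
    tree.write(xmi_out, xml_declaration=True, encoding="UTF-8")
```

One caveat: ElementTree may rewrite namespace prefixes on output; since XML namespaces are matched by URI rather than prefix, that should not affect deserialization, but it is worth diffing one round-tripped file to be sure.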
Re: Cas Editor: selecting annotation type
On 05/04/2011 02:44 PM, Jörn Kottmann wrote: On 5/4/11 2:33 PM, Jens Grivolla wrote: How do I best update to the trunk version? You can either build the trunk version yourself or pick up a distribution from our build server. I've got a local build based on trunk. I am not sure what is the best way to update, or what will happen if an old and newer version is installed into the same eclipse installation. I would try to put the new eclipse plugins into the dropins folder, and then see if they get loaded instead, if not I suggest that you remove the plugins installed via Install new software It seems to have picked it up ok, but I'm getting errors when opening an XMI with the Annotation editor: Caused by: org.eclipse.core.internal.resources.ResourceException: Resource '/OneOfMyClosedProjects' is not open. at org.eclipse.core.internal.resources.Project.checkAccessible(Project.java:137) at org.eclipse.core.internal.resources.Project.hasNature(Project.java:511) at org.apache.uima.caseditor.CasEditorPlugin.start(CasEditorPlugin.java:90) Apparently the migration from CasEditorProjects to the new way fails whenever there is a closed project in the workspace. I don't have any projects that need to be migrated, but it tries to check every project I have and fails hard when it can't. It would be good if somebody could verify that before filing a bug report. Bye, Jens
Re: Cas Editor: selecting annotation type
On 05/05/2011 12:37 PM, Jens Grivolla wrote: I'm getting errors when opening an XMI with the Annotation editor:

Caused by: org.eclipse.core.internal.resources.ResourceException: Resource '/OneOfMyClosedProjects' is not open.
  at org.eclipse.core.internal.resources.Project.checkAccessible(Project.java:137)
  at org.eclipse.core.internal.resources.Project.hasNature(Project.java:511)
  at org.apache.uima.caseditor.CasEditorPlugin.start(CasEditorPlugin.java:90)

Apparently the migration from CasEditorProjects to the new way fails whenever there is a closed project in the workspace. I don't have any projects that need to be migrated, but it tries to check every project I have and fails hard when it can't. It would be good if somebody could verify that before filing a bug report. It works fine after I removed the check.

- if (project.hasNature(org.apache.uima.caseditor.NLPProject)) {
+ if (false) {

Bye, Jens
Re: Cas Editor: selecting annotation type
On 05/05/2011 12:59 PM, Jörn Kottmann wrote: On 5/5/11 12:55 PM, Jens Grivolla wrote: On 05/05/2011 12:37 PM, Jens Grivolla wrote: I'm getting errors when opening an XMI with the Annotation editor:

Caused by: org.eclipse.core.internal.resources.ResourceException: Resource '/OneOfMyClosedProjects' is not open.
  at org.eclipse.core.internal.resources.Project.checkAccessible(Project.java:137)
  at org.eclipse.core.internal.resources.Project.hasNature(Project.java:511)
  at org.apache.uima.caseditor.CasEditorPlugin.start(CasEditorPlugin.java:90)

Apparently the migration from CasEditorProjects to the new way fails whenever there is a closed project in the workspace. I don't have any projects that need to be migrated, but it tries to check every project I have and fails hard when it can't. It would be good if somebody could verify that before filing a bug report. It works fine after I removed the check.

- if (project.hasNature(org.apache.uima.caseditor.NLPProject)) {
+ if (false) {

Yes, but that would disable the migration code. The fix is now:

if (project.isOpen() && project.hasNature(org.apache.uima.caseditor.NLPProject))

Of course, it was just to get it working as soon as possible. I recompiled with your fix and reinstalled the plugins, and I see no problems. Thanks, Jens
Re: Cas Editor: selecting annotation type
On 05/05/2011 03:04 PM, Jörn Kottmann wrote: On 5/5/11 2:41 PM, Jens Grivolla wrote: On 05/05/2011 01:55 PM, Jörn Kottmann wrote: On 5/5/11 1:44 PM, Jörn Kottmann wrote: That sounds like one more good reason to do that. Another one I thought of is that it is confusing when you add an annotation which you cannot see afterward. So lets open a jira and do this enhancement. Here is the jira: https://issues.apache.org/jira/browse/UIMA-2137 Do you think this dialog fixes the problem you reported initially with the editor annotation mode? Yes, I think that would work quite well for us. One issue with setting the shortcuts based on the full type system is that in our case at hand some of the annotation types we need don't get assigned a shortcut. Nice, I will try to fix this quickly for you. Thanks, that's great. I think that could be a significant time saver. At least on my system (Eclipse Helios on Ubuntu 10.10) the Shift+Enter shortcut does not work, and will be treated as an unmodified Enter, i.e. no selection list appears. I haven't tried yet on other systems because I need to install the updated plugins first. Ok, I will investigate that. But then this was not the system where you experienced the hang issue in the 2.3.1 version? As you said, the freeze was due to the shortcut creation when the type system is too big, and it occurred on all machines. I sometimes have to press return twice to get a quick annotation, too, and on a different machine (Eclipse Helios on Windows XP) it worked even less, to the point that I had to use the context menu. On that machine the edit view was having problems, too, and I usually had to click on the feature name before being able to activate the feature value field. I haven't tried Shift+Enter on that machine. I still think it would be nice to be able to change the mode from the Outline view, but that feature would definitely have much lower priority then. 
Yes, I also believe that could be a good place to have it, please open a jira issue for it. done: https://issues.apache.org/jira/browse/UIMA-2138 Do you also need to fill in feature values for each created annotation? Yes, for many of them we do. Which brings me to another thing that would be interesting for us: having preset feature values filled in automatically. We would be using that to automatically fill in the annotator's name on all annotations created by them. This you can easily do when you pre-process the files you pass to the annotator, or post-process when he gives them back. I've been thinking about that option. It would be quite easy at the document level, but becomes a bit more complicated when each annotation can come from a different annotator and files get passed from one annotator to the next. I believe we should start working here on tooling support for annotation projects. There you typically have a collection of documents which must be annotated by a team of annotators. Yes, I think our situation is probably quite typical really. Thanks, Jens
UIMA Python integration?
Hi, what's the current status on combining UIMA and Python? I know that it should be possible to write AEs in Python using either the BSF Annotator (and Jython) or Pythonnator (using SWIG). I haven't tried either one yet, so I'm open to recommendations on which to use. I would also very much like to write UIMA (and especially UIMA AS) clients in Python. Is it possible at all to use an annotation pipeline from a language other than Java? We are currently using the simple REST server for this, but it has serious limitations. Lastly, and probably more simply, I would like to be able to work with XMI files using Python. There used to be uimapy by Ed Loper, but I can't find a copy anywhere and the sourceforge repository is empty. I found no mention on the mailing list of what happened to the project and the discussion about it seems to have ended quite abruptly. Thanks for any suggestions or hints, Jens
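Absent a maintained Python library, the last point (reading XMI files from Python) can be approximated with the standard library alone, since an XMI CAS is plain XML: pull out the sofa text and the begin/end offsets of annotations of a given type. This is a minimal sketch; the `example` annotation namespace is an illustrative assumption, while `http:///uima/cas.ecore` is the namespace UIMA uses for the `cas:Sofa` element.

```python
# Minimal sketch: read document text and annotation spans from an XMI
# CAS file using only xml.etree. No type system validation is done.
import xml.etree.ElementTree as ET

CAS_NS = "{http:///uima/cas.ecore}"

def read_xmi(path, type_tag):
    """Return (document_text, [(begin, end, covered_text), ...])."""
    root = ET.parse(path).getroot()
    # The document text lives in the sofaString attribute of cas:Sofa.
    sofa = root.find(CAS_NS + "Sofa")
    text = sofa.get("sofaString", "")
    spans = []
    for ann in root.iter(type_tag):
        b, e = int(ann.get("begin")), int(ann.get("end"))
        spans.append((b, e, text[b:e]))
    return text, spans
```

For example, `read_xmi("doc.xmi", "{http:///com/example/types.ecore}Token")` would return the sofa text plus every Token span with its covered text. CASes with multiple sofas or non-annotation feature structures would need more care.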
Cas Editor: selecting annotation type
Hi, I have recently started using the Annotation Editor (as installed in Eclipse from http://www.apache.org/dist/uima/eclipse-update-site/, i.e. the official 2.3.1 version). In order to add annotations it seems that you need to select the annotation type through the Mode context menu, which is quite time consuming (and error prone) if you have a large type system, and especially when the wanted type is derived through several levels of supertypes. Given that you already select the types of interest through the Annotation Styles configuration, it would be much faster to e.g. select your annotation mode directly from the Outline view (which only contains your chosen subset). It seems that there are quite a few changes in the trunk, but I'm not sure how to best use those versions, preferably without messing up my Eclipse configuration (which is a bit fragile when using manually installed plugins). Thanks, Jens
Re: Cas Editor: selecting annotation type
On 05/04/2011 11:21 AM, Jörn Kottmann wrote: On 5/4/11 11:10 AM, Jens Grivolla wrote: In order to add annotations it seems that you need to select the annotation type through the Mode context menu, which is quite time consuming (and error prone) if you have a large type system, and especially when the wanted type is derived through several levels of supertypes. You do not need to switch via the Mode context menu to add an annotation of the desired type. The Mode type is just the type you can annotate with the fewest key strokes. You can use Shift + Enter to annotate a piece of text and then choose the annotation type from a list of available types in a pop up. Each type in this list is combined with a key short cut. When you remember the short cut you can do something like this Shift + Enter + p to create an annotation. Where p is the letter written in front of one of your annotations. Does that help you? Unfortunately this consistently freezes Eclipse every time I have tried it, so I haven't even been able to see what it is supposed to do. The keyboard shortcuts might help, if it worked. We've tried it on several versions of Eclipse (all on Linux), and all freeze completely when pressing Shift-Return or clicking on the corresponding menu item. I will have a look at the outline view, maybe we can add there a button or context menu to switch the mode of the editor. It seems that there are quite a few changes in the trunk, but I'm not sure how to best use those versions, preferably without messing up my Eclipse configuration (which is a bit fragile when using manually installed plugins). We fixed a few bugs and removed the Cas Editor Project support. I suggest that you just create a normal eclipse project and then place a type system at the default location. That's what I'm doing, but with the 2.3.1 release installed via Install new software... in Eclipse. How do I best update to the trunk version? Thanks, Jens
CR+LF = 1 character?
Hi, while working on the integration between UIMA and a different text annotation system we ran into problems with differing offsets between the two systems. As it turns out, the other system considers CR+LF (Windows style line endings) to be two characters, while UIMA sees it as one. Clearly, CR+LF are two bytes in one-byte-per-character encodings (ASCII, Latin-1, ...) so all systems based on those encodings will see it as two characters, and I believe it is also represented as two Unicode characters. In a way it makes sense to consider a newline as one character, independently of how it is represented, so I think the UIMA way is fine. But is there an overview somewhere of how different systems and programming languages handle this, e.g. when extracting substrings, etc.? Given the mess that this can be it's probably best to normalize all text at the beginning to only deal with Unicode strings with LF endings, encoded with UTF-8 when writing to disk or otherwise serializing the data. It would still be interesting to know how painful this can get when not normalizing, and e.g. passing data between UIMA (Java), NLTK (Python), our own C#-based system, etc. Thanks, Jens
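The normalization suggested in that message is a one-liner in most languages: collapse CR+LF (and bare CR) to LF before any component computes offsets, so that every downstream system counts the same characters. A Python sketch:

```python
# Normalize line endings to LF before offsets are ever computed.
# Order matters: replace CR+LF first, then any remaining bare CR.
def normalize_newlines(text):
    return text.replace("\r\n", "\n").replace("\r", "\n")

# In a Python (Unicode) string, CR+LF is two characters:
assert len("a\r\nb") == 4
assert len(normalize_newlines("a\r\nb")) == 3
```

Doing this once at ingestion, and serializing as UTF-8 with LF endings as proposed above, sidesteps the offset mismatches entirely; only components that must reproduce the original bytes need to keep the raw text around.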
Re: status of uimacpp?
Thanks Bhavani, I think I will just stay with the 2.3.0-incubating uimacpp for now then. Jens On 04/05/2011 10:16 PM, Bhavani Iyer wrote: Hi Jens The 2.3.0-incubating uimacpp will work with the 2.3.1 releases of the uimaj and uima-as. It should work with the ActiveMQ broker 5.4.1 included in the 2.3.1 release of uima-as. We are working on a new release of UIMACPP. The ActiveMQ service deployment wrapper was migrated to ActiveMQ 3.2 in order to support failover protocol - https://issues.apache.org/jira/browse/UIMA-1925. This also requires moving to APR version 1.3.x. In addition, there are changes to the build process on Linux as described here: https://issues.apache.org/jira/browse/UIMA-2053. Regards, Bhavani On 4/5/11, Jens Grivolla <j+...@grivolla.net> wrote: Hi, what's the current status of UIMA-CPP? While uimaj and uima-as have been released as 2.3.1, uimacpp hasn't and I haven't read of any plans to release 2.3.1 so far. Does 2.3.0-incubating uimacpp work with the 2.3.1 versions of uimaj and uima-as, or should I better build it from trunk? How about mixing 2.3.0 AS components with a 2.3.1 broker, etc.? Thanks, Jens
status of uimacpp?
Hi, what's the current status of UIMA-CPP? While uimaj and uima-as have been released as 2.3.1, uimacpp hasn't and I haven't read of any plans to release 2.3.1 so far. Does 2.3.0-incubating uimacpp work with the 2.3.1 versions of uimaj and uima-as, or should I better build it from trunk? How about mixing 2.3.0 AS components with a 2.3.1 broker, etc.? Thanks, Jens
runPearMerger on already merged PEARs
It seems that runPearMerger.sh does not correctly adjust the paths when the input PEARs are already a merge. On first run `runPearMerger.sh ae1.pear ae2.pear -n ae12` the paths to resources get adjusted from $main_root/X to $main_root/ae1/X or $main_root/ae2/X respectively. However, on subsequent `runPearMerger.sh ae12.pear ae3.pear -n ae123` only the top level paths appear to get adjusted. Should I file an issue, or is that a known and accepted limitation of PearMerger? Thanks, Jens