Re: run existing AE instance on different view

2018-07-16 Thread Jens Grivolla
Hi Marshall,

as far as I can tell all the mapping methods described there need to be
applied *before* instantiating an AE. The problem is that while I can use
CAS.getView(...) or JCas.getView(...) to access the desired view I find no
way to call the process() method of an existing AE instance on it.

One application we have right now is having a pretty memory-heavy pipeline
loaded into memory that we need to apply to texts from different sources
(typically as a web service). Depending on the source we may need to first
apply translation, cleanup, etc., all of which create new views on which to
operate. We are not using CPE or any other "standard" execution engine but
rather create the initial JCas from the incoming text and then apply
aggregate engines (using their process() method) to that JCas as needed.
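To make that more concrete, here is roughly what we do and what we would
like to be able to do (minimal sketch using uimaFIT; names are made up):

    // initial JCas created from the incoming text
    JCas jcas = JCasFactory.createJCas();
    jcas.setDocumentText(incomingText);
    // depending on the source we create additional views, e.g. a translation
    JCas translated = jcas.createView("translation");
    translated.setDocumentText(translatedText);
    // this is what we would like, but process() always goes back to _InitialView
    existingAggregate.process(translated);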

Best,
Jens

On Mon, Jul 9, 2018 at 10:46 PM, Marshall Schor  wrote:

> Hi,
>
> Is anything in
> https://uima.apache.org/d/uimaj-2.10.2/tutorials_and_
> users_guides.html#ugr.tug.mvs.name_mapping_application
> helpful?
>
> If not, could you add some details that says why not?
>
> -Marshall
>
>
> On 7/5/2018 8:52 AM, Jens Grivolla wrote:
> > Hi,
> >
> > I'm trying to run an already instantiated AE on a view other than
> > _InitialView. Unfortunately, I can't just call process() on the desired
> > view, as there is a call to Util.getStartingView(...)
> > in PrimitiveAnalysisEngine_impl that forces it back to _InitialView.
> >
> > The view mapping methods I found (e.g. using an AggregateBuilder) work
> on
> > AE descriptions, so I would need to create additional instances (with the
> > corresponding memory overhead). Is there a way to remap/rename the views
> in
> > a JCas before calling process() so that the desired view is seen as the
> > _InitialView? It looks like CasCopier.copyCasView(..) could maybe be used
> > for this, but it doesn't feel quite right.
> >
> > Best,
> > Jens
> >
>
>


Re: run existing AE instance on different view

2018-07-16 Thread Jens Grivolla
Hi Eddie,

unfortunately for the most part we can't (easily) change the AEs to make
them SofA-aware (many of them come from DKPro).

If no better solutions come up, I guess we will go with copying the view to
be processed so it is always accessible the same way (either as
_InitialView or with a different name that we always statically map to
_InitialView).
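Roughly something like this (just an untested sketch, assuming CasCopier's
copyCasView() can copy a view between two CASes this way):

    // fresh CAS that the existing AE will see the usual way
    CAS workCas = ae.newCAS();
    CasCopier copier = new CasCopier(sourceCas, workCas);
    // copy the view we actually want processed into the target's _InitialView
    copier.copyCasView(sourceCas.getView("translation"),
        workCas.getView(CAS.NAME_DEFAULT_SOFA), true);
    ae.process(workCas);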

Thanks,
Jens

On Tue, Jul 10, 2018 at 3:58 PM, Eddie Epstein  wrote:

> I think the UIMA code uses the annotator context to map the _InitialView
> and the context remains static for the life of the annotator. Replicating
> annotators to handle different views has been used here too, but agree it
> is ugly.
>
> If the annotator code can be changed, then one approach would be to put
> some information in a fixed _InitialView that specifies which named view(s)
> should be analyzed and have all downstream annotators use that to select
> the view(s) to operate on.
>
> It also sounds possible to have a single new component use the CasCopier to
> create a new view that is always the one processed.
>
> Regards,
> Eddie
>
> On Thu, Jul 5, 2018 at 8:52 AM, Jens Grivolla  wrote:
>
> > Hi,
> >
> > I'm trying to run an already instantiated AE on a view other than
> > _InitialView. Unfortunately, I can't just call process() on the desired
> > view, as there is a call to Util.getStartingView(...)
> > in PrimitiveAnalysisEngine_impl that forces it back to _InitialView.
> >
> > The view mapping methods I found (e.g. using an AggregateBuilder) work
> on
> > AE descriptions, so I would need to create additional instances (with the
> > corresponding memory overhead). Is there a way to remap/rename the views
> in
> > a JCas before calling process() so that the desired view is seen as the
> > _InitialView? It looks like CasCopier.copyCasView(..) could maybe be used
> > for this, but it doesn't feel quite right.
> >
> > Best,
> > Jens
> >
>


run existing AE instance on different view

2018-07-05 Thread Jens Grivolla
Hi,

I'm trying to run an already instantiated AE on a view other than
_InitialView. Unfortunately, I can't just call process() on the desired
view, as there is a call to Util.getStartingView(...)
in PrimitiveAnalysisEngine_impl that forces it back to _InitialView.

The view mapping methods I found (e.g. using an AggregateBuilder) work on
AE descriptions, so I would need to create additional instances (with the
corresponding memory overhead). Is there a way to remap/rename the views in
a JCas before calling process() so that the desired view is seen as the
_InitialView? It looks like CasCopier.copyCasView(..) could maybe be used
for this, but it doesn't feel quite right.

Best,
Jens


Re: Run an analysis engine after processing document collection?

2017-12-23 Thread Jens Grivolla
Hi Ben,

if I understand correctly, you want to run a process once the whole
collection has been analyzed. You can have an AnalysisEngine that does this
by implementing collectionProcessComplete():
http://uima.apache.org/d/uimaj-2.10.0/apidocs/org/apache/uima/analysis_engine/AnalysisEngine.html#collectionProcessComplete()

You just need to make sure that you gather all the necessary information
somehow. If the AE that calculates the statistics is at the end of the
pipeline and you have only one instance of it, it's easy to gather all the
information there. Alternatively, you could write everything you need to a
centralized datastore (e.g. a database) and use that to calculate the
statistics.
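
A minimal sketch of the first option (uimaFIT; "Token" stands in for
whatever token type your pipeline produces, and the output is just a
placeholder):

    import java.util.*;
    import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
    import org.apache.uima.fit.util.JCasUtil;
    import org.apache.uima.jcas.JCas;
    // plus the Token type your pipeline uses

    public class CorpusStatsCollector extends JCasAnnotator_ImplBase {
      private final Map<String, Integer> docFreq = new HashMap<>();
      private int numDocs = 0;

      @Override
      public void process(JCas jcas) {
        numDocs++;
        // count each term at most once per document
        Set<String> seen = new HashSet<>();
        for (Token t : JCasUtil.select(jcas, Token.class)) {
          seen.add(t.getCoveredText());
        }
        for (String term : seen) {
          docFreq.merge(term, 1, Integer::sum);
        }
      }

      @Override
      public void collectionProcessComplete() {
        // called once after the whole collection has been processed
        docFreq.forEach((term, df) ->
            System.out.printf("%s idf=%.3f%n", term,
                Math.log((double) numDocs / df)));
      }
    }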

Unless I misunderstood you, that's really quite a common scenario.

Best,
Jens

On Fri, Dec 22, 2017 at 6:26 PM, Benedict Holland <
benedict.m.holl...@gmail.com> wrote:

> Hello All,
>
> I find myself in a strange situation. I have a content processing engine
> working. I have N threads populating N CAS objects and running my pipeline.
> Each CAS object gets 1 piece of data, like say a row in a database. Each
> process is entirely independent and can run concurrently. I specifically
> did not configure this pipeline as an aggregate process as I don't really
> care when the events trigger since the CPE maintains the order of the
> engines.
>
> Now I want to add an analysis that will run over the aggregate output. For
> example, I processed N texts using the CPE and now I want to run a TF-IDF
> analysis over the entire corpora. The TF-IDF analysis should only run once
> all documents are processed.
>
> How would I go about doing this? Does this have to do with not allowing
> multiple deployments?
>
> Thanks,
> ~Ben
>


Re: Parameters for PEAR

2017-12-13 Thread Jens Grivolla
Is there a specific reason to use PEARs?

As far as I remember (but I could be wrong, it's been a few years), the
main advantages of using them (automatic classpath configuration, some
degree of isolation between components) were lost once we wanted to change
configuration parameters, because then we needed to use the AE descriptor
instead of the PEAR descriptor (at least with the CPE). If you're not going
to use the PEAR descriptor, an installed PEAR is not much more than a bunch
of JARs plus component descriptors full of hard-coded absolute file paths,
so you should be able to just use and configure a component based on those
descriptors (without anything PEAR-specific).
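
For example (untested sketch; the paths and the parameter name are made
up), you can load the installed PEAR's main component descriptor directly
and override parameters before producing the engine:

    XMLInputSource in = new XMLInputSource(
        "installedPears/MyPear/desc/MyEngineDescriptor.xml");
    AnalysisEngineDescription desc =
        UIMAFramework.getXMLParser().parseAnalysisEngineDescription(in);
    desc.getAnalysisEngineMetaData().getConfigurationParameterSettings()
        .setParameterValue("modelPath", "/data/models/current");
    AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(desc);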

We have since switched to doing everything with uimaFIT, which gives you
many possibilities to adapt your workflow, configure engines
programmatically, etc. For us the change has been hugely positive, both for
development (and debugging) and for deployment in a wide variety of ways
and environments.
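
With uimaFIT the same thing boils down to something like this (annotator
class and parameter name are made up):

    AnalysisEngineDescription desc = AnalysisEngineFactory.createEngineDescription(
        MyAnnotator.class,
        MyAnnotator.PARAM_MODEL_PATH, "/data/models/current");
    AnalysisEngine ae = AnalysisEngineFactory.createEngine(desc);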

Best,
Jens

On Tue, Dec 12, 2017 at 8:39 AM, Matthias Koch 
wrote:

> Hi,
>
> I want to configure a PEAR dynamically. (I install the pear and want to
> produce the analysis engine with different parameters than in the xml).
> Is this possible? Can I use the additionalParameters? I have seen that the
> PearSpecifier has an instance variable for parameters, but no one is using
> (calling) it.
>
> I want to produce the analysisEngine with: 
> UIMAFramework.produceAnalysisEngine(resourceSpecifer,
> resourceManager, params);
>
> In this specifier there should be one or more pearSpecifiers that should
> be configured.
>
> I have overridden the PearAnalysisEngineWrapper and built a loop that
> configures the following specifier over the configurationParameterSettings.
> It takes the parameters from the pear specifiers.
>
> line 257-258
> // Parse the resource specifier
> ResourceSpecifier specifier = UIMAFramework.getXMLParser().p
> arseResourceSpecifier(in);
>
> ==> added code
> AnalysisEngineDescription analysisEngineDescription =
> (AnalysisEngineDescription) specifier;
> AnalysisEngineMetaData analysisEngineMetaData =
> analysisEngineDescription.getAnalysisEngineMetaData();
> ConfigurationParameterSettings configurationParameterSettings =
> analysisEngineMetaData.getConfigurationParameterSettings();
> for (Parameter parameter : Arrays.asList(pearSpec.getParameters())) {
>
> configurationParameterSettings.setParameterValue(parameter.getName(),
> parameter.getValue());
> }
>
> Is it possible without overriding anything?
>
> UIMAJ Version: 2.10
>
> Sincerely
> Matthias
>
> --
> Matthias Koch
>
> Averbis GmbH
> Tennenbacher Str. 11
> 79106 Freiburg
> Germany
>
> Fon: +49 761 708 394 0
> Fax: +49 761 708 394 10
> Email:matthias.k...@averbis.com
> Web:https://averbis.com
>
> Headquarters: Freiburg im Breisgau
> Register Court: Amtsgericht Freiburg im Breisgau, HRB 701080
> Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
>
>


Re: General question about UimaFIT

2016-09-09 Thread Jens Grivolla
And I guess you don't get JCas classes for your type system without going
through JCasGen, which is another disadvantage of generating the types on
the fly. It also goes somewhat against the idea that the type system should
be something you can rely on for communication between components, so it
would tend to be static.
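
Without JCas classes you are limited to the generic CAS API for types
defined on the fly, e.g. (using the Token/length example quoted below):

    Type tokenType = cas.getTypeSystem().getType("Token");
    Feature lengthFeat = tokenType.getFeatureByBaseName("length");
    for (AnnotationFS token : cas.getAnnotationIndex(tokenType)) {
      int length = token.getIntValue(lengthFeat);
    }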

Just out of curiosity, what's the use case for this (except maybe unit
testing as Armin mentioned)?

Best,
Jens

On Fri, Sep 9, 2016 at 4:31 PM, Richard Eckart de Castilho 
wrote:

> On 09.09.2016, at 13:39, Asher Stern  wrote:
> >
> > Hi Armin.
> > Thanks for your quick answer!
> >
> > While the workaround is indeed helpful, I am still curious why there is no
> > regular mechanism to define new types and create new descriptors
> > programmatically, much like all other UIMA components?
>
> Sure you can define types programmatically... it's just that for the
> case of types, defining them through XML is actually more convenient.
> Mind that the type-system is implementation independent! You can think
> of it as a DTD or XSD.
>
> If you want to programmatically create a type, you can do this:
>
>   TypeSystemDescription tsd = new TypeSystemDescription_impl();
>   TypeDescription tokenTypeDesc = tsd.addType("Token", "",
> CAS.TYPE_NAME_ANNOTATION);
>   tokenTypeDesc.addFeature("length", "", CAS.TYPE_NAME_INTEGER);
>
>   CAS cas = CasCreationUtils.createCas(tsd, null, null);
>   cas.setDocumentText("This is a test.");
>
> Check out [1] slides 20 following.
>
> Cheers,
>
> -- Richard
>
> [1] https://github.com/dkpro/dkpro-tutorials/blob/master/
> GSCL2013/tags/latest/slides/GSCL2013UIMATutorialUKP.pdf


Re: CPE memory usage

2016-08-29 Thread Jens Grivolla
Hi Armin, glad I could help. Getting all IDs first also avoids problems
with changing data, which could mess with the offsets. This way you have a
fixed snapshot of all the documents that existed at the beginning.

Best,
Jens

On Mon, Aug 29, 2016 at 8:12 AM, <armin.weg...@bka.bund.de> wrote:

> Hi Jens,
>
> I just want to confirm your information. As you said, the query gets
> slower the larger start is, even using filters. The best solution is to get
> all ids first (may take some time), and then to get each document by id
> successively. There is a request handler (get) and a Java API method
> (HttpSolrClient.getById()) to do so.
>
> Thanks to your help, I have constantly fast queries now.
>
> Cheers,
> Armin
>
> -Ursprüngliche Nachricht-
> Von: j...@grivolla.net [mailto:j...@grivolla.net] Im Auftrag von Jens
> Grivolla
> Gesendet: Dienstag, 16. August 2016 13:34
> An: user@uima.apache.org
> Betreff: Re: CPE memory usage
>
> Solr is known not to be very good at deep paging; it is rather geared
> towards returning the top relevant results. Running a query asking for the
> millionth document is pretty much the worst thing you can do, as it has to
> rank all documents again, up to the millionth, and return just that one. It
> can also be unreliable if your document collection changes.
>
> We did get it to work quite well, though. I believe we used only filters
> and retrieved the results in natural order, so that Solr wouldn't have to
> rank the documents. We also had a version where we first retrieved all
> matching document ids in one go, and then queried for the documents by id,
> one by one, in getNext().
>
> Deep paging has also seen some major improvements over time IIRC, so newer
> Solr versions should perform much better than the ones from a few years
> ago.
>
> Best,
> Jens
>
> On Tue, Aug 9, 2016 at 12:20 PM, <armin.weg...@bka.bund.de> wrote:
>
> > Hi!
> >
> > Finally, it looks like Solr causes the high memory consumption. The
> > SolrClient isn't expected to be used the way I did it. But it isn't
> > documented
> > either. The Solr documentation is very bad. I just happened to find a
> > solution on the web by accident.
> >
> > Thanks,
> > Armin
> >
> > -Ursprüngliche Nachricht-
> > Von: Richard Eckart de Castilho [mailto:r...@apache.org]
> > Gesendet: Montag, 8. August 2016 15:33
> > An: user@uima.apache.org
> > Betreff: Re: CPE memory usage
> >
> > Do you have code for a minimal test case?
> >
> > Cheers,
> >
> > -- Richard
> >
> > > On 08.08.2016, at 15:31, <armin.weg...@bka.bund.de> <
> > armin.weg...@bka.bund.de> wrote:
> > >
> > > Hi Richard!
> > >
> > > I've changed the document reader to a kind of no-op-reader, that always
> > sets the document text to an empty string: same behavior, but much slower
> > increase in memory usage.
> > >
> > > Cheers,
> > > Armin
> >
> >
>


Re: CPE memory usage

2016-08-16 Thread Jens Grivolla
Solr is known not to be very good at deep paging; it is rather geared
towards returning the top relevant results. Running a query asking for the
millionth document is pretty much the worst thing you can do, as it has to
rank all documents again, up to the millionth, and return just that one. It
can also be unreliable if your document collection changes.

We did get it to work quite well, though. I believe we used only filters
and retrieved the results in natural order, so that Solr wouldn't have to
rank the documents. We also had a version where we first retrieved all
matching document ids in one go, and then queried for the documents by id,
one by one, in getNext().
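
Roughly like this (SolrJ sketch from memory; "solr" is your SolrClient
instance and the field names are specific to your index):

    // once, at initialization: collect all document ids
    SolrQuery q = new SolrQuery("*:*");
    q.setFields("id");
    q.setRows(Integer.MAX_VALUE);  // or page through the ids in chunks
    List<String> ids = new ArrayList<>();
    for (SolrDocument d : solr.query(q).getResults()) {
      ids.add((String) d.getFieldValue("id"));
    }
    // then, in getNext(): fetch one document at a time by id
    SolrDocument doc = solr.getById(ids.get(nextIndex++));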

Deep paging has also seen some major improvements over time IIRC, so newer
Solr versions should perform much better than the ones from a few years ago.

Best,
Jens

On Tue, Aug 9, 2016 at 12:20 PM,  wrote:

> Hi!
>
> Finally, it looks like Solr causes the high memory consumption. The
> SolrClient isn't expected to be used the way I did it. But it isn't documented
> either. The Solr documentation is very bad. I just happened to find a
> solution on the web by accident.
>
> Thanks,
> Armin
>
> -Ursprüngliche Nachricht-
> Von: Richard Eckart de Castilho [mailto:r...@apache.org]
> Gesendet: Montag, 8. August 2016 15:33
> An: user@uima.apache.org
> Betreff: Re: CPE memory usage
>
> Do you have code for a minimal test case?
>
> Cheers,
>
> -- Richard
>
> > On 08.08.2016, at 15:31,  <
> armin.weg...@bka.bund.de> wrote:
> >
> > Hi Richard!
> >
> > I've changed the document reader to a kind of no-op-reader, that always
> sets the document text to an empty string: same behavior, but much slower
> increase in memory usage.
> >
> > Cheers,
> > Armin
>
>


Re: Selecting all connected annotations by type.

2015-10-26 Thread Jens Grivolla
Ok Richard, I'll look into it, but I don't promise anything at this point
(tons of project deliverables coming up)...

-- Jens

On Fri, Oct 23, 2015 at 2:03 PM, Richard Eckart de Castilho <r...@apache.org>
wrote:

> Hi Jens,
>
> :) don't you want to test and apply it? My next projected time slot for
> uimaFIT is in December.
>
> Best,
>
> -- Richard
>
> > On 23.10.2015, at 11:09, Jens Grivolla <j+...@grivolla.net> wrote:
> >
> > I'd really like to have that functionality also (we'll need to do
> something
> > like that quite soon), so I just voted on the issue...
> >
> > I haven't tested the patch yet. José, have you been using this over the
> > last few months?
> >
> > -- Jens
>
>


Re: Selecting all connected annotations by type.

2015-10-23 Thread Jens Grivolla
I'd really like to have that functionality also (we'll need to do something
like that quite soon), so I just voted on the issue...

I haven't tested the patch yet. José, have you been using this over the
last few months?

-- Jens

On Sun, Feb 1, 2015 at 2:04 AM, José Tomás Atria  wrote:

> Issue created, patch submitted.
>
> https://issues.apache.org/jira/browse/UIMA-4212
>
> On Sat Jan 31 2015 at 3:12:33 AM Richard Eckart de Castilho <
> r...@apache.org>
> wrote:
>
> > Dear José,
> >
> > could you please re-submit the patch via the Apache UIMA issue tracker:
> >
> > Thanks!
> >
> > -- Richard
> >
> > https://issues.apache.org/jira/browse/UIMA
> >
> > On 31.01.2015, at 05:38, José Tomás Atria  wrote:
> >
> > > Please disregard the previous patch, apparently I managed to corrupt it
> > while creating it over ssh.
> > >
> > > The version in this email should be correct, I hope.
> > >
> > > Best,
> > > jta
> >
> >
>


Re: Views or Separate CASes?

2015-08-31 Thread Jens Grivolla
Hi Matt,

As Richard said, Views are designed more for holding "parallel"
information, such as separate layers of audio, transcript, video, etc.
referring to the same content or "document".

I'm not quite sure why you want to split your document for processing
(which you could do with a CAS Multiplier). Wouldn't it be much easier to
just maintain and process it as one document, marking the different
segments with e.g. speaker information? I don't quite see the need for
splitting; your AEs can run on all the segments (and most can be instructed
not to cross segment boundaries, or to only work at the sentence level
anyway).

Of course, if what you want is to be able to search for and retrieve
segments that pertain to different speakers, then you will need to index
your content in something like Solr outside of UIMA. While you could use a
CAS Multiplier and then index each generated CAS as a document, it is much
easier to just have a CasConsumer that knows how to deal with your segment
annotations and extracts the information you want to index in an
appropriate form.
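
Schematically, such a consumer could look like this (just a sketch;
"Segment" and its features stand in for whatever your segment type
actually looks like):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
    import org.apache.uima.fit.util.JCasUtil;
    import org.apache.uima.jcas.JCas;
    // plus your own Segment annotation type

    public class SegmentIndexer extends JCasAnnotator_ImplBase {
      private SolrClient solr;  // set up in initialize(), omitted here

      @Override
      public void process(JCas jcas) throws AnalysisEngineProcessException {
        for (Segment seg : JCasUtil.select(jcas, Segment.class)) {
          // one Solr document per segment annotation
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("speaker", seg.getSpeaker());
          doc.addField("topic", seg.getTopic());
          doc.addField("text", seg.getCoveredText());
          try {
            solr.add(doc);
          } catch (Exception e) {
            throw new AnalysisEngineProcessException(e);
          }
        }
      }
    }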

You may want to look at our project EUMSSI (http://eumssi.eu/) which is
about doing exactly this. You can find our initial design here:
http://www.aclweb.org/anthology/W14-5212 which we presented at the last
UIMA workshop (http://glicom.upf.edu/OIAF4HLT/) and some more documentation
on https://github.com/EUMSSI/EUMSSI-platform/wiki.

The segment indexing is not in there yet, but I expect to put something on
Github in the next one or two weeks.

Best,
Jens

On Wed, Aug 26, 2015 at 4:45 PM, Matthew DeAngelis 
wrote:

> Hello UIMA Gurus,
>
> I am relatively new to UIMA, so please excuse the general nature of my
> question and any butchering of the terminology.
>
> I am attempting to write an application to process transcripts of audio
> files. Each "raw" transcript is in its own HTML file with a section listing
> biographical information for the speakers on the call followed by a number
> of sections containing transcriptions of the discussion of different
> topics. I would like to be able to analyze each speaker's contributions
> separately by topic and then aggregate and compare these analyses between
> speakers and between each speaker and the full text. I was thinking that I
> would break the document into a new segment each time the speaker or the
> section of the document changes (attaching relevant speaker metadata to
> each section), run additional Analysis Engines on each segment (tokenizer,
> etc.), and then arbitrarily recombine the results of the analysis by
> speaker, etc.
>
> Looking through the documentation, I am considering two approaches:
>
> 1. Using a CAS Multiplier. Under this approach, I would follow the example
> in Chapter 7 of the documentation, divide on section and speaker
> demarcations, add metadata to each CAS, run additional AEs on the CASes,
> and then use a multiplier to recombine the many CASes for each document
> (one for the whole transcript, one for each section, one for each speaker,
> etc.). The advantage of this approach is that it seems easy to incorporate
> into a pipeline of AEs, since they are designed to run on each CAS. The
> disadvantage is that it seems unwieldy to have to keep track of all of the
> related CASes per document and aggregate statistics across the CASes.
>
> 2. Use CAS Views. This option is appealing because it seems like CAS Views
> were designed for associating many different aspects of the same document
> with one another. However, it looks to me that I would have to specify
> different views both when parsing the document into sections and when
> passing them through subsequent AEs, which would make it harder to drop
> into an existing pipeline. I may be misunderstanding how subsequent AEs
> work with Views, however.
>
> For those more experience with UIMA, how would you approach this problem?
> It's entirely possible that I am missing a third (fourth, fifth...)
> approach that would work better than either of those above, so any guidance
> would be much appreciated.
>
>
> Regards and thanks,
> Matt
>


Re: Dictionary Matching using Concept Mapper for single word entry.

2015-07-23 Thread Jens Grivolla
Hi Khirod,

could it be that your single-word document doesn't get marked as a
sentence? You have SpanFeatureStructure set to com.naukri.parse.type.Sentence,
so ConceptMapper only works on things that are within a Sentence
annotation. Tokens that are not part of a sentence will not be seen at all.

This has happened to us when working on malformed text where some sentence
segmenters would leave parts of the text unmarked.
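
If that is the case, a simple workaround is to make sure there is always an
enclosing span, e.g. (sketch using uimaFIT, assuming a JCasGen-generated
class for your Sentence type):

    if (JCasUtil.select(jcas, Sentence.class).isEmpty()
        && jcas.getDocumentText().length() > 0) {
      new Sentence(jcas, 0, jcas.getDocumentText().length()).addToIndexes();
    }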

Best,
Jens

On Sun, Jul 19, 2015 at 4:00 PM, Khirod Kant Naik kkantn...@gmail.com
wrote:

 Hi everyone,

 I am unable to match text from dictionary if the enclosing span contains
 only a single token.

 For example - I am trying to match word education from my dictionary and
 for the enclosing span I am using a sentence. So if sentence contains a
 single token then I am not able to match it from dictionary.

 Here is what I have tried,

 When I have a sentence like - Education **something else** then
 conceptMapper matches education.
 While if I have a sentence like - Education then conceptMapper is not
 picking it from dictionary.

 So I have a question: *does ConceptMapper require you to have more
 than 1 TokenAnnotation within the specified SpanFeatureStructure?*

 P.S : This is the descriptor I am using

 <?xml version="1.0" encoding="UTF-8"?>
 <taeDescription xmlns="http://uima.apache.org/resourceSpecifier">
   <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
   <primitive>true</primitive>
   <annotatorImplementationName>org.apache.uima.conceptMapper.ConceptMapper</annotatorImplementationName>
   <analysisEngineMetaData>
     <name>Segment Heading Annotator</name>
     <description/>
     <version>1</version>
     <vendor/>
     <configurationParameters>
       <configurationParameter>
         <name>caseMatch</name>
         <description>this parameter specifies the case folding mode:
           ignoreall - fold everything to lowercase for matching
           insensitive - fold only tokens with initial caps to lowercase
           digitfold - fold all (and only) tokens with a digit
           sensitive - perform no case folding</description>
         <type>String</type>
         <multiValued>false</multiValued>
         <mandatory>true</mandatory>
       </configurationParameter>
       <configurationParameter>
         <name>Stemmer</name>
         <description>Name of stemmer class to use before matching. MUST
           have a zero-parameter constructor! If not specified,
           no stemming will be performed.</description>
         <type>String</type>
         <multiValued>false</multiValued>
         <mandatory>false</mandatory>
       </configurationParameter>
       <configurationParameter>
         <name>ResultingAnnotationName</name>
         <description>Name of the annotation type created by this TAE,
           must match the typeSystemDescription entry</description>
         <type>String</type>
         <multiValued>false</multiValued>
         <mandatory>true</mandatory>
       </configurationParameter>
       <configurationParameter>
         <name>ResultingEnclosingSpanName</name>
         <description>Name of the feature in the resultingAnnotation to
           contain the span that encloses it (i.e. its sentence)</description>
         <type>String</type>
         <multiValued>false</multiValued>
         <mandatory>false</mandatory>
       </configurationParameter>
       <configurationParameter>
         <name>AttributeList</name>
         <description>List of attribute names for XML dictionary entry
           record - must correspond to FeatureList</description>
         <type>String</type>
         <multiValued>true</multiValued>
         <mandatory>true</mandatory>
       </configurationParameter>
       <configurationParameter>
         <name>FeatureList</name>
         <description>List of feature names for CAS annotation - must
           correspond to AttributeList</description>
         <type>String</type>
         <multiValued>true</multiValued>
         <mandatory>true</mandatory>
       </configurationParameter>
       <configurationParameter>
         <name>TokenAnnotation</name>
         <description/>
         <type>String</type>
         <multiValued>false</multiValued>
         <mandatory>true</mandatory>
       </configurationParameter>
       <configurationParameter>
         <name>TokenClassFeatureName</name>
         <description>Name of feature used when doing lookups against
           IncludedTokenClasses and ExcludedTokenClasses</description>
         <type>String</type>
         <multiValued>false</multiValued>
         <mandatory>false</mandatory>
       </configurationParameter>
       <configurationParameter>
         <name>TokenTextFeatureName</name>
         <description/>
         <type>String</type>
         <multiValued>false</multiValued>
         <mandatory>false</mandatory>
       </configurationParameter>
       <configurationParameter>
         <name>SpanFeatureStructure</name>
  

Re: UIMAfit analysis descriptions appear to trim String configuration parameters

2015-06-16 Thread Jens Grivolla
On Mon, Jun 15, 2015 at 8:43 AM, Mario Gazzo mario.ga...@gmail.com wrote:

 I am referring to this Github repo:

 https://github.com/apache/uima-uimafit

 Thought it was published by you as a mirror of the SVN repo or the other
 way around.


Yes, this is the official (one-way) mirror of the SVN repository. If you
want to be able to reference SVN commits you can look at the commit details
on Github:
https://github.com/apache/uima-uimafit/commit/e9b32e30895443b9f93fef65453593dd1533c7d0

There you see:
git-svn-id: https://svn.apache.org/repos/asf/uima/uimafit/trunk@1681410
13f79535-47bb-0310-9956-ffa450edef68

Unfortunately, the link doesn't actually work with the repository browser
at svn.apache.org, but at least the commit id should be correct. The
correspondence between commits in SVN and git is a bit complicated because
there is only one big SVN repository for all of UIMA, whereas there are
separate git repositories for the subprojects. Therefore the commit you
reference is the latest one in the uimaFIT git repository, but there are
newer commits in the UIMA SVN.

HTH,
Jens


Re: Approach for keeping track of formatting associated with text views

2015-03-12 Thread Jens Grivolla
Hi Peter, while I don't think I will be using the HtmlConverter right away,
I would vote for using the length of the document annotation for
annotations that relate to the whole document (such as metadata). That
makes them show up nicely in the CasEditor/Viewer, and you could maintain
them in all segments when you split a CAS (e.g. with something based on the
SimpleTextSegmenter example).
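
I.e. something like this (sketch; "metadataType" stands for whatever type
you use for the metadata annotation):

    AnnotationFS docAnn = cas.getDocumentAnnotation();
    AnnotationFS meta = cas.createAnnotation(metadataType,
        docAnn.getBegin(), docAnn.getEnd());
    cas.addFsToIndexes(meta);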

-- Jens

On Sat, Mar 7, 2015 at 5:33 PM, Peter Klügl pklu...@uni-wuerzburg.de
wrote:

 Hi,

 there is no way yet to customize this behavior. The HtmlConverter only
 retains annotations of length > 0, since annotations with length == 0 are
 rather problematic and should be avoided.

 I can add a configuration parameter for keeping these annotations if you
 want (best open an issue for it). What should be the offsets of the
 annotations for elements in the head of the html document? 0, those of the
 first token or those of the document annotation?

 Best,

 Peter


 Am 06.03.2015 um 14:00 schrieb Mario Gazzo:

  We conducted some experiments with both the HtmlAnnotator and the
 HtmlConverter but we ran into an issue with the converter. It appears to
 only convert tag annotations that surround or are inside the body tag.
 Metadata elements like citations are ignored. The only way to get around
 this seems to be by forking and modifying the codebase, which I like to
 avoid. Both modules seem otherwise very useful to us but I am looking for a
 better approach to solve this issue. Is there some way to customise this
 behaviour without code modifications?

 Your input is appreciated, thanks.


  On 18 Feb 2015, at 23:03 , Mario Gazzo mario.ga...@gmail.com wrote:

 Thanks. Looks interesting, seems that it could fit our use case. We will
 have a closer look at it.

  On 18 Feb 2015, at 21:58 , Peter Klügl pklu...@uni-wuerzburg.de
 wrote:

 Hi,

 you might want to take a look at two analysis engines of UIMA Ruta:
 HtmlAnnotator and HtmlConverter [1]

 The former one creates annotations for html element and therefore also
 for xml tags. The latter one creates a new view with only the plain text
 and adds existing annotations while adapting their offsets to the new
 document.

 Best,

 Peter

 [1] http://uima.apache.org/d/ruta-current/tools.ruta.book.html#
 ugr.tools.ruta.ae.html

 Am 18.02.2015 um 21:46 schrieb Mario Gazzo:

 We are starting to use the UIMA framework for NL processing article
 text, which is usually stored with metadata in some XML format. We need to
 extract text elements to be processed by various NL analysis engines that
 only work with pure text but we also need to keep track of the formatting
 information related to the processed text. It is in general also valuable
 for us to be able to track every annotation back to the original XML to
 maintain provenance. Before embarking on this I like to validate our
 approach with more experienced users since this is the first application 
 we
 are building with UIMA.

 In the first step we would annotate every important element of the XML
 including formatting elements in the body. We maintain some DOM-like
 relationships between the body text and formatting annotations so that 
 text
 formatting can be reproduced later with NLP annotations in some article
 viewer.

 Next we would in another AE produce a pure text view of the text
 annotations in the XML view that need to be NL analysed. In this new text
 view we would annotate the different text elements with references back to
 their counterpart in the original XML view so that we can trace back
 positions in the original XML and the formatting relations. This of course
 will require mapping NLP annotation offsets in the text view back to the
 XML view but the information should then be there to make this possible.

 This approach requires somewhat more handcrafted book keeping than we
 initially hoped would be necessary. We haven’t been able to find any
 examples of how this is usually done and the UIMA docs are vague regarding
 managing this kind of relationships across views. We would therefore 
 really
 like to know if there is a simpler and better approach.

 Any feedback is greatly appreciated. Thanks.





Re: Ruta parallel execution

2014-12-19 Thread Jens Grivolla
Hi Silvestre,

there doesn't seem to be anything RUTA-specific in your question. In
principle, UIMA-AS allows parallel scaleout and merges the results (though
I personally have never used it this way), but there are of course a few
things to take into account.

First, you will of course need to properly define the dependencies between
your different analysis engines to ensure you always have all the
necessary information available, meaning that you can only run things in
parallel that are independent of one another. And then you will have to
check whether the overhead of distributing your CAS to several engines
running in parallel and then merging the results isn't greater than just
keeping everything in one colocated pipeline that can pass the information
along more efficiently. I guess you'll have to benchmark your specific
application, but maybe somebody with more experience can give you some
general directions...

Best,
Jens

On Thu, Dec 18, 2014 at 12:26 PM, Silvestre Losada 
silvestre.los...@gmail.com wrote:

 Well let me explain.

 Ruta scripts are really good for working over the output of analysis
 engines: each analysis engine does some atomic piece of work, and using
 Ruta rules you can easily work over the generated annotations, combine
 them, remove them...  What I need is to execute several analysis engines
 in parallel to improve the response time. Right now the analysis engines
 are executed sequentially; I want to execute them in parallel, then take
 the output of all of them and apply some Ruta rules to it.

 Would that be possible?

 On 17 December 2014 at 18:13, Peter Klügl pklu...@uni-wuerzburg.de
 wrote:
 
  Hi,
 
  I haven't used UIMA-AS (with ruta) in a real application yet, but I
  tested it once for an rc. Did you face any problems?
 
  Best
 
  Peter
 
  Am 17.12.2014 14:34, schrieb Silvestre Losada:
   Hi All,
  
   Is there any way to execute ruta scripts in parallel, using uima-AS
aproach? in case yes could you provide me an example.
  
   Kind regards.
  
 
 



Re: CFP: Workshop on Open Infrastructures and Analysis Frameworks for HLT

2014-08-19 Thread Jens Grivolla
The workshop program, along with links to the full papers, is now
available: http://glicom.upf.edu/OIAF4HLT/Program.html

I'm looking forward to seeing many of you there.  I'll be staying at DCU
(College Park).

-- Jens


On Tue, Jul 1, 2014 at 6:52 PM, Jens Grivolla j+...@grivolla.net wrote:

 The list of accepted papers is now available:
 http://glicom.upf.edu/OIAF4HLT/Papers.html

 For anybody interested in attending the workshop and COLING, please
 remember that the early registration deadline is tomorrow, July 2nd.

 Looking forward to seeing many of you there...

 -- Jens


 On Wed, Mar 26, 2014 at 2:34 PM, Jens Grivolla j+...@grivolla.net wrote:

 Workshop on Open Infrastructures and Analysis Frameworks for HLT
 

 http://glicom.upf.edu/OIAF4HLT/

 At the 25th International Conference on Computational Linguistics (COLING
 2014)
 Helix Conference Centre at Dublin City University (DCU)
 23-29 August 2014

 Description
 ---

 Recent advances in digital storage and networking, coupled with the
 extension of human language technologies (HLT) into ever broader areas and
 the persistence of difficulties in software portability, have led to an
 increased focus on development and deployment of web-based infrastructures
 that allow users to access tools and other resources and combine them to
 create novel solutions that can be efficiently composed, tuned, evaluated,
 disseminated and consumed. This in turn engenders collaborative development
 and deployment among individuals and teams across the globe. It also
 increases the need for robust, widely available evaluation methods and
 tools, means to achieve interoperability of software and data from diverse
 sources, means to handle licensing for limited access resources distributed
 over the web, and, perhaps crucially, the need to develop strategies for
 multi-site collaborative work.

 For many decades, NLP has suffered from low software engineering
 standards causing a limited degree of re-usability of code and
 interoperability of different modules within larger NLP systems. While this
 did not really hamper success in limited task areas (such as implementing a
 parser), it caused serious problems for building complex integrated
 software systems, e.g., for information extraction or machine translation.
 This lack of integration has led to duplicated software development,
 work-arounds for programs written in different (versions of) programming
 languages, and ad-hoc tweaking of interfaces between modules developed at
 different sites.

 In recent years, two main frameworks, UIMA and GATE, have emerged that
 aim to allow the easy integration of varied tools through common type
 systems and standardized communication methods for components analysing
 unstructured textual information, such as natural language. Both frameworks
 offer a solid processing infrastructure that allows developers to
 concentrate on the implementation of the actual analytics components. An
 increasing number of members of the NLP community have adopted one of these
 frameworks as a platform for facilitating the creation of reusable NLP
 components that can be assembled to address different NLP tasks depending
 on their order, combination and configuration. Analysis frameworks also
 reduce the problem of reproducibility of NLP results by formalising
 solution composition and making language processing tools shareable.

 Very recently, several efforts have been devoted to the development of
 web service platforms for NLP. These platforms exploit the growing number
 of web-based tools and services available for tasks related to HLT,
 including corpus annotation, configuration and execution of NLP pipelines,
 and evaluation of results and automatic parameter tuning. These platforms
 can also integrate modules and pipelines from existing frameworks such as
 UIMA and GATE, in order to achieve interoperability with a wide variety of
 modules from different sources.

 Many of the issues and challenges surrounding these developments have
 been addressed individually in particular projects and workshops, but there
 are ramifications that cut across all of them. We therefore feel that this
 is the moment to bring together participants representing the range of
 interests that comprise the comprehensive picture for community-driven,
 distributed, collaborative, web-based development and use for language
 processing software and resources. This includes those engaged in
 development of infrastructures for HLT as well as those who will use these
 services and infrastructures, especially for multi-site collaborative work.


 ### Workshop Objectives

 The overall goal of this workshop is to provide a forum for discussion of
 the requirements for an envisaged open “global laboratory” for HLT research
 and development and establish the basis of a community effort to develop
 and support it. To this end, the workshop will include

Re: CFP: Workshop on Open Infrastructures and Analysis Frameworks for HLT

2014-07-02 Thread Jens Grivolla
The list of accepted papers is now available:
http://glicom.upf.edu/OIAF4HLT/Papers.html

For anybody interested in attending the workshop and COLING, please
remember that the early registration deadline is tomorrow, July 2nd.

Looking forward to seeing many of you there...

-- Jens


On Wed, Mar 26, 2014 at 2:34 PM, Jens Grivolla j+...@grivolla.net wrote:

 Workshop on Open Infrastructures and Analysis Frameworks for HLT
 

 http://glicom.upf.edu/OIAF4HLT/

 At the 25th International Conference on Computational Linguistics (COLING
 2014)
 Helix Conference Centre at Dublin City University (DCU)
 23-29 August 2014

 Description
 ---

 Recent advances in digital storage and networking, coupled with the
 extension of human language technologies (HLT) into ever broader areas and
 the persistence of difficulties in software portability, have led to an
 increased focus on development and deployment of web-based infrastructures
 that allow users to access tools and other resources and combine them to
 create novel solutions that can be efficiently composed, tuned, evaluated,
 disseminated and consumed. This in turn engenders collaborative development
 and deployment among individuals and teams across the globe. It also
 increases the need for robust, widely available evaluation methods and
 tools, means to achieve interoperability of software and data from diverse
 sources, means to handle licensing for limited access resources distributed
 over the web, and, perhaps crucially, the need to develop strategies for
 multi-site collaborative work.

 For many decades, NLP has suffered from low software engineering standards
 causing a limited degree of re-usability of code and interoperability of
 different modules within larger NLP systems. While this did not really
 hamper success in limited task areas (such as implementing a parser), it
 caused serious problems for building complex integrated software systems,
 e.g., for information extraction or machine translation. This lack of
 integration has led to duplicated software development, work-arounds for
 programs written in different (versions of) programming languages, and
 ad-hoc tweaking of interfaces between modules developed at different sites.

 In recent years, two main frameworks, UIMA and GATE, have emerged that aim
 to allow the easy integration of varied tools through common type systems
 and standardized communication methods for components analysing
 unstructured textual information, such as natural language. Both frameworks
 offer a solid processing infrastructure that allows developers to
 concentrate on the implementation of the actual analytics components. An
 increasing number of members of the NLP community have adopted one of these
 frameworks as a platform for facilitating the creation of reusable NLP
 components that can be assembled to address different NLP tasks depending
 on their order, combination and configuration. Analysis frameworks also
 reduce the problem of reproducibility of NLP results by formalising
 solution composition and making language processing tools shareable.

 Very recently, several efforts have been devoted to the development of web
 service platforms for NLP. These platforms exploit the growing number of
 web-based tools and services available for tasks related to HLT, including
 corpus annotation, configuration and execution of NLP pipelines, and
 evaluation of results and automatic parameter tuning. These platforms can
 also integrate modules and pipelines from existing frameworks such as UIMA
 and GATE, in order to achieve interoperability with a wide variety of
 modules from different sources.

 Many of the issues and challenges surrounding these developments have been
 addressed individually in particular projects and workshops, but there are
 ramifications that cut across all of them. We therefore feel that this is
 the moment to bring together participants representing the range of
 interests that comprise the comprehensive picture for community-driven,
 distributed, collaborative, web-based development and use for language
 processing software and resources. This includes those engaged in
 development of infrastructures for HLT as well as those who will use these
 services and infrastructures, especially for multi-site collaborative work.


 ### Workshop Objectives

 The overall goal of this workshop is to provide a forum for discussion of
 the requirements for an envisaged open “global laboratory” for HLT research
 and development and establish the basis of a community effort to develop
 and support it. To this end, the workshop will include both presentations
 addressing the issues and challenges of developing, deploying, and using
 the global laboratory for distributed and collaborative efforts and
 discussion that will identify next steps for moving forward, fostering
 community-wide awareness, and establishing and encouraging

Last chance: Workshop on Open Infrastructures and Analysis Frameworks for HLT

2014-06-04 Thread Jens Grivolla
Hello all,

on request of several people who are just now getting back from LREC, we
have again extended the deadline for the Workshop on Open Infrastructures
and Analysis Frameworks for HLT.

The new paper submission deadline is June 10th, 2014

This is looking to be a very nice workshop, with a strong UIMA presence as
well as a chance to see how other frameworks deal with many of the same
issues that we encounter.

I hope to see many of you there. And thanks to those who have already
submitted their paper to the workshop. :-)

-- Jens


On Thu, May 1, 2014 at 12:13 AM, Jens Grivolla j+...@grivolla.net wrote:

 The submission deadline for the workshop was just extended significantly
 to align with some of the other COLING 2014 workshop.

 The new dates are:
 Paper Submission Deadline: 1st June 2014
 Author Notification Deadline: 30th June 2014
 Camera-Ready Paper Deadline: 10th July 2014
 Workshop: 23rd August 2014

 You can find the workshop description and CFP at
 http://glicom.upf.edu/OIAF4HLT/

 I hope to see you there and look forward to your contributions.

 -- Jens


 On Wed, Mar 26, 2014 at 2:34 PM, Jens Grivolla j+...@grivolla.net wrote:

 Workshop on Open Infrastructures and Analysis Frameworks for HLT
 

 http://glicom.upf.edu/OIAF4HLT/

 At the 25th International Conference on Computational Linguistics (COLING
 2014)
 Helix Conference Centre at Dublin City University (DCU)
 23-29 August 2014

 Description
 ---

 Recent advances in digital storage and networking, coupled with the
 extension of human language technologies (HLT) into ever broader areas and
 the persistence of difficulties in software portability, have led to an
 increased focus on development and deployment of web-based infrastructures
 that allow users to access tools and other resources and combine them to
 create novel solutions that can be efficiently composed, tuned, evaluated,
 disseminated and consumed. This in turn engenders collaborative development
 and deployment among individuals and teams across the globe. It also
 increases the need for robust, widely available evaluation methods and
 tools, means to achieve interoperability of software and data from diverse
 sources, means to handle licensing for limited access resources distributed
 over the web, and, perhaps crucially, the need to develop strategies for
 multi-site collaborative work.

 For many decades, NLP has suffered from low software engineering
 standards causing a limited degree of re-usability of code and
 interoperability of different modules within larger NLP systems. While this
 did not really hamper success in limited task areas (such as implementing a
 parser), it caused serious problems for building complex integrated
 software systems, e.g., for information extraction or machine translation.
 This lack of integration has led to duplicated software development,
 work-arounds for programs written in different (versions of) programming
 languages, and ad-hoc tweaking of interfaces between modules developed at
 different sites.

 In recent years, two main frameworks, UIMA and GATE, have emerged that
 aim to allow the easy integration of varied tools through common type
 systems and standardized communication methods for components analysing
 unstructured textual information, such as natural language. Both frameworks
 offer a solid processing infrastructure that allows developers to
 concentrate on the implementation of the actual analytics components. An
 increasing number of members of the NLP community have adopted one of these
 frameworks as a platform for facilitating the creation of reusable NLP
 components that can be assembled to address different NLP tasks depending
 on their order, combination and configuration. Analysis frameworks also
 reduce the problem of reproducibility of NLP results by formalising
 solution composition and making language processing tools shareable.

 Very recently, several efforts have been devoted to the development of
 web service platforms for NLP. These platforms exploit the growing number
 of web-based tools and services available for tasks related to HLT,
 including corpus annotation, configuration and execution of NLP pipelines,
 and evaluation of results and automatic parameter tuning. These platforms
 can also integrate modules and pipelines from existing frameworks such as
 UIMA and GATE, in order to achieve interoperability with a wide variety of
 modules from different sources.

 Many of the issues and challenges surrounding these developments have
 been addressed individually in particular projects and workshops, but there
 are ramifications that cut across all of them. We therefore feel that this
 is the moment to bring together participants representing the range of
 interests that comprise the comprehensive picture for community-driven,
 distributed, collaborative, web-based development and use for language
 processing software

Re: CFP: Workshop on Open Infrastructures and Analysis Frameworks for HLT

2014-05-01 Thread Jens Grivolla
The submission deadline for the workshop was just extended significantly to
align with some of the other COLING 2014 workshop.

The new dates are:
Paper Submission Deadline: 1st June 2014
Author Notification Deadline: 30th June 2014
Camera-Ready Paper Deadline: 10th July 2014
Workshop: 23rd August 2014

You can find the workshop description and CFP at
http://glicom.upf.edu/OIAF4HLT/

I hope to see you there and look forward to your contributions.

-- Jens


On Wed, Mar 26, 2014 at 2:34 PM, Jens Grivolla j+...@grivolla.net wrote:

 Workshop on Open Infrastructures and Analysis Frameworks for HLT
 

 http://glicom.upf.edu/OIAF4HLT/

 At the 25th International Conference on Computational Linguistics (COLING
 2014)
 Helix Conference Centre at Dublin City University (DCU)
 23-29 August 2014

 Description
 ---

 Recent advances in digital storage and networking, coupled with the
 extension of human language technologies (HLT) into ever broader areas and
 the persistence of difficulties in software portability, have led to an
 increased focus on development and deployment of web-based infrastructures
 that allow users to access tools and other resources and combine them to
 create novel solutions that can be efficiently composed, tuned, evaluated,
 disseminated and consumed. This in turn engenders collaborative development
 and deployment among individuals and teams across the globe. It also
 increases the need for robust, widely available evaluation methods and
 tools, means to achieve interoperability of software and data from diverse
 sources, means to handle licensing for limited access resources distributed
 over the web, and, perhaps crucially, the need to develop strategies for
 multi-site collaborative work.

 For many decades, NLP has suffered from low software engineering standards
 causing a limited degree of re-usability of code and interoperability of
 different modules within larger NLP systems. While this did not really
 hamper success in limited task areas (such as implementing a parser), it
 caused serious problems for building complex integrated software systems,
 e.g., for information extraction or machine translation. This lack of
 integration has led to duplicated software development, work-arounds for
 programs written in different (versions of) programming languages, and
 ad-hoc tweaking of interfaces between modules developed at different sites.

 In recent years, two main frameworks, UIMA and GATE, have emerged that aim
 to allow the easy integration of varied tools through common type systems
 and standardized communication methods for components analysing
 unstructured textual information, such as natural language. Both frameworks
 offer a solid processing infrastructure that allows developers to
 concentrate on the implementation of the actual analytics components. An
 increasing number of members of the NLP community have adopted one of these
 frameworks as a platform for facilitating the creation of reusable NLP
 components that can be assembled to address different NLP tasks depending
 on their order, combination and configuration. Analysis frameworks also
 reduce the problem of reproducibility of NLP results by formalising
 solution composition and making language processing tools shareable.

 Very recently, several efforts have been devoted to the development of web
 service platforms for NLP. These platforms exploit the growing number of
 web-based tools and services available for tasks related to HLT, including
 corpus annotation, configuration and execution of NLP pipelines, and
 evaluation of results and automatic parameter tuning. These platforms can
 also integrate modules and pipelines from existing frameworks such as UIMA
 and GATE, in order to achieve interoperability with a wide variety of
 modules from different sources.

 Many of the issues and challenges surrounding these developments have been
 addressed individually in particular projects and workshops, but there are
 ramifications that cut across all of them. We therefore feel that this is
 the moment to bring together participants representing the range of
 interests that comprise the comprehensive picture for community-driven,
 distributed, collaborative, web-based development and use for language
 processing software and resources. This includes those engaged in
 development of infrastructures for HLT as well as those who will use these
 services and infrastructures, especially for multi-site collaborative work.


 ### Workshop Objectives

 The overall goal of this workshop is to provide a forum for discussion of
 the requirements for an envisaged open “global laboratory” for HLT research
 and development and establish the basis of a community effort to develop
 and support it. To this end, the workshop will include both presentations
 addressing the issues and challenges of developing, deploying, and using
 the global laboratory

Re: next UIMA workshop?

2014-04-09 Thread Jens Grivolla
On Mon, Mar 31, 2014 at 10:12 PM, Marshall Schor m...@schor.com wrote:


 On 3/26/2014 9:44 AM, Jens Grivolla wrote:
  Finally, despite the fact that UIMA does not appear in the title anymore,
  would it be possible to have an announcement on the UIMA web page?

 I think so (unless others disagree).  Can you draft something?


I tried to prepare a draft for svnpubsub to see how it fits with the UIMA
site (without linking to it at first, of course), and created
uima-website/xdocs/coling14.xml. It then seems that I need to rebuild the
site on my machine with Ant and push the resulting changes in docs/, which
I did. The resulting page can be seen at http://uima.apache.org/coling14.html
and looks more or less OK.

I hope I didn't do anything wrong by committing directly to the site, but I
didn't find a good way to try it in the actual page layout and show the
results otherwise. In any case it's not linked from anywhere and shouldn't
affect any other parts of the site.

-- Jens


Re: next UIMA workshop?

2014-03-26 Thread Jens Grivolla
Hi all, I have just posted the (more or less) final CFP on uima-user and 
uima-dev.


Feel free to distribute the CFP to anybody you think would be interested.
While this has been merged with a different workshop and thus has a
somewhat wider scope than just UIMA, I still view this as a follow-up to
the UIMA workshop at GSCL and would hope to have similarly interesting
contributions from the UIMA community.


If you are a PC member, or willing to be one, please contact me off-list 
with the email address and affiliation that you would like me to use for 
this purpose.


Finally, despite the fact that UIMA does not appear in the title 
anymore, would it be possible to have an announcement on the UIMA web page?


-- Jens

On 05/02/14 11:46, Jens Grivolla wrote:

We have been asked to merge our workshop with a similar one focusing on
open infrastructures.  The result is a "Workshop on Open
Infrastructures and Analysis Frameworks for HLT".

We will now start to build a common CFP from the two proposals.  All
contributions are welcome:
https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/cfp.md


-- Jens

On 19/01/14 15:40, Jens Grivolla wrote:

I have sent the proposal, we'll see what they say...

-- Jens

On 17/01/14 15:02, Jens Grivolla wrote:

On 15/01/14 20:51, Richard Eckart de Castilho wrote:

On 15.01.2014, at 15:10, Jens Grivolla
j+...@grivolla.net wrote:

The CFP itself must still be rewritten to be less UIMA-centric; other
than that, this is starting to look quite good.


GATE developer Mark A. Greenwood did the rewrite and sent me a pull
request on Github.


For example, the topic "experience reports combining UIMA-based
components from different sources, as well as solutions to
interoperability issues" could be reworded as:

1) experience reports combining language analysis components from
different sources, as well as solutions to interoperability issues

2) experience reports combining different frameworks (e.g.
GATE/UIMA/WebLicht/etc.), as well as solutions to interoperability
issues


I put both in there as separate points.


I think both aspects would be interesting. I'm a little afraid that 1)
might end up reiterating the existence of frameworks like UIMA, while 2)
would end up relying on web-services or semantic web stuff for
interoperability - which may not be very interesting. I'd be more
interested in issues and solutions that exist beyond this, e.g. with
regard to the interchangeability of components. What problems exist
when e.g. one parser component in a workflow is replaced with a
different one? How can these be solved? (Cf. Noh and Padó, 2013 [1]).


Agree.  Subtle semantic differences between alternative components can
be more challenging than the technical integration.  I'm not sure how to
put that in the CFP without it getting very verbose, though.


I think one more topic could be added:

- combining annotation type systems in processing frameworks (GATE,
UIMA, etc.) with standardization efforts, such as done in the ISO
TC37/SC4 or TEI contexts.


Done. Thanks for your input.

As always, the current state of the proposal can be seen on Github:
https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/proposal.md




I think the current version is pretty close to final. If there are any
more suggestions hurry up, the deadline is approaching.

-- Jens


Re: next UIMA workshop?

2014-02-05 Thread Jens Grivolla
We have been asked to merge our workshop with a similar one focusing on 
open infrastructures.  The result is a Workshop on Open 
Infrastructures and Analysis Frameworks for HLT.


We will now start to build a common CFP from the two proposals.  All 
contributions are welcome: 
https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/cfp.md


-- Jens

On 19/01/14 15:40, Jens Grivolla wrote:

I have sent the proposal, we'll see what they say...

-- Jens

On 17/01/14 15:02, Jens Grivolla wrote:

On 15/01/14 20:51, Richard Eckart de Castilho wrote:

On 15.01.2014, at 15:10, Jens Grivolla
j+...@grivolla.net wrote:

The CFP itself must still be rewritten to be less UIMA-centric; other
than that, this is starting to look quite good.


GATE developer Mark A. Greenwood did the rewrite and sent me a pull
request on Github.


For example, the topic experience reports combining UIMA-based
components from different sources, as well as solutions to
interoperability issues could be reworded as:

1) experience reports combining language analysis components from
different sources, as well as solutions to interoperability issues

2) experience reports combining different frameworks (e.g.
GATE/UIMA/WebLicht/etc.), as well as solutions to interoperability
issues


I put both in there as separate points.


I think both aspects would be interesting. I'm a little afraid that 1)
might end up reiterating the existence of frameworks like UIMA, while 2)
would end up relying on web-services or semantic web stuff for
interoperability - which may not be very interesting. I'd be more
interested in issues and solutions that exist beyond this, e.g. with
regard to the interchangeability of components. What problems exist
when e.g. one parser component in a workflow is replaced with a
different one? How can these be solved? (Cf. Noh and Padó, 2013 [1]).


Agree.  Subtle semantic differences between alternative components can
be more challenging than the technical integration.  I'm not sure how to
put that in the CFP without it getting very verbose, though.


I think one more topic could be added:

- combining annotation type systems in processing frameworks (GATE,
UIMA, etc.) with standardization efforts, such as done in the ISO
TC37/SC4 or TEI contexts.


Done. Thanks for your input.

As always, the current state of the proposal can be seen on Github:
https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/proposal.md



I think the current version is pretty close to final. If there are any
more suggestions hurry up, the deadline is approaching.

-- Jens


Re: next UIMA workshop?

2014-01-19 Thread Jens Grivolla

I have sent the proposal, we'll see what they say...

-- Jens

On 17/01/14 15:02, Jens Grivolla wrote:

On 15/01/14 20:51, Richard Eckart de Castilho wrote:

On 15.01.2014, at 15:10, Jens Grivolla
j+...@grivolla.net wrote:

The CFP itself must still be rewritten to be less UIMA-centric; other
than that, this is starting to look quite good.


GATE developer Mark A. Greenwood did the rewrite and sent me a pull
request on Github.


For example, the topic experience reports combining UIMA-based
components from different sources, as well as solutions to
interoperability issues could be reworded as:

1) experience reports combining language analysis components from
different sources, as well as solutions to interoperability issues

2) experience reports combining different frameworks (e.g.
GATE/UIMA/WebLicht/etc.), as well as solutions to interoperability issues


I put both in there as separate points.


I think both aspects would be interesting. I'm a little afraid that 1)
might end up reiterating the existence of frameworks like UIMA, while 2)
would end up relying on web-services or semantic web stuff for
interoperability - which may not be very interesting. I'd be more
interested in issues and solutions that exist beyond this, e.g. with
regard to the interchangeability of components. What problems exist
when e.g. one parser component in a workflow is replaced with a
different one? How can these be solved? (Cf. Noh and Padó, 2013 [1]).


Agree.  Subtle semantic differences between alternative components can
be more challenging than the technical integration.  I'm not sure how to
put that in the CFP without it getting very verbose, though.


I think one more topic could be added:

- combining annotation type systems in processing frameworks (GATE,
UIMA, etc.) with standardization efforts, such as done in the ISO
TC37/SC4 or TEI contexts.


Done. Thanks for your input.

As always, the current state of the proposal can be seen on Github:
https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/proposal.md


I think the current version is pretty close to final. If there are any
more suggestions hurry up, the deadline is approaching.

-- Jens







Re: next UIMA workshop?

2014-01-17 Thread Jens Grivolla

On 15/01/14 20:51, Richard Eckart de Castilho wrote:

On 15.01.2014, at 15:10, Jens Grivolla j+...@grivolla.net wrote:

The CFP itself must still be rewritten to be less UIMA-centric; other than that, 
this is starting to look quite good.


GATE developer Mark A. Greenwood did the rewrite and sent me a pull 
request on Github.



For example, the topic experience reports combining UIMA-based components from 
different sources, as well as solutions to interoperability issues could be 
reworded as:

1) experience reports combining language analysis components from different 
sources, as well as solutions to interoperability issues

2) experience reports combining different frameworks (e.g. 
GATE/UIMA/WebLicht/etc.), as well as solutions to interoperability issues


I put both in there as separate points.


I think both aspects would be interesting. I'm a little afraid that 1) might 
end up reiterating the existence of frameworks like UIMA, while 2) would end up 
relying on web-services or semantic web stuff for interoperability - which 
may not be very interesting. I'd be more interested in issues and solutions 
that exist beyond this, e.g. with regard to the interchangeability of components. 
What problems exist when e.g. one parser component in a workflow is replaced 
with a different one? How can these be solved? (Cf. Noh and Padó, 2013 [1]).


Agree.  Subtle semantic differences between alternative components can 
be more challenging than the technical integration.  I'm not sure how to 
put that in the CFP without it getting very verbose, though.



I think one more topic could be added:

- combining annotation type systems in processing frameworks (GATE, UIMA, etc.) 
with standardization efforts, such as done in the ISO TC37/SC4 or TEI contexts.


Done. Thanks for your input.

As always, the current state of the proposal can be seen on Github: 
https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/proposal.md


I think the current version is pretty close to final. If there are any 
more suggestions hurry up, the deadline is approaching.


-- Jens



Re: next UIMA workshop?

2014-01-15 Thread Jens Grivolla

Thanks, fixed.

On 14/01/14 19:04, Peter Klügl wrote:

Hi,

Just a small correction:
The last workshop had nine paper presentations and one invited talk.

Best,

Peter

Am 14.01.2014 18:11, schrieb Jens Grivolla:

Hello, there's only 5 days remaining to submit the workshop proposal.
Please anybody interested get in touch.

I sent a mail to the GATE user list to get some input from them.  The
proposal draft is here:
https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/proposal.md

-- Jens

On 19/12/13 13:29, Jens Grivolla wrote:

On 19/12/13 13:08, Peter Klügl wrote:

Am 19.12.2013 12:31, schrieb Jens Grivolla:

Ok, it's time to seriously get started on this.

I guess we can start with the GSCL workshop description, and maybe
make it more inclusive for other frameworks (GATE, etc.)

We need a couple of organizers (me, Renaud, ...?) and a potential PC
(again, start with the one from GSCL) preferably with a few already
confirmed PC members (Richard, ...)



If the workshop is more inclusive for other frameworks, maybe it's
reasonable to ask one of the GATE people whether they want to
co-organize the workshop.


Yes, we definitely would need to reach out to them.  First we need to
decide: do we want a more focused workshop (just UIMA), or are the
problems faced by GATE users (and others) sufficiently similar that we
can learn from each other?

If we want to get the GATE people in there: does anybody have contacts
in that community?


I won't be able to help with the organization, but maybe as a part of
the PC.


I take that as having you as a confirmed PC member ;-)


I can also not promise that I will submit something, but I will
motivate
our working group.


Ok, that's great.

I started the draft proposal here:
https://github.com/jgrivolla/coling2014-nlp-framework-workshop

Thanks,
Jens


Re: COLING 2014 - some information

2014-01-15 Thread Jens Grivolla

Dear Luca and Sylvain,

as you can see, the workshop is still in the proposal phase. If it is 
accepted by the COLING organizers, pricing etc. will be set by them.


It will of course be possible to attend without presenting a paper, and 
on the other hand we are open to all kinds of contributions, in 
particular those related to industry use of UIMA.


Best regards,
Jens

On 15/01/14 11:36, Sylvain Surcin wrote:

Hello,

I am also interested in joining this workshop about UIMA.
We have been running a full UIMA driven processing chain in my company for
years and are in the process of releasing some components as open source
together with University of Marne-la-Vallée (France).
It could be interesting to disseminate some info about that.

Best regards,
--

Sylvain SURCIN, Ph.D.
*KWAGA*
Senior Software Architect
15, rue Jean-Baptiste Berlier
75013 Paris
France
Tél.: +33 (0)1.55.43.79.20




On Wed, Jan 15, 2014 at 11:16 AM, Luca Foppiano l...@foppiano.org wrote:


Dear all,
 I'm a new member of this mailing list and a new user of Apache UIMA. When
starting to use UIMA I'm facing exactly the problems listed in the
introduction of the web page (

https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/proposal.md
).
:)

I'm very interested in joining this conference/workshop, and I want to know if it
is possible to join it as an attendee. I'm not affiliated with any university
or research center.

My plan is to participate in SemEval, and since COLING shares the location,
to join it as well. Is there any limitation or price for it?

Thanks in advance
--
Luca Foppiano

Software Engineer
+31615253280
l...@foppiano.org
www.foppiano.org








Re: next UIMA workshop?

2014-01-15 Thread Jens Grivolla
Just a quick update: the proposal is progressing nicely, with very 
positive response from the GATE people.  In fact, it will be 
co-organised by a GATE core team member and several core developers are 
on the PC.


The CFP itself must still be rewritten to be less UIMA-centric; other 
than that, this is starting to look quite good.


Any input is welcome, so if you have any suggestions hurry up...

-- Jens

On 15/01/14 10:41, Jens Grivolla wrote:

Thanks, fixed.

On 14/01/14 19:04, Peter Klügl wrote:

Hi,

Just a small correction:
The last workshop had nine paper presentations and one invited talk.

Best,

Peter

Am 14.01.2014 18:11, schrieb Jens Grivolla:

Hello, there's only 5 days remaining to submit the workshop proposal.
Please anybody interested get in touch.

I sent a mail to the GATE user list to get some input from them.  The
proposal draft is here:
https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/proposal.md


-- Jens

On 19/12/13 13:29, Jens Grivolla wrote:

On 19/12/13 13:08, Peter Klügl wrote:

Am 19.12.2013 12:31, schrieb Jens Grivolla:

Ok, it's time to seriously get started on this.

I guess we can start with the GSCL workshop description, and maybe
make it more inclusive for other frameworks (GATE, etc.)

We need a couple of organizers (me, Renaud, ...?) and a potential PC
(again, start with the one from GSCL) preferably with a few already
confirmed PC members (Richard, ...)



If the workshop is more inclusive for other frameworks, maybe it's
reasonable to ask one of the GATE people whether they want to
co-organize the workshop.


Yes, we definitely would need to reach out to them.  First we need to
decide: do we want a more focused workshop (just UIMA), or are the
problems faced by GATE users (and others) sufficiently similar that we
can learn from each other?

If we want to get the GATE people in there: does anybody have contacts
in that community?


I won't be able to help with the organization, but maybe as a part of
the PC.


I take that as having you as a confirmed PC member ;-)


I can also not promise that I will submit something, but I will
motivate
our working group.


Ok, that's great.

I started the draft proposal here:
https://github.com/jgrivolla/coling2014-nlp-framework-workshop

Thanks,
Jens


Re: next UIMA workshop?

2014-01-14 Thread Jens Grivolla
Hello, there's only 5 days remaining to submit the workshop proposal. 
Please anybody interested get in touch.


I sent a mail to the GATE user list to get some input from them.  The 
proposal draft is here: 
https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/proposal.md


-- Jens

On 19/12/13 13:29, Jens Grivolla wrote:

On 19/12/13 13:08, Peter Klügl wrote:

Am 19.12.2013 12:31, schrieb Jens Grivolla:

Ok, it's time to seriously get started on this.

I guess we can start with the GSCL workshop description, and maybe
make it more inclusive for other frameworks (GATE, etc.)

We need a couple of organizers (me, Renaud, ...?) and a potential PC
(again, start with the one from GSCL) preferably with a few already
confirmed PC members (Richard, ...)



If the workshop is more inclusive for other frameworks, maybe it's
reasonable to ask one of the GATE people whether they want to
co-organize the workshop.


Yes, we definitely would need to reach out to them.  First we need to
decide: do we want a more focused workshop (just UIMA), or are the
problems faced by GATE users (and others) sufficiently similar that we
can learn from each other?

If we want to get the GATE people in there: does anybody have contacts
in that community?


I won't be able to help with the organization, but maybe as a part of
the PC.


I take that as having you as a confirmed PC member ;-)


I can also not promise that I will submit something, but I will motivate
our working group.


Ok, that's great.

I started the draft proposal here:
https://github.com/jgrivolla/coling2014-nlp-framework-workshop

Thanks,
Jens







Re: next UIMA workshop?

2014-01-14 Thread Jens Grivolla
As I understand it, poster presentations are only used as a way to 
offload submissions that didn't make it as a full paper. I don't think 
that such a distinction is useful for this workshop and would prefer to 
have oral presentations for all interesting contributions.


If we expected to have significantly more contributions that can fit 
into the schedule then concentrating some of them into a poster session 
might make sense, but I don't think this is the case.


If on the other hand posters were used to get additional visibility 
outside of the workshop then this could be interesting...


-- Jens

On 14/01/14 18:36, Michael Tanenblatt wrote:

I’ll certainly be on the Program Committee, and am willing to help in any ways 
that I am able. Regarding the proposal, overall it looks pretty reasonable, but 
what is the reason for limiting to oral presentations and omitting posters?

..m

On Jan 14, 2014, at 12:11 PM, Jens Grivolla j+...@grivolla.net wrote:


Hello, there's only 5 days remaining to submit the workshop proposal. Please 
anybody interested get in touch.

I sent a mail to the GATE user list to get some input from them.  The proposal 
draft is here: 
https://github.com/jgrivolla/coling2014-nlp-framework-workshop/blob/master/proposal.md

-- Jens

On 19/12/13 13:29, Jens Grivolla wrote:

On 19/12/13 13:08, Peter Klügl wrote:

Am 19.12.2013 12:31, schrieb Jens Grivolla:

Ok, it's time to seriously get started on this.

I guess we can start with the GSCL workshop description, and maybe
make it more inclusive for other frameworks (GATE, etc.)

We need a couple of organizers (me, Renaud, ...?) and a potential PC
(again, start with the one from GSCL) preferably with a few already
confirmed PC members (Richard, ...)



If the workshop is more inclusive for other frameworks, maybe it's
reasonable to ask one of the GATE people whether they want to
co-organize the workshop.


Yes, we definitely would need to reach out to them.  First we need to
decide: do we want a more focused workshop (just UIMA), or are the
problems faced by GATE users (and others) sufficiently similar that we
can learn from each other?

If we want to get the GATE people in there: does anybody have contacts
in that community?


I won't be able to help with the organization, but maybe as a part of
the PC.


I take that as having you as a confirmed PC member ;-)


I can also not promise that I will submit something, but I will motivate
our working group.


Ok, that's great.

I started the draft proposal here:
https://github.com/jgrivolla/coling2014-nlp-framework-workshop

Thanks,
Jens


Re: next UIMA workshop?

2013-12-19 Thread Jens Grivolla

Ok, it's time to seriously get started on this.

I guess we can start with the GSCL workshop description, and maybe make 
it more inclusive for other frameworks (GATE, etc.)


We need a couple of organizers (me, Renaud, ...?) and a potential PC 
(again, start with the one from GSCL) preferably with a few already 
confirmed PC members (Richard, ...)


I'll get started with a first draft. Any input is welcome.

Please also indicate if you plan to submit an article, in order to have 
a first idea of what to expect...


Thanks,
Jens

On 21/10/13 11:44, Jens Grivolla wrote:

Hi, at GSCL 2013 we talked a bit about options for the next UIMA
workshop. How about trying to have it at COLING 2014?

 WORKSHOP TIMELINE
 • 19th January 2014: Workshop proposals due
 • 26th January 2014: Notification of workshop acceptances
 • 18th July 2014: Camera-ready deadline for workshop proceedings
 • 23rd and 24th August 2014: COLING Workshops

http://www.coling-2014.org/workshop-call.php

So that would be approximately one year after the GSCL workshop which
would probably give enough time for people to have new things to
present, and there are still 3 months before submitting the workshop
proposal.

COLING is going to be in Dublin, which makes it relatively easy to
attend for the European UIMA community.

What do you think?

Bye,
Jens







Re: big offsets efficiency, and multiple offsets

2013-12-05 Thread Jens Grivolla
I agree that it might make more sense to model our needs more directly 
instead of trying to squeeze it into the schema we normally use for text 
processing.  But at the same time I would of course like to avoid having 
to reimplement many of the things that are already available when using 
AnnotationBase.


For the cross-view indexing issue I was thinking of creating individual 
views for each modality and then a merged view that just contains a 
subset of annotations of each view, and on which we would do the 
cross-modal reasoning.


I just looked again at the GaleMultiModalExample (not much there, 
unfortunately) and saw that e.g. AudioSpan derives from AnnotationBase 
but still has float values for begin/end.  I would be really interested 
in learning more about what was done in GALE, but it's hard to find any 
relevant information...


Thanks,
Jens

On 04/12/13 20:16, Marshall Schor wrote:

Echoing Richard,

1) It would perhaps make more sense to be more direct about each of the
different types of data.  UIMA built-in only the most popular things - and
Annotation was one of them.

Annotation derives from Annotation-base, which just defines an associated Sofa /
view.

So it would make more sense to define different kinds of highest-level
abstractions for your project, related to the different kinds of views/sofas.
Audio might entail a begin / end style of offsets;  Images might entail a pair
x-y coordinates, to describe a (square) subset of an image.  Video might do
something like audio, or something more complex...

UIMA's use of the AnnotationBase includes insuring that when you add-to-indexes
(an operation that implicitly takes a view - and adds a FS to that view), that
if the FS is a subtype of AnnotationBase, then the FS must be indexed in the
associated view to which that FS belongs; if you try to add-to-index in a view
other than the one the FS was created in, you get this kind of error:

Error - the Annotation {0} is over view {1} and cannot be added to indexes
associated with the different view {2}.

The logic behind this restriction is:  an Annotation (or, more generally, an
object having a supertype of AnnotationBase) is (by definition) associated with
a particular Sofa/View,  and it is more likely that it is an error if that
annotation is indexed with a sofa it doesn't belong with.

Of course, Feature Structures which are not Annotations (or more generally, not
derived from AnnotationBase), can be indexed in multiple views.

2) By keeping separate notions for pointers-into-the-Sofa, you can define
algorithmic mappings for these that make the best sense for your project,
including notions of fuzzyness, time-shift (imagine the audio is out-of-sync
with the video, like lots of u-tube things seem to be), etc.

-Marshall


On 12/4/2013 9:31 AM, Jens Grivolla wrote:

Hi, we're now starting the EUMSSI project, which deals with integrating
annotation layers coming from audio, video and text analysis.

We're thinking to base it all on UIMA, having different views with separate
audio, video, transcribed text, etc. sofas.  In order to align the different
views we need to have a common offset specification that allows us to map e.g.
character offsets to the corresponding timestamps.

In order to avoid float timestamps (which would mean we can't derive from
Annotation) I was thinking of using audio/video frames with e.g. 100 or 1000
frames/second.  Annotation has begin and end defined as signed 32 bit ints,
leaving sufficient room for very long documents even at 1000 fps, so I don't
think we're going to run into any limits there.  Is there anything that could
become problematic when working with offsets that are probably quite a bit
larger than what is typically found with character offsets?

Also, can I have several indexes on the same annotations in order to work with
character offsets for text analysis, but then efficiently query for
overlapping annotations from other views based on frame offsets?

Btw, if you're interested in the project we have a writeup (condensed from the
project proposal) here:
https://dl.dropboxusercontent.com/u/4169273/UIMA_EUMSSI.pdf and there will
hopefully soon be some content on http://eumssi.eu/

Thanks,
Jens


Re: big offsets efficiency, and multiple offsets

2013-12-05 Thread Jens Grivolla
I forgot to say that the text analysis view(s) will necessarily have to 
use character offsets so that we can obtain the coveredText, which means 
that all resulting annotations will also use character offsets.  The 
merged view will need to use time-based offsets which means that we have 
to recreate the annotations there with mapped offsets rather than just 
index the same annotations in a different view.


I think that basically means that we won't do much cross-view querying 
but rather have one component (AE) that reads from all views and creates 
a new one with new independent annotations after mapping the offsets.
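
Roughly what I have in mind for that component (just a sketch - the view names
and the charToFrame() mapping are placeholders for whatever we end up defining
in the project, and it blindly copies plain Annotations):

import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CASException;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

public class OffsetMappingAnnotator extends JCasAnnotator_ImplBase {

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    try {
      JCas textView = jcas.getView("TextView");      // created upstream
      JCas mergedView = jcas.getView("MergedView");  // created upstream
      for (Annotation a : JCasUtil.select(textView, Annotation.class)) {
        // Recreate the annotation in the merged view with time-based offsets;
        // the new FS belongs to (and is indexed in) the merged view only.
        new Annotation(mergedView, charToFrame(a.getBegin()), charToFrame(a.getEnd()))
            .addToIndexes();
      }
    } catch (CASException e) {
      throw new AnalysisEngineProcessException(e);
    }
  }

  // Placeholder: the real mapping would come from the transcription alignment.
  private int charToFrame(int charOffset) {
    return charOffset;
  }
}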


-- Jens

On 05/12/13 10:04, Jens Grivolla wrote:

I agree that it might make more sense to model our needs more directly
instead of trying to squeeze it into the schema we normally use for text
processing.  But at the same time I would of course like to avoid having
to reimplement many of the things that are already available when using
AnnotationBase.

For the cross-view indexing issue I was thinking of creating individual
views for each modality and then a merged view that just contains a
subset of annotations of each view, and on which we would do the
cross-modal reasoning.

I just looked again at the GaleMultiModalExample (not much there,
unfortunately) and saw that e.g. AudioSpan derives from AnnotationBase
but still has float values for begin/end.  I would be really interested
in learning more about what was done in GALE, but it's hard to find any
relevant information...

Thanks,
Jens

On 04/12/13 20:16, Marshall Schor wrote:

Echoing Richard,

1) It would perhaps make more sense to be more direct about each of the
different types of data.  UIMA built-in only the most popular
things - and
Annotation was one of them.

Annotation derives from Annotation-base, which just defines an
associated Sofa /
view.

So it would make more sense to define different kinds of highest-level
abstractions for your project, related to the different kinds of
views/sofas.
Audio might entail a begin / end style of offsets;  Images might
entail a pair
x-y coordinates, to describe a (square) subset of an image.  Video
might do
something like audio, or something more complex...

UIMA's use of the AnnotationBase includes insuring that when you
add-to-indexes
(an operation that implicitly takes a view - and adds a FS to that
view), that
if the FS is a subtype of AnnotationBase, then the FS must be indexed
in the
associated view to which that FS belongs; if you try to add-to-index
in a view
other than the one the FS was created in, you get this kind of error:

Error - the Annotation {0} is over view {1} and cannot be added to
indexes
associated with the different view {2}.

The logic behind this restriction is:  an Annotation (or, more
generally, an
object having a supertype of AnnotationBase) is (by definition)
associated with
a particular Sofa/View,  and it is more likely that it is an error if
that
annotation is indexed with a sofa it doesn't belong with.

Of course, Feature Structures which are not Annotations (or more
generally, not
derived from AnnotationBase), can be indexed in multiple views.

2) By keeping separate notions for pointers-into-the-Sofa, you can define
algorithmic mappings for these that make the best sense for your project,
including notions of fuzzyness, time-shift (imagine the audio is
out-of-sync
with the video, like lots of u-tube things seem to be), etc.

-Marshall


On 12/4/2013 9:31 AM, Jens Grivolla wrote:

Hi, we're now starting the EUMSSI project, which deals with integrating
annotation layers coming from audio, video and text analysis.

We're thinking to base it all on UIMA, having different views with
separate
audio, video, transcribed text, etc. sofas.  In order to align the
different
views we need to have a common offset specification that allows us to
map e.g.
character offsets to the corresponding timestamps.

In order to avoid float timestamps (which would mean we can't derive
from
Annotation) I was thinking of using audio/video frames with e.g. 100
or 1000
frames/second.  Annotation has begin and end defined as signed 32 bit
ints,
leaving sufficient room for very long documents even at 1000 fps, so
I don't
think we're going to run into any limits there.  Is there anything
that could
become problematic when working with offsets that are probably quite
a bit
larger than what is typically found with character offsets?

Also, can I have several indexes on the same annotations in order to
work with
character offsets for text analysis, but then efficiently query for
overlapping annotations from other views based on frame offsets?

Btw, if you're interested in the project we have a writeup (condensed
from the
project proposal) here:
https://dl.dropboxusercontent.com/u/4169273/UIMA_EUMSSI.pdf and there
will
hopefully soon be some content on http://eumssi.eu/

Thanks,
Jens



big offsets efficiency, and multiple offsets

2013-12-04 Thread Jens Grivolla
Hi, we're now starting the EUMSSI project, which deals with integrating 
annotation layers coming from audio, video and text analysis.


We're thinking to base it all on UIMA, having different views with 
separate audio, video, transcribed text, etc. sofas.  In order to align 
the different views we need to have a common offset specification that 
allows us to map e.g. character offsets to the corresponding timestamps.


In order to avoid float timestamps (which would mean we can't derive 
from Annotation) I was thinking of using audio/video frames with e.g. 
100 or 1000 frames/second.  Annotation has begin and end defined as 
signed 32 bit ints, leaving sufficient room for very long documents even 
at 1000 fps, so I don't think we're going to run into any limits there. 
 Is there anything that could become problematic when working with 
offsets that are probably quite a bit larger than what is typically 
found with character offsets?
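
Concretely I'm thinking of something as trivial as this (FRAMES_PER_SECOND
being an assumed project-wide constant):

public final class FrameOffsets {

  public static final int FRAMES_PER_SECOND = 1000;

  // Convert a timestamp in seconds to an integer frame offset.
  public static int toFrames(double seconds) {
    long frames = Math.round(seconds * FRAMES_PER_SECOND);
    if (frames > Integer.MAX_VALUE) {
      // At 1000 fps, Integer.MAX_VALUE corresponds to roughly 24 days of media,
      // so this should never trigger for realistic documents.
      throw new IllegalArgumentException("Offset exceeds 32-bit range: " + seconds);
    }
    return (int) frames;
  }

  public static double toSeconds(int frames) {
    return frames / (double) FRAMES_PER_SECOND;
  }
}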


Also, can I have several indexes on the same annotations in order to 
work with character offsets for text analysis, but then efficiently 
query for overlapping annotations from other views based on frame offsets?


Btw, if you're interested in the project we have a writeup (condensed 
from the project proposal) here: 
https://dl.dropboxusercontent.com/u/4169273/UIMA_EUMSSI.pdf and there 
will hopefully soon be some content on http://eumssi.eu/


Thanks,
Jens



Re: big offsets efficiency, and multiple offsets

2013-12-04 Thread Jens Grivolla
True, but don't things like selectCovered() etc. expect Annotations (to 
match on begin/end)? So using Annotation might make it easier in some 
cases to select the annotations we're interested in.
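
E.g. a helper like the following only works because everything involved derives
from Annotation ("segment" being any covering annotation):

import java.util.List;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

public class CoverageExample {
  // Returns all annotations whose begin/end fall inside the covering segment.
  public static List<Annotation> coveredBy(JCas view, Annotation segment) {
    return JCasUtil.selectCovered(view, Annotation.class, segment);
  }
}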


-- Jens

On 04/12/13 15:35, Richard Eckart de Castilho wrote:

Why is it bad if you cannot inherit from Annotation? The getCoveredText() will 
not work anyway if you are working with audio/video data.

-- Richard

On 04.12.2013, at 12:31, Jens Grivolla j+...@grivolla.net wrote:


Hi, we're now starting the EUMSSI project, which deals with integrating 
annotation layers coming from audio, video and text analysis.

We're thinking to base it all on UIMA, having different views with separate 
audio, video, transcribed text, etc. sofas.  In order to align the different 
views we need to have a common offset specification that allows us to map e.g. 
character offsets to the corresponding timestamps.

In order to avoid float timestamps (which would mean we can't derive from 
Annotation) I was thinking of using audio/video frames with e.g. 100 or 1000 
frames/second.  Annotation has begin and end defined as signed 32 bit ints, 
leaving sufficient room for very long documents even at 1000 fps, so I don't 
think we're going to run into any limits there.  Is there anything that could 
become problematic when working with offsets that are probably quite a bit 
larger than what is typically found with character offsets?

Also, can I have several indexes on the same annotations in order to work with 
character offsets for text analysis, but then efficiently query for overlapping 
annotations from other views based on frame offsets?

Btw, if you're interested in the project we have a writeup (condensed from the 
project proposal) here: 
https://dl.dropboxusercontent.com/u/4169273/UIMA_EUMSSI.pdf and there will 
hopefully soon be some content on http://eumssi.eu/

Thanks,
Jens







Re: uimaFIT: managing component configurations

2013-11-28 Thread Jens Grivolla
Hi, basically I'm looking for a way to manage engine descriptions. So 
far I'm using createEngineDescription(...) when building a pipeline.  If 
my component, i.e. in the case of uimaFIT the class that implements the 
AE, defines good default values that is very easy and concise.


However, I also have many cases where I have different 
descriptors/descriptions that use the same class, and sometimes I then 
again override some parameters.  I would like to manage those 
descriptions separately from the pipeline where they are used, e.g. with 
Maven.  I want to avoid copy and pasting common parameter configurations 
from one pipeline to the other.


So far I was using different XML descriptors, each packaged in a 
separate PEAR as an independent component, and would build my pipeline 
based on those.  With uimaFIT I haven't found a good way to do this.


Basically I would like to have inheritance at the description level. 
This could either be through Java inheritance (i.e. CountryMapper.class 
extends ConceptMapper.class and overrides default parameter values) or 
through a way to store EngineDescriptions and reuse them, without having 
to resort to XML files.


So I would in some place define countryMapper = 
createEngineDescription(ConceptMapper.class, parameters) and package 
that as a Maven artifact, and somewhere else use it to build a pipeline 
using createEngineDescription(countryMapper, additional_parameters).


My problem is that I don't think I can override the default values with 
Java inheritance, and don't have a good way to package 
EngineDescriptions.  I guess I could have a class with a static method 
that returns the engine description and package that, but it would be 
nice to have something more standard and elegant.
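
To make that concrete, the kind of factory class I have in mind (assuming the
ConceptMapper add-on annotator class; the parameter names and values below are
only illustrative, not necessarily the real ones):

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.conceptMapper.ConceptMapper;
import org.apache.uima.resource.ResourceInitializationException;

public final class CountryMapperDescriptions {

  // Packaged as its own Maven artifact and reused from any pipeline.
  public static AnalysisEngineDescription base() throws ResourceInitializationException {
    return createEngineDescription(ConceptMapper.class,
        "TokenAnnotation", "my.types.Token",   // illustrative parameter
        "LanguageID", "es");                   // illustrative parameter
  }
}

A pipeline would then just call CountryMapperDescriptions.base() and, where
needed, tweak individual parameters on the returned description.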


Thanks,
Jens

On 11/28/2013 12:37 AM, Richard Eckart de Castilho wrote:

Hi,

I'm not sure that I understand what you want to do. When you create a 
descriptor for a component e.g. using createEngineDescription(…), this 
descriptor is configured with the default values (unless you override them in 
the call to createEngineDescription).

You can change parameters on such a descriptor using 
ResourceCreationSpecifierFactory.setConfigurationParameters(…)

Does that help? Can you make a more vivid example of what you are trying to 
accomplish, maybe with a  bit if pseudo-code marking those places that remain 
unclear how to handle them?

Cheers,

-- Richard

On 27.11.2013, at 07:47, Jens Grivolla j+...@grivolla.net wrote:


Hi,

so far we were using PEARs to manage different configurations of components, 
e.g. having a CountryMapper, CityMapper, PersonMapper, etc., all based on 
ConceptMapper but with different settings/models.

How would I do that in uimaFIT? Basically I would like to create components 
that just override the default values for parameters/resources.

In some cases, parameters are additionally overridden at the pipeline level 
(CPE/uimaFIT), e.g. when using a database CasConsumer where we would have 
several base configurations (e.g. annotation to DB column mappings), but then 
override the DB connection settings in the pipeline.

Having the full configuration at the pipeline level makes it much more 
difficult to manage configurations, so I would like to be able to point to a 
given component and automatically get the correct default settings.

Thanks,
Jens







uimaFIT: managing component configurations

2013-11-27 Thread Jens Grivolla

Hi,

so far we were using PEARs to manage different configurations of 
components, e.g. having a CountryMapper, CityMapper, PersonMapper, etc., 
all based on ConceptMapper but with different settings/models.


How would I do that in uimaFIT? Basically I would like to create 
components that just override the default values for parameters/resources.


In some cases, parameters are additionally overridden at the pipeline 
level (CPE/uimaFIT), e.g. when using a database CasConsumer where we 
would have several base configurations (e.g. annotation to DB column 
mappings), but then override the DB connection settings in the pipeline.


Having the full configuration at the pipeline level makes it much more 
difficult to manage configurations, so I would like to be able to point 
to a given component and automatically get the correct default settings.


Thanks,
Jens



Re: uimaFIT: external resource bindings?

2013-10-24 Thread Jens Grivolla
Ok, I guess I don't actually need to do that, ConceptMapper only looks 
for the key and doesn't seem to know about the indirect binding, right?


And in uimaFIT if I want to bind the same resource to several AEs I use 
createExternalResourceDescription() and then just pass it like any other 
parameter to createEngineDescription()?
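
I.e. something along these lines, where MyDictionary, MyMapperAnnotator and the
"Dictionary" key are just stand-ins for the real ConceptMapper resource and key
names:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.factory.ExternalResourceFactory.createExternalResourceDescription;

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.descriptor.ExternalResource;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.DataResource;
import org.apache.uima.resource.ExternalResourceDescription;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.resource.SharedResourceObject;

public final class SharedDictionaryExample {

  // Stand-in for the shared dictionary resource.
  public static class MyDictionary implements SharedResourceObject {
    @Override
    public void load(DataResource data) throws ResourceInitializationException {
      // parse data.getInputStream() here
    }
  }

  // Stand-in annotator that declares the dictionary as an external resource.
  public static class MyMapperAnnotator extends JCasAnnotator_ImplBase {
    @ExternalResource(key = "Dictionary")
    private MyDictionary dictionary;

    @Override
    public void process(JCas jcas) {
      // look up the document text in the dictionary
    }
  }

  public static AnalysisEngineDescription[] twoMappers() throws ResourceInitializationException {
    // One resource description, bound to the same key in two engine descriptions.
    ExternalResourceDescription dict = createExternalResourceDescription(
        MyDictionary.class, "file:dictionaries/countries.xml");
    AnalysisEngineDescription country =
        createEngineDescription(MyMapperAnnotator.class, "Dictionary", dict);
    AnalysisEngineDescription city =
        createEngineDescription(MyMapperAnnotator.class, "Dictionary", dict);
    return new AnalysisEngineDescription[] { country, city };
  }
}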


Bye,
Jens

On 10/24/2013 05:28 PM, Jens Grivolla wrote:

Hi, I'm trying to run ConceptMapper from uimaFIT, but
createDependencyAndBind doesn't seem to allow to separate declaring the
external resource (with a name) and binding that name to a key. I
looked through ExternalResourceFactory but didn't find any method that
seems to obviously do what I need. What should I do?

Btw, I updated ConceptMapper to be based on JCasAnnotator_ImplBase
instead of Annotator_ImplBase and TextAnnotator (both deprecated).

Bye,
Jens







next UIMA workshop?

2013-10-21 Thread Jens Grivolla
Hi, at GSCL 2013 we talked a bit about options for the next UIMA 
workshop. How about trying to have it at COLING 2014?


WORKSHOP TIMELINE
• 19th January 2014: Workshop proposals due
• 26th January 2014: Notification of workshop acceptances
• 18th July 2014: Camera-ready deadline for workshop proceedings
• 23rd and 24th August 2014: COLING Workshops

http://www.coling-2014.org/workshop-call.php

So that would be approximately one year after the GSCL workshop which 
would probably give enough time for people to have new things to 
present, and there are still 3 months before submitting the workshop 
proposal.


COLING is going to be in Dublin, which makes it relatively easy to 
attend for the European UIMA community.


What do you think?

Bye,
Jens



Re: Working with very large text documents

2013-10-18 Thread Jens Grivolla

On 10/18/2013 10:06 AM, Armin Wegner wrote:


What are you doing with very large text documents in an UIMA Pipeline, for 
example 9 GB in size.


Just out of curiosity, how can you possibly have 9GB of text that 
represents one document? From a quick look at Project Gutenberg it seems 
that a full book with HTML markup is about 500kB to 1MB, so that's about 
a complete public library full of books.


Bye,
Jens



Re: AW: Working with very large text documents

2013-10-18 Thread Jens Grivolla
Ok, but then log files are usually very easy to split since they 
normally consist of independent lines. So you could just have one 
document per day or whatever gets it down to a reasonable size, without 
the risk of breaking grammatical or semantic relationships.


On 10/18/2013 12:25 PM, Armin Wegner wrote:

Hi Jens,

It's a log file.

Cheers,
Armin

-Ursprüngliche Nachricht-
Von: Jens Grivolla [mailto:j+...@grivolla.net]
Gesendet: Freitag, 18. Oktober 2013 11:05
An: user@uima.apache.org
Betreff: Re: Working with very large text documents

On 10/18/2013 10:06 AM, Armin Wegner wrote:


What are you doing with very large text documents in an UIMA Pipeline, for 
example 9 GB in size.


Just out of curiosity, how can you possibly have 9GB of text that represents one 
document? From a quick look at Project Gutenberg it seems that a full book with 
HTML markup is about 500kB to 1MB, so that's about a complete public library 
full of books.

Bye,
Jens






Re: uimafit maven plugin: type system imports?

2013-10-14 Thread Jens Grivolla
I gave up on integrating uimaFIT-based builds with PEAR packaging, there 
are fundamental differences that I don't know how to resolve cleanly, in 
particular:


uimaFIT: 1 maven artifact = N analysis engines = N generated descriptors

PEAR packaging maven plugin: 1 mvn artifact = 1 AE = 1 descriptor = 1 
generated PEAR


I don't think it's worth it to extend the PEAR packaging maven plugin to 
generate multiple PEARs, so we'll just stick with having PEAR packaging 
as something separate.


I'm actually thinking of separating components packaged as PEARs (as 
described by the XML descriptors) from analysis engines (the actual 
code) packaged as JARs, with separate namespaces. That's pretty much the 
separation we have right now, but without the separate namespaces. In 
that case it would be clear that a component is basically a packaged 
engine (with parameter settings, etc.).


I created UIMA-3346 (https://issues.apache.org/jira/browse/UIMA-3346) as 
for other descriptor based workflows it would still be very useful to 
have automatically generated descriptors that are ready to use with type 
system imports.


Bye,
Jens

On 10/08/2013 12:04 PM, Jens Grivolla wrote:

Hi, I'm still having some other problems in getting it to work well with
the pear packaging plugin (naming conventions, descriptor locations,
etc.), so I'm not sure if I can create a fully automated build.

It would still be nice to not have to edit the descriptor manually, but
since I have to do some manual steps anyway it's not as important to get
it fixed right now.

I'll create the feature request anyway, as it would be quite useful for
people using CPE, UIMA-AS or other descriptor-based deployments...

Bye,
Jens

On 10/04/2013 01:36 PM, Richard Eckart de Castilho wrote:

It is a known gap. I deliberately left this out of the current
version because the auto-detect mechanism (types.txt) may detect much
more than the component needs. Input/output capabilities are also not
a reliable source of information, in particular for components in
which types are configured via parameters.

I don't think it would be difficult to add. Please open a feature
request if you need this, along with a motivation. If you can spare
the time, patches are surely welcome. It would probably be good to
have this enabled by default, but allow to disable it.

Cheers,

-- Richard

On 04.10.2013, at 12:41, Jens Grivolla
j+...@grivolla.net wrote:


Hi,

I tried using the uimafit maven plugin, in particular the generate
goal (trying to make it play nice with the pear packaging plugin).
However, the generated descriptor does not include the type system
imports, even though they are specified through types.txt.

Is there some way to get those imports in the descriptor?

Thanks,
Jens


Re: uimafit maven plugin: type system imports?

2013-10-14 Thread Jens Grivolla
I don't think having more than one AE per PEAR would work, so the only 
solution would be to generate several PEARs from one project / maven 
module. This would introduce considerable additional complexity (it 
would have to discover all available components, etc.), and at least for 
us it's not worth it.


Don't worry about it, having PEAR packaging as something separate 
(possibly with modifications to the descriptor, etc.) and needing some 
manual steps to do it is no big deal. We might even move away from using 
PEARs and instead use uimaFIT based pipeline assembly for most of our 
work...


Thanks for your great work,
Jens

On 10/14/2013 11:55 AM, Richard Eckart de Castilho wrote:

It would be possible to have just one AE per Maven module so uimaFIT generates 
only one descriptor.

How do you imagine to handle it if a PEAR module contains more than one AE? How 
would the PEAR work?

-- Richard

On 14.10.2013, at 11:30, Jens Grivolla j+...@grivolla.net wrote:


I gave up on integrating uimaFIT-based builds with PEAR packaging, there are 
fundamental differences that I don't know how to resolve cleanly, in particular:

uimaFIT: 1 maven artifact = N analysis engines = N generated descriptors

PEAR packaging maven plugin: 1 mvn artifact = 1 AE = 1 descriptor = 1 generated 
PEAR

I don't think it's worth it to extend the PEAR packaging maven plugin to 
generate multiple PEARs, so we'll just stick with having PEAR packaging as 
something separate.

I'm actually thinking of separating components packaged as PEARs (as described by the 
XML descriptors) from analysis engines (the actual code) packaged as JARs, with 
separate namespaces. That's pretty much the separation we have right now, but without the separate 
namespaces. In that case it would be clear that a component is basically a packaged engine (with 
parameter settings, etc.).

I created UIMA-3346 (https://issues.apache.org/jira/browse/UIMA-3346) as for 
other descriptor based workflows it would still be very useful to have 
automatically generated descriptors that are ready to use with type system 
imports.

Bye,
Jens

On 10/08/2013 12:04 PM, Jens Grivolla wrote:

Hi, I'm still having some other problems in getting it to work well with
the pear packaging plugin (naming conventions, descriptor locations,
etc.), so I'm not sure if I can create a fully automated build.

It would still be nice to not have to edit the descriptor manually, but
since I have to do some manual steps anyway it's not as important to get
it fixed right now.

I'll create the feature request anyway, as it would be quite useful for
people using CPE, UIMA-AS or other descriptor-based deployments...

Bye,
Jens

On 10/04/2013 01:36 PM, Richard Eckart de Castilho wrote:

It is a known gap. I deliberately left this out of the current
version because the auto-detect mechanism (types.txt) may detect much
more than the component needs. Input/output capabilities are also not
a reliable source of information, in particular for components in
which types are configured via parameters.

I don't think it would be difficult to add. Please open a feature
request if you need this, along with a motivation. If you can spare
the time, patches are surely welcome. It would probably be good to
have this enabled by default, but allow to disable it.

Cheers,

-- Richard

On 04.10.2013, at 12:41, Jens Grivolla
j+...@grivolla.net wrote:


Hi,

I tried using the uimafit maven plugin, in particular the generate
goal (trying to make it play nice with the pear packaging plugin).
However, the generated descriptor does not include the type system
imports, even though they are specified through types.txt.

Is there some way to get those imports in the descriptor?

Thanks,
Jens


Re: Designing collection readers: Reading multiple XML files containing multiple CASes

2013-10-10 Thread Jens Grivolla
It sounds to me like it would be much easier to just have a custom 
collection reader that outputs one CAS per document (i.e. multiple CASes 
per input file), rather than having a CR that outputs one CAS per file 
(with just metadata) plus an additional AE to generate the real CASes 
from there.
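
In outline such a reader could look like this (file discovery and the
schema-specific XML parsing are elided as placeholders):

import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.fit.component.JCasCollectionReader_ImplBase;
import org.apache.uima.jcas.JCas;
import org.apache.uima.util.Progress;
import org.apache.uima.util.ProgressImpl;

public class SectionedXmlReader extends JCasCollectionReader_ImplBase {

  private final Iterator<String> files = listInputFiles();  // placeholder discovery
  private final Queue<String> pendingSections = new LinkedList<String>();
  private int emitted = 0;

  @Override
  public boolean hasNext() {
    // Refill the section queue from the next file until something is pending.
    while (pendingSections.isEmpty() && files.hasNext()) {
      pendingSections.addAll(parseSections(files.next()));
    }
    return !pendingSections.isEmpty();
  }

  @Override
  public void getNext(JCas jcas) throws CollectionException {
    // One CAS per section, so a single input file can yield many CASes.
    jcas.setDocumentText(pendingSections.poll());
    emitted++;
  }

  @Override
  public Progress[] getProgress() {
    return new Progress[] { new ProgressImpl(emitted, -1, Progress.ENTITIES) };
  }

  private Iterator<String> listInputFiles() {
    return Collections.<String>emptyList().iterator();  // placeholder
  }

  private List<String> parseSections(String fileOrUrl) {
    return Collections.emptyList();  // placeholder for the schema-specific parser
  }
}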


Do you have a specific reason for not simply writing a Collection Reader 
that does what you want?


Bye,
Jens

On 10/07/2013 03:19 AM, swirl wrote:

Hi,
I am wondering if anyone has a better idea.

Requirement:
a. I have a pipeline that needs to process a bunch of XML files.
b. The XML files could be on the disk, or from a remote location (available
via a HTTP GET call, e.g. http://example.com/inputFiles/001.xml)
c. Each XML file contains multiple sections; each section's content should be
parsed to produce a separate CAS
d. I need to be able to parse XML of different schemas. Although the assumption
is that each pipeline run can only handle one specific XML schema. That is, I
do not need to handle different XML schema in each pipeline run.
e. With the above, I need to be able to construct a new collection reader,
parser based on specific needs of each application.
f. For e.g., I can specify that the XML files are in a disk folder, and to
use parser A to decode the specific schema of the XML files. In another
pipeline, I can specify to the collection reader a list of URLs to retrieve
some remote XML files and parse them using parser B.

Here are what I have so far:
a. I am using cleartk's UriCollectionReader to insert URIs of files into the
CAS from local disk folders and remote URIs. So far so good.
b. I created a AE UriToDocumentAnnotatorA that can reads the URI in the CAS
and parse the file according to XML schema A.
c. But the above only produce 1 CAS per XML file. Requirement c. is not
fulfilled. I need to produce multiple CASes from a single XML file. How do I
do this?

Thanks in advance.








Re: uimafit maven plugin: type system imports?

2013-10-08 Thread Jens Grivolla
Hi, I'm still having some other problems in getting it to work well with 
the pear packaging plugin (naming conventions, descriptor locations, 
etc.), so I'm not sure if I can create a fully automated build.


It would still be nice to not have to edit the descriptor manually, but 
since I have to do some manual steps anyway it's not as important to get 
it fixed right now.


I'll create the feature request anyway, as it would be quite useful for 
people using CPE, UIMA-AS or other descriptor-based deployments...


Bye,
Jens

On 10/04/2013 01:36 PM, Richard Eckart de Castilho wrote:

It is a known gap. I deliberately left this out of the current version 
because the auto-detect mechanism (types.txt) may detect much more than the component 
needs. Input/output capabilities are also not a reliable source of information, in 
particular for components in which types are configured via parameters.

I don't think it would be difficult to add. Please open a feature request if 
you need this, along with a motivation. If you can spare the time, patches are 
surely welcome. It would probably be good to have this enabled by default, but 
allow to disable it.

Cheers,

-- Richard

On 04.10.2013, at 12:41, Jens Grivolla j+...@grivolla.net wrote:


Hi,

I tried using the uimafit maven plugin, in particular the generate goal 
(trying to make it play nice with the pear packaging plugin). However, the generated 
descriptor does not include the type system imports, even though they are specified 
through types.txt.

Is there some way to get those imports in the descriptor?

Thanks,
Jens








uimafit maven plugin: type system imports?

2013-10-04 Thread Jens Grivolla

Hi,

I tried using the uimafit maven plugin, in particular the generate 
goal (trying to make it play nice with the pear packaging plugin). 
However, the generated descriptor does not include the type system 
imports, even though they are specified through types.txt.


Is there some way to get those imports in the descriptor?

Thanks,
Jens



Re: AW: Java level prerequsite upgrade?

2013-07-29 Thread Jens Grivolla

Same here, our own stuff relies on higher versions of Java anyway.

Jens

On 07/29/2013 07:55 AM, 
armin.weg...@bka.bund.de wrote:

No, not for me. You can even switch to Java 7.

Armin

-Ursprüngliche Nachricht-
Von: Marshall Schor [mailto:m...@schor.com]
Gesendet: Sonntag, 28. Juli 2013 16:05
An: uima-user
Betreff: Java level prerequsite upgrade?

Dear Users,

The UIMA developers would like to be able to start using Java 6 language 
features; of course this would require users to be running this level or later.

Currently, we require only Java 5 or later.

Java 5 from various vendors is either past end-of-life or approaching it 
(meaning no updates, unless you have some special contracts).

See http://www.oracle.com/technetwork/java/eol-135779.html or 
http://www.ibm.com/software/support/lifecycle/

If we started requiring Java 6 or later, would this be an issue for you?

-Marshall Schor






Building UIMA AEs with Gradle?

2013-05-28 Thread Jens Grivolla

Hi,

we are sometimes running into problems with Maven when we want to define 
tasks to move resources into specific locations, etc. This seems to 
often lead to having to use quite a few Maven plugins and makes the POM 
hard to manage.


Would Gradle be a better option, in order to have the dependency 
management from Maven while being able to more easily define custom 
manipulations of resources to help with packaging? Is it possible to 
generate PEAR packages from Gradle? There are afaik plugins for Maven 
and Ant, so would we then reference an Ant task from Gradle?


Thanks,
Jens



AE project structure

2013-05-28 Thread Jens Grivolla

Hi,

we currently (almost always) use the CPE to run our AEs (packaged as 
PEARs and then installed). However, we would like to start packaging our 
AEs differently to make it easier to also use them programatically, or 
e.g. include them in Solr using SolrUima. To do so we have started to 
modify some of our annotators so they load their resources from the 
classpath instead of using a file path and are getting closer to being 
able to package everything in JAR files.
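
The change itself is small, roughly (the resource name below is just an example):

import java.io.IOException;
import java.io.InputStream;

public final class ClasspathResources {

  // Load a resource (e.g. a model file) from the classpath instead of a file path.
  public static InputStream open(String name) throws IOException {
    InputStream in = Thread.currentThread().getContextClassLoader().getResourceAsStream(name);
    if (in == null) {
      throw new IOException("Resource not found on classpath: " + name);
    }
    return in;
  }
}

An annotator's initialize() would then call e.g.
ClasspathResources.open("models/spanish-pos-model.bin"), and the model only
needs to end up under src/main/resources so it gets packaged into the JAR.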


However, the standard UIMA project structure puts things quite 
differently from a typical Maven layout, meaning that there's quite a 
bit of tweaking to make things fit with being both resolvable from the 
classpath and staying close to the UIMA structure. Should we just forget 
about uima.datapath and the /resource and /desc folders and put it all 
in /src/main/resources etc.? How compatible would that be with the 
PearPackagingMavenPlugin?


I think we will move to using UimaFit once it is released, but for some 
of the people here being able to have readily packaged PEAR files with 
descriptors that can be distributed is a big advantage that we don't 
want to give up.


Thanks,
Jens



managing resources for UIMA?

2013-05-22 Thread Jens Grivolla
Hi, while not strictly a UIMA issue, we have a problem that seems very 
relevant in the context of UIMA analysis engines: how to manage large 
binary resources such as trained models used by an AE, etc.


So far, we have managed to achieve a good separation between code 
development and the actual AEs, using Maven (and git for version 
control). An AE thus consists only of a POM referencing the code, the AE 
descriptor, and the resources used for the AE. The AE poms are 
configured to generate PEAR archives that include all dependencies and 
resources.


At this point we have the code in git, and the AEs' pom and descriptor 
also, while we manually copy the resources to the directory before 
running `mvn package` (and exclude those resources from git). We're 
missing a way to manage those resources, including versioning etc.


I'm guessing that this is a rather typical problem, so what solutions do 
you use? We're thinking of having all resources also in Maven (e.g. 
Artifactory) so we can reference them with a unique identifier and 
version. This would also help us when moving to more complex pipeline 
assemblies using uimafit instead of generating individual PEARS for each 
component in order to create complete packages.


Btw, we are just very few core developers, with most of the team made up 
of linguists, so we want to make it easy for them to save versions of 
resources they create and assemble AEs by just referencing the algorithm 
and resource (e.g. create a new OpenNLP POStagger using 
spanish-pos-model.bin, version 1.2.3).


Thanks for sharing your experiences with this...

Jens



Re: Does the UIMA pipeline support analysis components written as mahout map-reduce jobs

2013-02-15 Thread Jens Grivolla
What do you want to do? Map-reduce is batch processing, whereas a UIMA 
AE works online, so this doesn't really fit.


In Mahout map-reduce is usually used for training, not e.g. for applying 
a trained classifier. So you would train whichever way you want (e.g. 
using map-reduce, etc.), but your UIMA AE would actually be a wrapper 
for an online classifier, not a map-reduce task.
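
To illustrate, a rough sketch of such a wrapper; the OnlineClassifier
interface and loadTrainedClassifier() are made-up placeholders, not actual
Mahout API:

import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

public class ClassifierWrapperAnnotator extends JCasAnnotator_ImplBase {

  /** Placeholder for an online classifier API; not a real Mahout class. */
  interface OnlineClassifier {
    String classify(String text);
  }

  private OnlineClassifier classifier;

  @Override
  public void initialize(UimaContext context) throws ResourceInitializationException {
    super.initialize(context);
    // Load the model that was trained offline (e.g. by a Mahout map-reduce job).
    classifier = loadTrainedClassifier();
  }

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    // The AE only applies the trained model, one CAS at a time.
    String label = classifier.classify(jcas.getDocumentText());
    // ... store the label in the CAS, e.g. as a document-level annotation ...
  }

  private OnlineClassifier loadTrainedClassifier() {
    // Hypothetical: deserialize the trained model from the classpath or a file.
    return text -> "some-label";
  }
}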


Best,
Jens

On 02/13/2013 11:47 PM, Som Satpathy wrote:

Hi all,

I have been toying around with UIMA pipelines for some time now. I was
wondering if UIMA can support analysis components written as mahout
map-reduce jobs as part of a UIMA pipeline ?

I would appreciate any help/hints/pointers.

Thanks,
Som






graphical flow configuration in UIMA-HPC?

2013-01-08 Thread Jens Grivolla

Hi,

the UIMA-HPC page contains a nice screenshot of what looks like a 
graphical tool for configuring UIMA flows. Is it (or anything like it) 
available to the public?


Thanks,
Jens



UIMA for multimodal annotation?

2013-01-07 Thread Jens Grivolla

Hi,

we're thinking of using UIMA for multimodal multimedia annotation (text, 
video, audio, ...), but have found little information of people actually 
doing that. I did find an old post by Burn Lewis about donating the GALE 
type system (Donation of a widely used type system for multi-modal text 
analysis) but not much more.


Thanks,
Jens



Re: Parallel CAS consumer

2012-10-10 Thread Jens Grivolla

Hi all,

from what I understand this does not involve CAS multipliers at all, but 
simply a flow where all CAS consumers are done in one parallel step.


Apparently this can't be done in a CPE so you would need an aggregate of 
all the CAS consumers, and have a parallel flow controller for that 
aggregate.


However, that wouldn't really do any good according to the 
documentation: "ParallelStep, which specifies that multiple Analysis 
Engines should receive the CAS next, and that the relative order in 
which these Analysis Engines execute does not matter. Logically, they 
can run in parallel. The runtime is not obligated to actually execute 
them in parallel, however, and the current implementation will execute 
them serially in an arbitrary order."
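
For completeness, a minimal sketch of a flow controller that sends every
CAS to all delegates of the aggregate in one ParallelStep; illustrative
only, and as the quote above says, the standard runtime may still execute
the delegates serially:

import java.util.ArrayList;
import java.util.List;

import org.apache.uima.flow.FinalStep;
import org.apache.uima.flow.Flow;
import org.apache.uima.flow.JCasFlowController_ImplBase;
import org.apache.uima.flow.JCasFlow_ImplBase;
import org.apache.uima.flow.ParallelStep;
import org.apache.uima.flow.Step;
import org.apache.uima.jcas.JCas;

public class AllDelegatesParallelFlowController extends JCasFlowController_ImplBase {

  @Override
  public Flow computeFlow(JCas jcas) {
    // All delegate keys of the aggregate (e.g. the CAS consumers).
    final List<String> delegateKeys =
        new ArrayList<String>(getContext().getAnalysisEngineMetaDataMap().keySet());

    return new JCasFlow_ImplBase() {
      private boolean done = false;

      @Override
      public Step next() {
        if (done) {
          return new FinalStep();
        }
        done = true;
        // One ParallelStep containing all delegates; order is declared irrelevant.
        return new ParallelStep(delegateKeys);
      }
    };
  }
}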


Best,
Jens

On 10/10/2012 12:39 PM, Richard Eckart de Castilho wrote:

Hi,

I see. I think this is not possible. To my knowledge the CPE (which you probably 
use) does not support CAS multipliers. I'm not too familiar with UIMA-AS; are 
you sure that it supports such a scenario?

If you manage to realize the scenario as you described, it would be great 
to hear how you did it.

Best,

-- Richard

On 10.10.2012 at 12:15, Timo Boehme timo.boe...@ontochem.com wrote:


Hi,

On 10.10.2012 at 12:05, Richard Eckart de Castilho wrote:

the main difference between CAS consumers and analysis engines is
that the former by default run only a single instance and the latter
can be multiplied. If your consumer code can be run in parallel, just
try inheriting from AnalysisEngine_ImplBase (or something like that)
instead.


Thanks for your answer. However, each single consumer must run as a single 
instance (e.g. one database consumer, one consumer writing to a file; each of 
them needs to run as a single instance). Thus I would like to have a single 
instance per consumer, but the different consumers should run in parallel.


Kind regards,
Timo


On 10.10.2012 at 12:00, Timo Boehme timo.boe...@ontochem.com wrote:


Hi,

is there any possibility, without using UIMA-AS, to run different CAS consumer 
components of a pipeline in parallel?
The standard behavior is that the consumers are called in sequence, but since in 
my case they don't depend on each other it would be more efficient to have them 
run in parallel. Can I use a CAS multiplier + flow control to achieve this?








Re: Clustering, Collapsing

2012-06-11 Thread Jens Grivolla

This sounds like you are actually looking for the project next door: Mahout.

UIMA really doesn't have a lot to do with clustering (although you could 
do some things). We do use UIMA for information extraction *before* 
clustering and sending it to Solr, though, as a sort of preprocessing to 
get relevant features from unstructured text. But it doesn't sound like 
that's what you're trying to do.


HTH,
Jens

On 06/08/2012 05:44 PM, Deejay wrote:

Hi all,

I recently discovered Apache UIMA, and it looks like a very large project! I
was hoping that someone more experienced with it than I could comment on
whether there are parts of the project that could help with my problem.

I need to go over many millions of objects (Protocol Buffers in HBase, as it
happens), and cluster them according to their similarity. Once each cluster is
formed, I need to 'collapse' each property of the objects to find the most
prevalent value. After this, the collapsed object will be added to a Solr
index.

Would any part of Apache UIMA be useful for the clustering or collapsing, or
have I misunderstood the nature of the project?







Re: Repackaging an unpackaged pear file

2012-04-26 Thread Jens Grivolla
We actually do that all the time; it works perfectly. Some archive 
managers even let you edit the file without unpacking it. You may need 
to rename it from .pear to .zip and back to .pear when you're done.
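
If you want to script the edit instead, something along these lines should
work, since a PEAR is just a zip archive; the entry name and the replaced
path below are example values only:

import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;

public class EditPearInPlace {

  public static void main(String[] args) throws Exception {
    // A PEAR is a plain zip archive, so the zip filesystem provider can open
    // it directly; no renaming to .zip is needed when scripting it.
    URI pearUri = URI.create("jar:" + Paths.get("ae1.pear").toUri());
    try (FileSystem zip =
        FileSystems.newFileSystem(pearUri, Collections.<String, Object>emptyMap())) {
      // Example entry and replacement only; adjust to the file that holds the bad path.
      Path descriptor = zip.getPath("metadata/install.xml");
      String content = new String(Files.readAllBytes(descriptor), StandardCharsets.UTF_8);
      content = content.replace("C:/path/on/the/other/machine", "$main_root");
      Files.write(descriptor, content.getBytes(StandardCharsets.UTF_8));
    }
  }
}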


Jens

On 04/26/2012 06:10 PM, Marshall Schor wrote:

Thanks Thilo.

Could you unzip the pear with an unzipper, and do the change to fix the
file path and then zip it back up again? That way the variable
replacement stuff wouldn't run.

-Marshall

On 4/26/2012 5:07 AM, Thilo Goetz wrote:

On 25/04/12 23:20, Marshall Schor wrote:

I hope its trivial :-) (But I haven't tried it...).

It's not trivial, because the pear installer destructively
replaces variables with local paths on installation. If
you don't know what you're doing, it will be much easier to
ask the other team to get you the original pear file.

There is no supported way to repackage an installed pear
file.

--Thilo


-Marshall

On 4/25/2012 1:15 PM, Mike O'Leary wrote:

I received a copy of an application that works with UIMA a few weeks ago
from some colleagues at another location. When I followed the instructions
to install it, I got an error message while unpacking a pear file, and it
looks like an XML file within it contains some hard-coded pathnames to a
machine at the organization that sent our colleagues the application
originally. I could ask them to get in touch with the organization and ask
them to recreate the pear file with relative pathnames so it can be
installed on machines on other networks, and I probably will do that. But I
was wondering how hard it would be just to correct the pathnames,
re-package the pear file, and reinstall that one. I have never worked with
UIMA before, so I am learning the basics as I go. How complicated would it
be to create an Eclipse project using the directory structure that the pear
file expanded to, or to run a command line application that creates a pear
file from that directory structure?
Thanks,
Mike











Re: Unusable Document Analyzer because of too small font sizes

2012-03-27 Thread Jens Grivolla

On 03/25/2012 03:35 PM, Eric Buist wrote:

[UIMA chooses bad look-and-feel on some platforms]
Fortunately, I found a workaround: pass
-Dswing.systemlaf=com.sun.java.swing.plaf.gtk.GTKLookAndFeel in the JVM
arguments. That overrides the bad guess of the JVM and fallback to
Metal. Note that the usual property name would be swing.defaultlaf, but
I had to use swing.systemlaf because of the DocAnalyzer activating the
system look and feel (this would have normally worked, though).

If I could find a way to have this passed all the time, without having
to change the launch configuration of each UIMA tool, I would be
happier, but this is already a very good step forward. At least I am not
blocked anymore by this, and can continue exploring under Linux.


Pretty much all UIMA tools that you run from the command line go 
through runUimaClass.sh, so whatever settings you make there should apply 
universally.


HTH,
Jens



InlineXMLCasConsumer fails depending on locale

2012-02-21 Thread Jens Grivolla

Hi,

it appears that InlineXMLCasConsumer depends on the system locale for 
some internal transformations. The output appears to be written as UTF-8 
(outStream.write(xmlAnnotations.getBytes("UTF-8"));), but when used on a 
machine with an ASCII locale all accented characters get broken.


I suspect that it has to do with the XMLSerializer working on a 
ByteArrayOutputStream, but haven't been able to track it down yet.
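
To illustrate the kind of pitfall I mean (this is not the actual consumer
code, just the general pattern): converting the intermediate byte stream
back to a String without naming a charset picks up the platform default,
which breaks on an ASCII locale:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class CharsetPitfall {

  // Sketch of the suspected pattern, not the actual InlineXMLCasConsumer code.
  static void writeXml(ByteArrayOutputStream byteStream, OutputStream outStream)
      throws IOException {
    // Bad: uses the platform default charset, so accented characters break
    // on a machine whose default charset is ASCII.
    String broken = byteStream.toString();

    // Safe: pin the charset at every byte<->String conversion.
    String ok = byteStream.toString("UTF-8");
    outStream.write(ok.getBytes("UTF-8"));
  }
}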


Any ideas?

Bye,
Jens



Re: InlineXMLCasConsumer fails depending on locale

2012-02-21 Thread Jens Grivolla

On 02/21/2012 04:08 PM, Thilo Goetz wrote:

On 21/02/12 15:59, Jens Grivolla wrote:

it appears that InlineXMLCasConsumer depends on the system locale for
some internal transformations. The output appears to be written as UTF-8
(outStream.write(xmlAnnotations.getBytes("UTF-8"));), but when used on a
machine with an ASCII locale all accented characters get broken.

I suspect that it has to do with the XMLSerializer working on a
ByteArrayOutputStream, but haven't been able to track it down yet.


Have you checked that it's really the writing end where things
get corrupted, and not the reading end?  Just a thought...


Yes, we have an XmiWriterCasConsumer in parallel that works fine.

Jens



Re: UIMA Python integration?

2011-12-16 Thread Jens Grivolla

Hi Nicolas,

we haven't really made any progress. Right now we're using only Java 
within the UIMA pipeline (and one C++ annotator).


We then generate XMIs (or in some cases inline XML to get annotations 
aligned automatically) and work on that in Python, without a library and 
probably not even dealing with the XMI format entirely correctly. :-(


Anyway, things are not pretty, but we just don't have time to actually 
develop a better solution.


Bye,
Jens

On 12/16/2011 12:13 AM, Nicolas Hernandez wrote:

Hi

6 months later.

Jens, what experience have you gained with UIMA and Python? Is
Pythonnator still the simplest solution for working with XMI?
No other alternative?

Best

/Nicolas



On Wed, May 4, 2011 at 11:48 PM, Eddie Epstein eaepst...@gmail.com wrote:

The last update with uimapy on Apache UIMA was that it had problems
deserializing somewhat complex XmiCas examples.

The previous problem with jython was that it was backlevel relative to
the needs of some python analytic code. Jython seems like the simplest
integration, assuming it works.

The Pythonnator requires a uimacpp runtime. More complicated, but
perhaps a much faster python execution environment? Uimacpp fully
supports XmiCas serialization methods.

Eddie

On Wed, May 4, 2011 at 4:42 AM, Jens Grivolla j+...@grivolla.net wrote:

Hi,

what's the current status on combining UIMA and Python?

I know that it should be possible to write AEs in Python using either the
BSF Annotator (and jython) or Pythonnator (using SWIG).  I haven't tried
either one yet, so I'm open to recommendations on which to use.

I would also very much like to write UIMA (and especially UIMA AS) clients
in Python.  Is it possible at all to use an annotation pipeline from a
language other than Java?  We are currently using the simple REST server for
this, but it has serious limitations.

Lastly, and probably more simply, I would like to be able to work with XMI
files using Python.  There used to be uimapy by Ed Loper, but I can't find a
copy anywhere and the sourceforge repository is empty.  I found no mention on
the mailing list of what happened to the project, and the discussion about it
seems to have just ended quite abruptly.

Thanks for any suggestions or hints,

Jens









Re: Setting up third party libraries for UIMA AS application

2011-11-16 Thread Jens Grivolla

Hi,

that's basically what we are doing, too.

If the PEAR is configured correctly, the CLASSPATH and uima.datapath 
should appear in install.xml and setenv.txt, and you could use those to 
set your classpath in your executor.bat. You would then avoid having to 
define path_to_my_third_party_libraries separately.


Unfortunately, it is not always possible (in Linux) to just `source 
setenv.txt` to set the environment variables, because it fails on the 
uima.datapath assignment (I believe it would work with UIMA_DATAPATH). 
So it is often still necessary to adapt the launch script depending on 
the component you are working with. I have no idea how this behaves on Windows.


If there are any better suggestions, I'd be interested also.

Bye,
Jens

On 11/16/2011 01:50 PM, Spico Florin wrote:

Hello, Jens!
   In order to solve the problem, I've created a batch script named
executor.bat where I've added the following lines:

@set UIMA_CLASSPATH=%UIMA_CLASSPATH%;path_to_my_third_party_libraries
deployAsyncService.cmd my_deployment_descriptor_for_as.xml

I've put this script in a folder my_project/deploy/as/ and then I've set up
the path_to_my_third_party_libraries
with the relative paths to the lib and bin folders of the project,
i.e.
@set UIMA_CLASSPATH=%UIMA_CLASSPATH%;../../lib;../../bin

The structure of the project after installing the pear file will look like
this:
installed
|
|-uima-pipeline
   |-bin
   |-deploy
   |  |-as
   |     |-executor.bat (from here we will execute the script)
   |-lib
   |  |-third_party_library.jar
   |-descriptors
   |-metadata
   |-resources

I don't know whether the above is a proper solution, but it worked for me.
Therefore, I have the following question:
How can I use the variables set up in install.xml and setenv.txt in my
executor.bat script?

I look forward to your answer,

Thank you.
Regards,
  Florin







On Tue, Nov 15, 2011 at 4:21 PM, Jens Grivolla j+...@grivolla.net wrote:


On 11/15/2011 02:55 PM, Spico Florin wrote:


Hello!
   I have a UIMA AS application that is using third-party libraries. I
would like to know the following:
1. Where (in which location) can we add these third-party libraries so that
the deployed application is aware of them and does not throw a
ClassNotFoundException?
A brute-force solution for me was to add them directly to the UIMA AS
lib/ folder, but this solution was just for testing and is not acceptable
in production.
2. How can these third-party libraries be set up when generating the PEAR
file, in such a way that deploying the application will take them into
account and it won't be necessary to add them to the classpath manually?



UIMA AS doesn't directly support PEAR files. You will have to install the
pear and set the classpath when you deploy it to UIMA AS.

Where to put libraries so they will be correctly referenced in the PEAR
(i.e. they are included in the install.xml and setenv.txt) depends on how
you build the PEAR. You may need to include the libraries in your Eclipse
build path, or put them in a directory that your Maven configuration
includes when building the PEAR.

HTH,
Jens









Re: Setting up third party libraries for UIMA AS application

2011-11-15 Thread Jens Grivolla

On 11/15/2011 02:55 PM, Spico Florin wrote:

Hello!
   I have a UIMA AS application that is using third-party libraries. I
would like to know the following:
1. Where (in which location) can we add these third-party libraries so that
the deployed application is aware of them and does not throw a
ClassNotFoundException?
A brute-force solution for me was to add them directly to the UIMA AS
lib/ folder, but this solution was just for testing and is not acceptable
in production.
2. How can these third-party libraries be set up when generating the PEAR
file, in such a way that deploying the application will take them into
account and it won't be necessary to add them to the classpath manually?


UIMA AS doesn't directly support PEAR files. You will have to install 
the pear and set the classpath when you deploy it to UIMA AS.


Where to put libraries so they will be correctly referenced in the PEAR 
(i.e. they are included in the install.xml and setenv.txt) depends on 
how you build the PEAR. You may need to include the libraries in your 
Eclipse build path, or put them in a directory that your Maven 
configuration includes when building the PEAR.


HTH,
Jens



Re: Running Collection Processor Engine as Rest Web Service

2011-11-02 Thread Jens Grivolla
I'm not sure how you would want to expose that functionality.  Since 
input and output would be done through the API, those are basically your 
Reader and your Consumer. How would you expose other CollectionReaders 
and CasConsumers as a web service?


AAEs are obviously no problem, since they are nothing but a specific 
type of Analysis Engine.


Bye,
Jens

On 11/02/2011 04:27 PM, Spico Florin wrote:

Hello!
   I would like to know if it is possible in UIMA to run a CPE as a REST Web
Service. I've read that you can expose the Analysis Engine (AE). I'm not
sure whether a CASReader, CASConsumer, Aggregate Analysis Engine, or CPE can
also be exposed as a REST Web Service. Can you please provide some examples
of how to do this?
   I look forward to your answers.

Thank you.

Regards,
   Florin






Re: PEAR packaging and maven

2011-05-27 Thread Jens Grivolla

On 05/26/2011 08:37 PM, Greg Holmberg wrote:


[...] What I want may simply be outside the design target of PEAR files. My
expectations of PEAR files were based on how other archive formats in
Java work: JAR files, WAR files, etc. These can all be used in-place,
without any re-writing of their contents. You can just refer to them,
and the system can locate the things in them at run-time through
relative paths, regardless of what directory they've been dropped into.
In other words, there's no installation process for JAR files or WAR files.  
[...]


At least in Tomcat, WAR files actually get unzipped before use, and the 
UIMA SimpleRestServer also installs PEAR files on the fly.  Using 
SimpleRestServer you actually have one WAR file which contains a PEAR 
file, and when you deploy it in Tomcat, it automatically unzips the WAR 
and then (I believe on first call to the service) installs the PEAR. 
The descriptor used in the WAR references the PEAR file directly.


I believe that WARs only get installed when they haven't been installed 
yet, and I would hope the same is true for the PEAR installation, so 
there's no overhead from installing on each run of the pipeline.


It still rewrites $main_root, etc., but it gets pretty close to a 
transparent use of PEARs.  At least for UIMA-AS a similar deployment 
scheme would make sense, and one could substitute "install pear, adjust 
classpath, deploy pointing to the pear descriptor" with simply "deploy 
pointing to the pear file", which would be much nicer, as we could skip all 
the launch scripts we are currently creating.  For other uses one might 
have to think more about automatically cleaning up the unzipped 
directory, etc.


Bye,
Jens



Re: Cas Editor: group annotation types by namespace?

2011-05-10 Thread Jens Grivolla

On 05/10/2011 10:13 AM, Richard Eckart de Castilho wrote:
 [package names vs. type hierarchy]

For a technically-oriented user, the package names are probably
better. But for a linguist or knowledge-engineer, I am pretty sure
that the inheritance hierarchy is more interesting. One dives down to
the particular level at which one can still make a distinction and
then stops.


Yes, I guess that depends how you use the inheritance.  In our case we 
go from the general UIMA Annotation to our own generic Annotation type 
that adds a few features, then the generic manual Annotation with 
features specific to human annotations, etc.  So the inheritance is 
purely technical and implies no semantic hierarchy, and it makes no 
sense at all to a human annotator to go through all those levels that 
are completely meaningless to them.



I think it would be good to offer both approaches, maybe on different
key-bindings and/or different sub menus reachable from the context
menu.


I agree that maintaining the old behaviour for your use case makes 
sense, so we would need either two menus or a project-wide preference.


Jens



Cas Editor: group annotation types by namespace?

2011-05-09 Thread Jens Grivolla

Hi,

I was wondering if it wouldn't be more useful to group annotation types 
in the mode and similar menus by namespace rather than inheritance.


I don't think most users care much about supertypes, and mostly don't 
know about them, whereas the namespace seems to me to be a more natural 
way to organize the menus.


I think a flat top level with all used namespaces would work quite well, 
and the submenu with the annotation type names would not need to include 
the prefix.  What do you think?


Jens



Re: Cas Editor: selecting annotation type

2011-05-06 Thread Jens Grivolla

On 05/05/2011 09:30 PM, Jörn Kottmann wrote:

On 5/5/11 6:09 PM, Jens Grivolla wrote:

On 05/05/2011 03:04 PM, Jörn Kottmann wrote:

On 5/5/11 2:41 PM, Jens Grivolla wrote:

At least on my system (Eclipse Helios on Ubuntu 10.10) the Shift+Enter
shortcut does not work, and will be treated as an unmodified Enter,
i.e. no selection list appears. I haven't tried yet on other systems
because I need to install the updated plugins first.


Ok, I will investigate that. But then this was not the system where you
experienced the hang issue in the 2.3.1 version?


As you said, the freeze was due to the shortcut creation when the type
system is too big, and it occurred on all machines.

I sometimes have to press return twice to get a quick annotation,
too, and on a different machine (Eclipse Helios on Windows XP) it
worked even less, to the point that I had to use the context menu.


I opened a jira for the shortcut issue and fixed it; it would be nice if
you could test. I believe the issue is related to a recently defined command
and key binding in the plugin.xml. I also did this now for the quick type
selection dialog shortcut.

Here is the jira:
https://issues.apache.org/jira/browse/UIMA-2139


Shift+Enter now seems to work reliably. Plain Enter works when I select 
a word via double click, but has problems when I select a text span (on 
my Linux machine). Shift+Enter works in that case, and plain Enter works 
after pressing Shift+Enter or just pressing any other key, e.g. Shift or 
Ctrl.



On that machine the edit view was having problems, too, and I
usually had to click on the feature name before being able to activate
the feature value field. I haven't tried Shift+Enter on that machine.


Did you run the current trunk on that machine? If so, it would be nice if
you could give me further details
about the edit view issues. What type had the feature you clicked on?
Are there exceptions in the error log?


Yes, that was running trunk with yesterday morning's fixes. 
Unfortunately, I don't have access to that machine anymore and can't 
give you any more details at this point.  We do have some other Windows 
machines though, and I will look if I find anything in the error logs 
both on Linux and Windows.



[..] Which brings me to another thing that
would be interesting for us: having preset feature values filled in
automatically. We would be using that to automatically fill in the
annotator's name on all annotations created by them.


This you can easily do when you pre-process the files you pass to the
annotator, or post-process them when they come back.


I've been thinking about that option. It would be quite easy at the
document level, but becomes a bit more complicated when each
annotation can come from a different annotator and files get passed
from one annotator to the next.


For one project I created a small plugin which just defined a view for
something similar.
It's actually not difficult to access the CAS and update it through the
Annotation Editor.


We're currently thinking of just post-processing the XMIs and adding the 
annotator name to all annotations (of the types of interest) that don't 
have a name set yet.  We'll look into doing something more sophisticated 
for the next round of annotations.
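
A rough sketch of what that post-processing could look like; the type and
feature names (org.example.ManualAnnotation, annotator) are placeholders
for our own type system:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.uima.UIMAFramework;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Feature;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.impl.XmiCasDeserializer;
import org.apache.uima.cas.impl.XmiCasSerializer;
import org.apache.uima.cas.text.AnnotationFS;
import org.apache.uima.resource.metadata.TypeSystemDescription;
import org.apache.uima.util.CasCreationUtils;
import org.apache.uima.util.XMLInputSource;
import org.apache.uima.util.XMLParser;

public class FillAnnotatorName {

  public static void main(String[] args) throws Exception {
    String typeSystemPath = args[0]; // path to the type system descriptor
    String xmiIn = args[1];
    String xmiOut = args[2];
    String annotatorName = args[3];

    // Build an empty CAS from the type system and load the XMI into it.
    XMLParser parser = UIMAFramework.getXMLParser();
    TypeSystemDescription tsd =
        parser.parseTypeSystemDescription(new XMLInputSource(typeSystemPath));
    CAS cas = CasCreationUtils.createCas(tsd, null, null);
    try (InputStream in = new FileInputStream(xmiIn)) {
      XmiCasDeserializer.deserialize(in, cas);
    }

    // Placeholder type/feature names; substitute the real manual-annotation type.
    Type manualType = cas.getTypeSystem().getType("org.example.ManualAnnotation");
    Feature annotatorFeat = manualType.getFeatureByBaseName("annotator");

    // Fill in the annotator name only where it is not set yet.
    for (AnnotationFS fs : cas.getAnnotationIndex(manualType)) {
      if (fs.getStringValue(annotatorFeat) == null) {
        fs.setStringValue(annotatorFeat, annotatorName);
      }
    }

    try (OutputStream out = new FileOutputStream(xmiOut)) {
      XmiCasSerializer.serialize(cas, out);
    }
  }
}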


Thanks a lot for your help,
Jens



Re: Cas Editor: selecting annotation type

2011-05-05 Thread Jens Grivolla

On 05/04/2011 02:44 PM, Jörn Kottmann wrote:

On 5/4/11 2:33 PM, Jens Grivolla wrote:

How do I best update to the
trunk version?


You can either build the trunk version yourself or pick up a
distribution from our build server.


I've got a local build based on trunk.


I am not sure what is the best way to update, or what will happen if an
old and a newer version are installed
into the same eclipse installation.
I would try to put the new eclipse plugins into the dropins folder, and
then see if they get loaded instead;
if not, I suggest that you remove the plugins installed via "Install new
software..."


It seems to have picked it up ok, but I'm getting errors when opening an 
XMI with the Annotation editor:


Caused by: org.eclipse.core.internal.resources.ResourceException: 
Resource '/OneOfMyClosedProjects' is not open.
	at 
org.eclipse.core.internal.resources.Project.checkAccessible(Project.java:137)

at 
org.eclipse.core.internal.resources.Project.hasNature(Project.java:511)
at 
org.apache.uima.caseditor.CasEditorPlugin.start(CasEditorPlugin.java:90)

Apparently the migration from CasEditorProjects to the new way fails 
whenever there is a closed project in the workspace.  I don't have any 
projects that need to be migrated, but it tries to check every project I 
have and fails hard when it can't.  It would be good if somebody could 
verify that before filing a bug report.


Bye,
Jens



Re: Cas Editor: selecting annotation type

2011-05-05 Thread Jens Grivolla

On 05/05/2011 12:37 PM, Jens Grivolla wrote:

I'm getting errors when opening an
XMI with the Annotation editor:

Caused by: org.eclipse.core.internal.resources.ResourceException:
Resource '/OneOfMyClosedProjects' is not open.
at
org.eclipse.core.internal.resources.Project.checkAccessible(Project.java:137)

at org.eclipse.core.internal.resources.Project.hasNature(Project.java:511)
at org.apache.uima.caseditor.CasEditorPlugin.start(CasEditorPlugin.java:90)

Apparently the migration from CasEditorProjects to the new way fails
whenever there is a closed project in the workspace. I don't have any
projects that need to be migrated, but it tries to check every project I
have and fails hard when it can't. It would be good if somebody could
verify that before filing a bug report.


It works fine after I removed the check.

- if (project.hasNature("org.apache.uima.caseditor.NLPProject")) {
+ if (false) {

Bye,
Jens



Re: Cas Editor: selecting annotation type

2011-05-05 Thread Jens Grivolla

On 05/05/2011 12:59 PM, Jörn Kottmann wrote:

On 5/5/11 12:55 PM, Jens Grivolla wrote:

On 05/05/2011 12:37 PM, Jens Grivolla wrote:

I'm getting errors when opening an
XMI with the Annotation editor:

Caused by: org.eclipse.core.internal.resources.ResourceException:
Resource '/OneOfMyClosedProjects' is not open.
at
org.eclipse.core.internal.resources.Project.checkAccessible(Project.java:137)


at
org.eclipse.core.internal.resources.Project.hasNature(Project.java:511)
at
org.apache.uima.caseditor.CasEditorPlugin.start(CasEditorPlugin.java:90)

Apparently the migration from CasEditorProjects to the new way fails
whenever there is a closed project in the workspace. I don't have any
projects that need to be migrated, but it tries to check every project I
have and fails hard when it can't. It would be good if somebody could
verify that before filing a bug report.


It works fine after I removed the check.

- if (project.hasNature("org.apache.uima.caseditor.NLPProject")) {
+ if (false) {


Yes, but that would disable the migration code.

The fix is now:
if (project.isOpen() &&
project.hasNature("org.apache.uima.caseditor.NLPProject"))


Of course, it was just to get it working as soon as possible.  I 
recompiled with your fix and reinstalled the plugins, and I see no problems.


Thanks,
Jens



Re: Cas Editor: selecting annotation type

2011-05-05 Thread Jens Grivolla

On 05/05/2011 03:04 PM, Jörn Kottmann wrote:

On 5/5/11 2:41 PM, Jens Grivolla wrote:

On 05/05/2011 01:55 PM, Jörn Kottmann wrote:

On 5/5/11 1:44 PM, Jörn Kottmann wrote:

That sounds like one more good reason to do that. Another one I thought
of is that it is confusing when you add an annotation which you cannot
see afterward.

So let's open a jira and do this enhancement.


Here is the jira:
https://issues.apache.org/jira/browse/UIMA-2137

Do you think this dialog fixes the problem you reported initially with
the editor annotation mode?


Yes, I think that would work quite well for us. One issue with setting
the shortcuts based on the full type system is that in our case at
hand some of the annotation types we need don't get assigned a shortcut.


Nice, I will try to fix this quickly for you.


Thanks, that's great.  I think that could be a significant time saver.


At least on my system (Eclipse Helios on Ubuntu 10.10) the Shift+Enter
shortcut does not work, and will be treated as an unmodified Enter,
i.e. no selection list appears. I haven't tried yet on other systems
because I need to install the updated plugins first.


Ok, I will investigate that. But then this was not the system where you
experienced the hang issue in the 2.3.1 version?


As you said, the freeze was due to the shortcut creation when the type 
system is too big, and it occurred on all machines.


I sometimes have to press return twice to get a quick annotation, too, 
and on a different machine (Eclipse Helios on Windows XP) it worked even 
less, to the point that I had to use the context menu.  On that machine 
the edit view was having problems, too, and I usually had to click on 
the feature name before being able to activate the feature value field. 
 I haven't tried Shift+Enter on that machine.



I still think it would be nice to be able to change the mode from the
Outline view, but that feature would definitely have much lower
priority then.



Yes, I also believe that could be a good place to have it, please open a
jira issue for it.


done: https://issues.apache.org/jira/browse/UIMA-2138


Do you also need to fill in feature values for each created annotation?


Yes, for many of them we do. Which brings me to another thing that
would be interesting for us: having preset feature values filled in
automatically. We would be using that to automatically fill in the
annotator's name on all annotations created by them.


This you can easily do when you pre-process the files you pass to the
annotator, or post-process them when they come back.


I've been thinking about that option.  It would be quite easy at the 
document level, but becomes a bit more complicated when each annotation 
can come from a different annotator and files get passed from one 
annotator to the next.



I believe we should start working here on tooling support for annotation
projects. There you typically have a collection of
documents which must be annotated by a team of annotators.


Yes, I think our situation is probably quite typical really.

Thanks,
Jens



UIMA Python integration?

2011-05-04 Thread Jens Grivolla

Hi,

what's the current status on combining UIMA and Python?

I know that it should be possible to write AEs in Python using either 
the BSF Annotator (and jython) or Pythonnator (using SWIG).  I haven't 
tried either one yet, so I'm open to recommendations on which to use.


I would also very much like to write UIMA (and especially UIMA AS) 
clients in Python.  Is it possible at all to use an annotation pipeline 
from a language other than Java?  We are currently using the simple REST 
server for this, but it has serious limitations.


Lastly, and probably more simply, I would like to be able to work with 
XMI files using Python.  There used to be uimapy by Ed Loper, but I 
can't find a copy anywhere and the sourceforge repository is empty.  I 
found no mention on the mailing list of what happened to the project, and 
the discussion about it seems to have just ended quite abruptly.


Thanks for any suggestions or hints,

Jens



Cas Editor: selecting annotation type

2011-05-04 Thread Jens Grivolla

Hi,

I have recently started using the Annotation Editor (as installed in 
Eclipse from http://www.apache.org/dist/uima/eclipse-update-site/, i.e. 
the official 2.3.1 version).


In order to add annotations it seems that you need to select the 
annotation type through the Mode context menu, which is quite time 
consuming (and error prone) if you have a large type system, and 
especially when the wanted type is derived through several levels of 
supertypes.  Given that you already select the types of interest through 
the Annotation Styles configuration, it would be much faster to e.g. 
select your annotation mode directly from the Outline view (which only 
contains your chosen subset).


It seems that there are quite a few changes in the trunk, but I'm not 
sure how to best use those versions, preferably without messing up my 
Eclipse configuration (which is a bit fragile when using manually 
installed plugins).


Thanks,
Jens



Re: Cas Editor: selecting annotation type

2011-05-04 Thread Jens Grivolla

On 05/04/2011 11:21 AM, Jörn Kottmann wrote:

On 5/4/11 11:10 AM, Jens Grivolla wrote:



In order to add annotations it seems that you need to select the
annotation type through the Mode context menu, which is quite time
consuming (and error prone) if you have a large type system, and
especially when the wanted type is derived through several levels of
supertypes.


You do not need to switch via the Mode context menu to add an annotation
of the desired type. The Mode type is just the type
you can annotate with the fewest key strokes. You can use Shift +
Enter to annotate a piece of text and then choose the
annotation type from a list of available types in a pop up. Each type in
this list is combined with a key shortcut. When you remember
the shortcut you can do something like Shift + Enter + p to
create an annotation,
where p is the letter shown in front of one of your annotation types.

Does that help you?


Unfortunately this consistently freezes Eclipse every time I have tried 
it, so I haven't even been able to see what it is supposed to do.  The 
keyboard shortcuts might help, if it worked.  We've tried it on several 
versions of Eclipse (all on Linux), and all freeze completely when 
pressing Shift-Return or clicking on the corresponding menu item.



I will have a look at the outline view, maybe we can add there a button
or context menu to switch the mode of the editor.


It seems that there are quite a few changes in the trunk, but I'm not
sure how to best use those versions, preferably without messing up my
Eclipse configuration (which is a bit fragile when using manually
installed plugins).


We fixed a few bugs and removed the Cas Editor Project support. I
suggest that you just create a normal eclipse project
and then place a type system at the default location.


That's what I'm doing, but with the 2.3.1 release installed via Install 
new software... in Eclipse.  How do I best update to the trunk version?


Thanks,
Jens



CR+LF = 1 character?

2011-04-20 Thread Jens Grivolla

Hi,

while working on the integration between UIMA and a different text 
annotation system we ran into problems with differing offsets between 
the two systems.


As it turns out, the other system considers CR+LF (Windows style line 
endings) to be two characters, while UIMA sees it as one.  Clearly, 
CR+LF are two bytes in one-byte-per-character encodings (ASCII, Latin-1, 
...) so all systems based on those encodings will see it as two 
characters, and I believe it is also represented as two Unicode characters.


In a way it makes sense to consider a newline as one character, 
independently of how it is represented, so I think the UIMA way is fine. 
 But is there an overview somewhere of how different systems and 
programming languages handle this, e.g. when extracting substrings, etc.?


Given the mess that this can be, it's probably best to normalize all text 
at the beginning so that we only deal with Unicode strings with LF endings, 
encoded as UTF-8 when writing to disk or otherwise serializing the data.
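
A minimal sketch of that normalization before the text ever goes into a
CAS; the helper class is hypothetical, and the point is only the explicit
UTF-8 decoding and the CR+LF replacement:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.uima.jcas.JCas;

public class TextNormalizer {

  /** Decode as UTF-8 and collapse CR+LF (and stray CR) to LF before use. */
  public static String readNormalized(String path) throws IOException {
    String text = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    return text.replace("\r\n", "\n").replace("\r", "\n");
  }

  /** Hypothetical helper: set the normalized text on a fresh JCas. */
  public static void fillCas(JCas jcas, String path) throws IOException {
    jcas.setDocumentText(readNormalized(path));
  }
}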


It would still be interesting to know how painful this can get when not 
normalizing, and e.g. passing data between UIMA (Java), NLTK (Python), 
our own C#-based system, etc.


Thanks,
Jens



Re: status of uimacpp?

2011-04-06 Thread Jens Grivolla

Thanks Bhavani,

I think I will just stay with the 2.3.0-incubating uimacpp for now then.

Jens

On 04/05/2011 10:16 PM, Bhavani Iyer wrote:

Hi Jens

The 2.3.0-incubating uimacpp will work with the 2.3.1 releases of the
uimaj and uima-as.  It should work with the ActiveMQ broker 5.4.1
included in the 2.3.1 release of uima-as.

We are working on a new release of UIMACPP.  The ActiveMQ service
deployment wrapper was migrated to ActiveMQ 3.2 in order to support
failover protocol - https://issues.apache.org/jira/browse/UIMA-1925.
This also requires moving to APR version 1.3.x.

In addition, there are changes to the build process on Linux as described here:
https://issues.apache.org/jira/browse/UIMA-2053.

Regards,
Bhavani

On 4/5/11, Jens Grivolla j+...@grivolla.net wrote:

Hi,

what's the current status of UIMA-CPP?  While uimaj and uima-as have
been released as 2.3.1, uimacpp hasn't and I haven't read of any plans
to release 2.3.1 so far.

Does 2.3.0-incubating uimacpp work with the 2.3.1 versions of uimaj and
uima-as, or would it be better to build it from trunk?  How about mixing 2.3.0
AS components with a 2.3.1 broker, etc.?

Thanks,
Jens









status of uimacpp?

2011-04-05 Thread Jens Grivolla

Hi,

what's the current status of UIMA-CPP?  While uimaj and uima-as have 
been released as 2.3.1, uimacpp hasn't and I haven't read of any plans 
to release 2.3.1 so far.


Does 2.3.0-incubating uimacpp work with the 2.3.1 versions of uimaj and 
uima-as, or would it be better to build it from trunk?  How about mixing 2.3.0 
AS components with a 2.3.1 broker, etc.?


Thanks,
Jens



runPearMerger on already merged PEARs

2010-12-14 Thread Jens Grivolla
It seems that runPearMerger.sh does not correctly adjust the paths when 
the input PEARs are themselves the result of a previous merge.


On first run `runPearMerger.sh ae1.pear ae2.pear -n ae12` the paths to 
resources get adjusted from $main_root/X to $main_root/ae1/X or 
$main_root/ae2/X respectively.


However, on subsequent `runPearMerger.sh ae12.pear ae3.pear -n ae123` 
only the top level paths appear to get adjusted.


Should I file an issue, or is that a known and accepted limitation of 
PearMerger?


Thanks,
Jens