2011/6/22 Jörn Kottmann <[email protected]>:
> On 6/22/11 6:50 PM, Olivier Grisel wrote:
>>
>> I am ok with switching to UIMA CAS. We might need additional metadata
>> outside of the CAS annotations though. For instance, if an annotator
>> fixes a typo in the Sofa itself, we might need to be able to record
>> that Sofa1 is to be replaced by Sofa2 according to annotator A1.
>>
>
> I am not sure if we should fix such mistakes; the system will also encounter
> them in the real data it needs to process. Fixing typos or correcting things
> in the text is always difficult when there are already existing annotations.
>
> Do you feel fixing mistakes in the text is important?

We can leave that issue as a low priority discussion for later and
just ignore it for now.


> We can also handle this by having an option to delete "garbage" texts from
> the corpus.

Yes, discarding a whole CAS. But if the CAS is document-level instead
of sentence-level, that might be an issue, since one bad sentence would
mean dropping the whole document.

> What other kind of data do you think we should store outside the CASes?

If we ignore the Sofa editing use case, probably nothing.

>> Also do you know of a good database for storing CAS? For instance does
>> there exist an Apache CouchDB CASConsumer + CollectionReader? Or maybe
>> a JDBC CASConsumer + CollectionReader that we could use with Apache
>> Derby for instance?
>
> I did a couple of tests with HBase and it was very easy to store 100M of
> CASes. Anyway, we do not really need to scale to such huge amounts, so I
> believe a NoSQL or relational database would be just fine.

I am -1 for HBase as it requires setting up a Hadoop cluster to run. As
we target human annotators, we won't have terabytes of text data
anyway and all data will probably fit in memory in most cases. I was
thinking about using a DB to be able to handle concurrent editing by
several annotators (plus the ability to search the Sofa content) in a
simple way.
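
As a rough illustration of the two requirements above (concurrent editing
and search in the Sofa content) without any database, here is a minimal
in-memory sketch. Everything in it is an assumption made for the example:
the class name InMemoryCorpus, the revision counter, and the reduction of a
document to its Sofa text (a real version would hold a CAS). It uses
optimistic concurrency: a save only succeeds if nobody else saved since the
revision the annotator started from.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory corpus: optimistic concurrent edits plus naive Sofa search. */
final class InMemoryCorpus {

    /** Immutable snapshot of one document: Sofa text plus a revision counter. */
    static final class Doc {
        final String sofaText;
        final long revision;

        Doc(String sofaText, long revision) {
            this.sofaText = sofaText;
            this.revision = revision;
        }
    }

    private final Map<String, Doc> docs = new ConcurrentHashMap<>();

    /** Adds a document at revision 0; returns false if the id already exists. */
    boolean add(String id, String sofaText) {
        return docs.putIfAbsent(id, new Doc(sofaText, 0L)) == null;
    }

    /** Returns the current snapshot, or null for an unknown id. */
    Doc get(String id) {
        return docs.get(id);
    }

    /**
     * Saves newText only if the document is still at expectedRevision.
     * Returns false when another annotator saved first, so the caller can
     * reload and retry instead of silently overwriting that work.
     */
    boolean update(String id, long expectedRevision, String newText) {
        Doc current = docs.get(id);
        if (current == null || current.revision != expectedRevision) {
            return false;
        }
        // Atomic swap: succeeds only if the map still holds this exact snapshot
        // (Doc does not override equals, so this is an identity comparison).
        return docs.replace(id, current, new Doc(newText, expectedRevision + 1));
    }

    /** Returns the sorted ids of all documents whose Sofa text contains needle. */
    List<String> search(String needle) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Doc> entry : docs.entrySet()) {
            if (entry.getValue().sofaText.contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        Collections.sort(hits);
        return hits;
    }
}
```

An annotation client would call get(), remember the revision, and pass it
back to update(); a false return means a conflicting save happened in
between. The linear scan in search() is only reasonable because the corpus
is expected to fit in memory.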

> To get started I believe we should just store a CAS as XMI, and at a later
> stage we can work on optimizing the CAS storage to our needs and maybe even
> work together with the UIMA team on a more general corpus server. I know
> several people who are interested in this.

Alright. Let's use plain XMI files, parsed and loaded in memory at the
beginning of an annotation session.

> I believe the Corpus server should be independent of the other components
> and define some kind of remote API for data interchange.

Is there a JSON version of XMI? Hannes, what is your opinion on this?

> If we define such an API, the actual storage system can be interchanged
> easily at a later point in time.

Ok.
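
To make the idea of a storage-neutral API concrete, here is one possible
shape for it, as a sketch only. The interface name CorpusStore, its three
methods, and both implementations are hypothetical and not part of UIMA or
of any existing corpus server. The CAS is treated as opaque serialized bytes
(for example XMI), so callers never see whether it lives in a directory of
.xmi files, in memory, or later in a database.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

/** Hypothetical storage-neutral API: callers only see ids and serialized CAS bytes. */
interface CorpusStore {
    void save(String id, byte[] xmi);

    /** Returns the stored bytes, or null when the id is unknown. */
    byte[] load(String id);

    /** Returns all known ids in sorted order. */
    List<String> listIds();
}

/** Keeps everything in memory; handy for tests and small corpora. */
final class MemoryCorpusStore implements CorpusStore {
    private final Map<String, byte[]> data = new ConcurrentHashMap<>();

    public void save(String id, byte[] xmi) {
        data.put(id, xmi.clone()); // defensive copy
    }

    public byte[] load(String id) {
        byte[] bytes = data.get(id);
        return bytes == null ? null : bytes.clone();
    }

    public List<String> listIds() {
        List<String> ids = new ArrayList<>(data.keySet());
        Collections.sort(ids);
        return ids;
    }
}

/** Stores each CAS as <id>.xmi in one directory; ids are assumed to be safe file names. */
final class DirectoryCorpusStore implements CorpusStore {
    private static final String SUFFIX = ".xmi";
    private final Path dir;

    DirectoryCorpusStore(Path dir) {
        this.dir = dir;
    }

    public void save(String id, byte[] xmi) {
        try {
            Files.write(dir.resolve(id + SUFFIX), xmi);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public byte[] load(String id) {
        Path file = dir.resolve(id + SUFFIX);
        try {
            return Files.exists(file) ? Files.readAllBytes(file) : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public List<String> listIds() {
        List<String> ids = new ArrayList<>();
        try (Stream<Path> files = Files.list(dir)) {
            files.map(p -> p.getFileName().toString())
                 .filter(name -> name.endsWith(SUFFIX))
                 .forEach(name -> ids.add(name.substring(0, name.length() - SUFFIX.length())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(ids);
        return ids;
    }
}
```

Code written against CorpusStore works unchanged with either
implementation, which is the property that would let the storage backend be
replaced later. A remote API could expose the same three operations over
HTTP.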

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
