On 6/22/11 7:53 PM, Olivier Grisel wrote:
We can also fix this by having an option to delete "garbage" texts from
the corpus.
Yes, discarding a whole CAS. But if the CAS is document level instead
of sentence level, that might be an issue.
It depends. If the whole article is in such bad condition that
annotating it does not make sense, it should be discarded. If only a
small part of the article cannot be annotated, the annotator can skip
over this part.
What other kind of data do you think we should store outside the CASes?
If we ignore the Sofa editing use case, probably nothing.
+1 to doing that for now.
Also do you know of a good database for storing CAS? For instance does
there exist an Apache CouchDB CASConsumer + CollectionReader? Or maybe
a JDBC CASConsumer + CollectionReader that we could use with Apache
Derby for instance?
I did a couple of tests with HBase and it was very easy to store 100M of
CASes. Anyway, we do not really need to scale to such huge amounts, so I
believe a NoSQL or relational database would be just fine.
I am -1 for HBase as it requires setting up a Hadoop cluster to run. As
we target human annotators, we won't have terabytes of text data
anyway and all data will probably fit in memory in most cases. I was
thinking about using a DB to be able to handle concurrent editing by
several annotators (+ ability to do search in the Sofa content) in a
simple way.
Yeah, it does not seem important which DB we use, since most will
just work well for us.
I believe concurrent editing is more a question of the data model we
choose, and to support search I would use something Lucene-based instead
of the features some DBs might have.
For training it is also important that we can iterate
over all items in a reasonable time.
I actually like Bigtable's column family model because it is easy to
store a sofa plus feature structures in the columns, iterating is fast
and it can be scaled to huge amounts of data if needed.
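To make the column family idea concrete, here is a rough sketch using
plain Python dicts in place of a real store. The family names "sofa" and
"fs", the column qualifier format and the helper names are all made up
for illustration; this is not HBase or UIMA API.

```python
# Hypothetical layout: one row per CAS, keyed by document id.
# Family "sofa" holds the text, family "fs" holds one column per annotation.

def put_cas(table, doc_id, sofa_text, annotations):
    """Store a sofa and its (type, begin, end) annotations as one row."""
    row = {"sofa": {"text": sofa_text}, "fs": {}}
    for i, (type_name, begin, end) in enumerate(annotations):
        # The zero-padded index keeps the columns in insertion order when sorted.
        row["fs"][f"{i:06d}:{type_name}"] = (begin, end)
    table[doc_id] = row


def scan(table):
    """Iterate over all rows in key order, as a training run would."""
    for doc_id in sorted(table):
        row = table[doc_id]
        text = row["sofa"]["text"]
        spans = [
            (qualifier.split(":", 1)[1], text[begin:end])
            for qualifier, (begin, end) in sorted(row["fs"].items())
        ]
        yield doc_id, spans


table = {}
put_cas(table, "doc1", "Olivier lives in Paris",
        [("person", 0, 7), ("location", 17, 22)])
for doc_id, spans in scan(table):
    print(doc_id, spans)
```

Each annotation is its own column, so adding or removing one touches only
that column and not the sofa.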
Anyway, maybe it would be good to start with Derby and just store XMI
files in it, what do you think?
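Just to illustrate how little is needed for that, here is a minimal
sketch. It uses Python's built-in sqlite3 module as a stand-in for Derby
(which we would access through JDBC from Java); the table name, column
names and helper functions are assumptions for illustration only, and
the XMI strings are placeholders.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the store and create the single table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cas ("
        "doc_id TEXT PRIMARY KEY, xmi TEXT NOT NULL)"
    )
    return conn


def save_xmi(conn, doc_id, xmi):
    """Insert or overwrite the serialized CAS for one document."""
    with conn:  # commits on success, rolls back on error
        # INSERT OR REPLACE is SQLite syntax; Derby would need a different upsert.
        conn.execute(
            "INSERT OR REPLACE INTO cas (doc_id, xmi) VALUES (?, ?)",
            (doc_id, xmi),
        )


def iter_xmi(conn):
    """Iterate over all stored CASes in doc_id order, e.g. for training."""
    yield from conn.execute("SELECT doc_id, xmi FROM cas ORDER BY doc_id")


conn = open_store()
save_xmi(conn, "doc2", "<placeholder-xmi-2/>")
save_xmi(conn, "doc1", "<placeholder-xmi-1/>")
for doc_id, xmi in iter_xmi(conn):
    print(doc_id, len(xmi))
```

Storing the whole XMI as one value keeps things simple, but every edit
rewrites the full document, so concurrent editing would need extra care.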
Jörn