2011/6/23 Jörn Kottmann <[email protected]>

> On 6/22/11 7:53 PM, Olivier Grisel wrote:
>
>>> We can also fix by having an option to delete "garbage" texts from
>>> the corpus.
>>>
>> Yes, discarding a whole CAS. But if the CAS is document level instead
>> of sentence level, that might be an issue.
>>
> It depends. If the whole article is in such a bad condition that
> annotating it does not make sense, it should be discarded. If only a
> small part of the article cannot be annotated, the annotator can skip
> over this part.
>
>>> What other kind of data do you think we should store outside the
>>> CASes?
>>>
>> If we ignore the Sofa editing use case, probably nothing.
>>
> +1, to do that for now.
>
>>>> Also do you know of a good database for storing CAS? For instance
>>>> does there exist an Apache CouchDB CASConsumer + CollectionReader?
>>>> Or maybe a JDBC CASConsumer + CollectionReader that we could use
>>>> with Apache Derby for instance?
>>>>
>>> I did a couple of tests with HBase and it was very easy to store 100M
>>> of CASes, anyway we do not really need to scale to such huge amounts,
>>> so I believe a NoSQL or relational database would be just fine.
>>>
>> I am -1 for HBase as it requires setting up a Hadoop cluster to run.
>> As we target human annotators, we won't have terabytes of text data
>> anyway and all data will probably fit in memory in most cases. I was
>> thinking about using a DB to be able to handle concurrent editing by
>> several annotators (+ ability to do search in the Sofa content) in a
>> simple way.
>>
> Yeah, it does not seem important which DB we use, since most will
> just work well for us.
>
> I believe concurrent editing is more a question of the data model we
> choose, and to support search I would use something Lucene based
> instead of the features some DBs might have.
>
> For training it is also important that we can iterate
> over all items in a reasonable time.
>
> I actually like Bigtable's column family model because
> it is easy to store a sofa plus feature structures in the columns,
> iterating is fast and it can be scaled to huge amounts of data if
> needed.
>
> Anyway, maybe it would be good to start with Derby and just store XMI
> files in it, what do you think?
>

+1, at the moment it's not that important which storage solution we use,
as it can be improved once the basic functionality is finished.
I also imagine such a system with one CAS per doc with sentence-level
annotations.

Tommaso

>
> Jörn
>
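
To make the "one CAS per doc with sentence-level annotations" idea a bit more concrete, here is a minimal sketch in plain Java. It is only an illustration under stated assumptions: `DocumentSketch`, `Sentence` and `Status` are hypothetical names and not part of the UIMA API, the `text` field merely plays the role of the Sofa, and the rule that only annotated sentences are used for training is my assumption about how skipped parts would be handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a document-level CAS; this is NOT the UIMA API. */
final class DocumentSketch {

    /** Annotation state of one sentence (assumed states, for illustration). */
    enum Status { PENDING, ANNOTATED, SKIPPED }

    /** A sentence is a character span [begin, end) over the document text. */
    static final class Sentence {
        final int begin;
        final int end;
        Status status = Status.PENDING;

        Sentence(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** The document text; plays the role of the Sofa in this sketch. */
    final String text;
    final List<Sentence> sentences = new ArrayList<>();

    DocumentSketch(String text) {
        this.text = text;
    }

    /** Registers a sentence span and returns it so its status can be set. */
    Sentence addSentence(int begin, int end) {
        if (begin < 0 || end > text.length() || begin >= end) {
            throw new IllegalArgumentException("bad span: " + begin + ".." + end);
        }
        Sentence s = new Sentence(begin, end);
        sentences.add(s);
        return s;
    }

    /** Returns the text covered by the given sentence span. */
    String coveredText(Sentence s) {
        return text.substring(s.begin, s.end);
    }

    /** Assumption: only sentences marked ANNOTATED are used for training. */
    List<String> trainingSentences() {
        List<String> out = new ArrayList<>();
        for (Sentence s : sentences) {
            if (s.status == Status.ANNOTATED) {
                out.add(coveredText(s));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        DocumentSketch doc = new DocumentSketch("Alice met Bob. ##garbage## Bob left.");
        doc.addSentence(0, 14).status = Status.ANNOTATED;
        doc.addSentence(15, 26).status = Status.SKIPPED;   // annotator skips this part
        doc.addSentence(27, 36).status = Status.ANNOTATED;
        System.out.println(doc.trainingSentences());
    }
}
```

With this shape the annotator can skip a bad part without discarding the whole document, and iterating over all training items is a single pass over the sentence list. How such an object would be serialized (XMI in Derby or anything else) is left open here.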
