2011/6/23 Jörn Kottmann <[email protected]>

> On 6/22/11 7:53 PM, Olivier Grisel wrote:
>
>> We can also fix by having an option to delete "garbage" texts from the
>>> corpus.
>>>
>> Yes, discarding a whole CAS. But if the CAS is document level instead
>> of sentence level, that might be an issue.
>>
>>
> It depends. If the whole article is in such bad condition that annotating
> it does not make sense, it should be discarded. If only a small part of the
> article cannot be annotated, the annotator can skip over this part.
>
>  What other kind of data do you think we should store outside the CASes?
>>>
>> If we ignore the Sofa editing use case, probably nothing.
>>
>>  +1, to do that for now.
>
>
>  Also do you know of a good database for storing CAS? For instance does
>>>> there exist an Apache CouchDB CASConsumer + CollectionReader? Or maybe
>>>> a JDBC CASConsumer + CollectionReader that we could use with Apache
>>>> Derby for instance?
>>>>
>>> I did a couple of tests with HBase and it was very easy to store 100M of
>>> CASes.
>>> Anyway, we do not really need to scale to such huge amounts, so I believe
>>> a NoSQL or relational database would be just fine.
>>>
>> I am -1 for HBase as it requires setting up a Hadoop cluster to run. As
>> we target human annotators, we won't have terabytes of text data
>> anyway and all data will probably fit in memory in most cases. I was
>> thinking about using a DB to be able to handle concurrent editing by
>> several annotators (+ ability to do search in the Sofa content) in a
>> simple way.
>>
>
> Yeah, it does not seem important which DB we use, since most will
> just work well for us.
>
> I believe concurrent editing is more a question of the data model we choose,
> and to support search I would use something Lucene-based instead of the
> features some DBs might have.
>
> For training it is also important that we can iterate
> over all items in a reasonable time.
>
> I actually like Bigtable's column family model because
> it is easy to store a Sofa plus feature structures in the columns,
> iterating is fast, and it can be scaled to huge amounts of data if needed.
>
> Anyway, maybe it would be good to start with Derby and just store XMI files
> in it. What do you think?
>

+1. At the moment it's not that important which storage solution we use, as it
can be improved once the basic functionality is finished.
I also imagine such a system with one CAS per doc with sentence level
annotations.
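
To make the "start simple and just store XMI files in a database" idea quoted
above concrete, here is a minimal sketch. It is only an illustration under
stated assumptions: SQLite (from the Python standard library) stands in for
Derby, the XMI is stored as an opaque string, and the table name `cas_store`
and the helper names are hypothetical, not anything from UIMA or OpenNLP.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the store. SQLite is a stand-in for Derby here."""
    conn = sqlite3.connect(path)
    # One row per document-level CAS; the XMI is kept as an opaque string.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cas_store ("
        "doc_id TEXT PRIMARY KEY, xmi TEXT NOT NULL)"
    )
    return conn


def put_cas(conn, doc_id, xmi):
    """Insert or overwrite the serialized CAS for one document."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO cas_store (doc_id, xmi) VALUES (?, ?)",
            (doc_id, xmi),
        )


def delete_cas(conn, doc_id):
    """Discard a whole CAS, e.g. a 'garbage' document."""
    with conn:
        conn.execute("DELETE FROM cas_store WHERE doc_id = ?", (doc_id,))


def iter_cases(conn):
    """Iterate over all stored CASes in doc_id order, e.g. for training."""
    cursor = conn.execute("SELECT doc_id, xmi FROM cas_store ORDER BY doc_id")
    for doc_id, xmi in cursor:
        yield doc_id, xmi
```

With this shape, discarding an unusable article is a single delete, and
training only needs the iterator, so swapping the backend later should not
affect the callers.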
Tommaso


>
> Jörn
>
