Hello all,

I am putting this question in its own thread so that it does not get lost.

The question is about the proper usage of DefaultModelBuilderUtil.

I have not figured out the proper format of the input files. Here is what I
think, based on what I have been reading. Tell me if I am right.

From class DefaultModelBuilderUtil, method generateModel:

 * @param sentences   a file that contains one sentence per line.
 *                    There should be at least 15K sentences
 *                    consisting of a representative sample from
 *                    user data

This seems to be a text file where each sentence is on one line.
I wonder if it has to be annotated, for instance:

<START:person> Archimedes <END> used the method of exhaustion to
approximate the value of π. Archimedes ( 287–212 BC ) was the first
to estimate π rigorously .

Or just:

Archimedes used the method of exhaustion to approximate the value of
π. Archimedes ( 287–212 BC ) was the first to estimate π rigorously .
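To make the question concrete, here is a minimal sketch of how I would sanity-check the sentences file under my reading of the javadoc (one sentence per line, at least 15K lines). SentenceFileCheck and its methods are my own hypothetical helper, not part of OpenNLP, and the check for <START:...> markup only reports whether the file is already annotated; it does not say which form generateModel expects.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not an OpenNLP class.
class SentenceFileCheck {
    // Minimum size taken from the generateModel javadoc ("at least 15K sentences").
    static final int MIN_SENTENCES = 15_000;

    /** True if the line carries markup such as <START:person> ... <END>. */
    static boolean isAnnotated(String line) {
        return line.contains("<START:") && line.contains("<END>");
    }

    /** Counts non-blank lines, i.e. sentences under the one-per-line assumption. */
    static long countSentences(List<String> lines) {
        return lines.stream().filter(l -> !l.trim().isEmpty()).count();
    }

    /** Counts the lines that already contain annotation markup. */
    static long countAnnotated(List<String> lines) {
        return lines.stream().filter(SentenceFileCheck::isAnnotated).count();
    }

    public static void main(String[] args) throws IOException {
        // Read the file given on the command line, or fall back to two sample lines.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]))
                : Arrays.asList(
                    "<START:person> Archimedes <END> used the method of exhaustion .",
                    "Archimedes used the method of exhaustion .");
        long total = countSentences(lines);
        System.out.println("sentences: " + total
                + ", annotated: " + countAnnotated(lines));
        if (total < MIN_SENTENCES) {
            System.out.println("fewer than " + MIN_SENTENCES + " sentences");
        }
    }
}
```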


 * @param knownEntities   a file consisting of a simple list of
 *                        unambiguous entities, one entry per line.
 *                        For instance, if one was trying to build a
 *                        person NER model then this file would be a
 *                        list of person names that are unambiguous
 *                        and are known to exist in the sentences

Would this be a plain text list, with one name per line? Something like:

Archimedes
Socrates
....
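Since the javadoc says the entries must be "known to exist in the sentences", this is a rough sketch of how I would verify that before running the builder, assuming both files are plain text with one item per line. KnownEntitiesCheck is a hypothetical helper of mine, not part of OpenNLP, and it uses a naive substring match rather than real tokenization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not an OpenNLP class.
class KnownEntitiesCheck {
    /**
     * Returns the entity entries that never occur in any sentence.
     * Uses a plain substring match, so it is only a coarse check.
     */
    static List<String> missingEntities(List<String> sentences, List<String> entities) {
        List<String> missing = new ArrayList<>();
        for (String entity : entities) {
            String name = entity.trim();
            if (name.isEmpty()) {
                continue; // skip blank lines in the list
            }
            boolean found = false;
            for (String sentence : sentences) {
                if (sentence.contains(name)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "Archimedes used the method of exhaustion .");
        List<String> entities = Arrays.asList("Archimedes", "Socrates");
        // Entries listed here would violate the "known to exist" requirement.
        System.out.println("not found in sentences: "
                + missingEntities(sentences, entities));
    }
}
```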


 * @param knownEntitiesBlacklist   This file contains a list of known bad hits
 *                                 that the NER phase of this processing might
 *                                 catch early one before the model iterates
 *                                 to maturity

Is this the same format as knownEntities, but listing what should NOT be marked as an entity?
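If that reading is right, the two lists should presumably not share entries. Below is a small sketch of how I would check for that, assuming both files use the same one-entry-per-line format. BlacklistCheck is a hypothetical helper of mine, not part of OpenNLP, and the sample entries are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not an OpenNLP class.
class BlacklistCheck {
    /** Collects trimmed, non-blank entries into a sorted set. */
    static Set<String> toSet(List<String> lines) {
        Set<String> out = new TreeSet<>();
        for (String line : lines) {
            String entry = line.trim();
            if (!entry.isEmpty()) {
                out.add(entry);
            }
        }
        return out;
    }

    /** Returns the entries that appear in both the known list and the blacklist. */
    static Set<String> overlap(List<String> known, List<String> blacklist) {
        Set<String> both = toSet(known);
        both.retainAll(toSet(blacklist)); // keep only the shared entries
        return both;
    }

    public static void main(String[] args) {
        List<String> known = Arrays.asList("Archimedes", "Socrates");
        List<String> blacklist = Arrays.asList("Socrates", "Method");
        // Any entry printed here is listed as both a good and a bad hit.
        System.out.println("in both lists: " + overlap(known, blacklist));
    }
}
```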


The rest seemed quite straightforward.

Thanks,
