On 1/7/11 7:51 AM, Paul Cowan wrote:
I've had a dig at the source to try to answer my own questions.

Could somebody confirm that this is the correct way to delimit multiple
documents for training?

An example would be:

<html><body><p>example</p><h2>  <START:organization>Organization One<END>
</h2>  ........</body></html>

<html><body><p>example</p><h2>  <START:organization>Organization Two<END>
</h2>  ........</body></html>

That is, I have a blank line between each document, which will ensure that
clearAdaptiveData is called?

Yes, a blank line indicates a new document in the training data, which clears all adaptive data on the feature generators. The only adaptive feature generator
is currently the previous map feature generator.
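For reference, a minimal sketch (not from this message, assuming the OpenNLP 1.5 API; the class name, model path, and document arrays are just placeholders) of the detection-time counterpart of that blank line — calling clearAdaptiveData() between documents so the previous map feature generator does not carry state across them:

import java.io.FileInputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class PerDocumentFinding {

    // documents: one String[][] per document, one String[] of tokens per sentence.
    public static void findOrganizations(String[][][] documents) throws Exception {
        TokenNameFinderModel model = new TokenNameFinderModel(
                new FileInputStream("en-ner-organization.bin"));
        NameFinderME finder = new NameFinderME(model);

        for (String[][] document : documents) {
            for (String[] sentence : document) {
                Span[] names = finder.find(sentence);
                // ... collect the organization spans here ...
            }
            // Reset the previous-map feature generator before the next document,
            // mirroring the blank line between documents in the training file.
            finder.clearAdaptiveData();
        }
    }
}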
My code to train the model is:

    @Test
    public void testHtmlOrganizationFind() throws Exception {
        // Read the tagged training data from the classpath, one sentence per line.
        InputStream in = getClass().getClassLoader().getResourceAsStream(
                "opennlp/tools/namefind/htmlbasic.train");

        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
                new PlainTextByLineStream(new InputStreamReader(in)));

        // Train an organization name finder with the default feature generators.
        TokenNameFinderModel nameFinderModel = NameFinderME.train("en", "organization",
                sampleStream, Collections.<String, Object>emptyMap());

        try {
            sampleStream.close();
        } catch (IOException ioe) {
            // ignore: nothing useful to do if closing the training stream fails
        }

        File modelOutFile = new File(
                "/Users/paulcowan/projects/leadcapturer/models/en-ner-organization.bin");

        // Remove any model left over from a previous run.
        if (modelOutFile.exists()) {
            modelOutFile.delete();
        }

        OutputStream modelOut = null;

        try {
            // IO_BUFFER_SIZE is a constant defined elsewhere in the test class.
            modelOut = new BufferedOutputStream(
                    new FileOutputStream(modelOutFile), IO_BUFFER_SIZE);
            nameFinderModel.serialize(modelOut);
        } catch (IOException ioe) {
            System.err.println("Error during writing model file: " + ioe.getMessage());
        } finally {
            if (modelOut != null) {
                try {
                    modelOut.close();
                } catch (IOException ioe) {
                    System.err.println("Failed to properly close model file: "
                            + ioe.getMessage());
                }
            }
        }

        // JUnit's assertTrue runs even when the JVM assertion flag (-ea) is off.
        assertTrue(modelOutFile.exists());
    }
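To sanity-check the serialized model, a rough usage sketch (again assuming OpenNLP 1.5; the class name, model path, and example tokens are made up) of loading it back and running the finder on a single tokenized sentence:

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class OrganizationFinderDemo {

    public static void main(String[] args) throws Exception {
        // Load the model that the test above serialized (path is a placeholder).
        InputStream modelIn = new FileInputStream("en-ner-organization.bin");
        TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
        modelIn.close();

        NameFinderME nameFinder = new NameFinderME(model);

        // The finder expects one whitespace-tokenized sentence at a time,
        // tokenized the same way as the training data.
        String[] tokens = {"<h2>", "Organization", "One", "</h2>"};
        Span[] names = nameFinder.find(tokens);

        // Map the spans back to the covered token strings and print them.
        for (String name : Span.spansToStrings(names, tokens)) {
            System.out.println(name);
        }
    }
}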



Your training code looks good to me. Do you have an issue with
the parser and your HTML training data? Maybe you can contribute
a small sample of your training data to the project so we can
add a JUnit test.

Jörn
