>> Maybe you can contribute a small sample of your training data to the
>> project so we can add a junit test.

I will gladly do that. What is the best way to do it? I believe the
source control is moving.

Is Git or Mercurial an option? Pull requests through GitHub, or the
Mercurial equivalent, are great for this type of thing. I will also make
the model available for HTML parsing when it is finished.

Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/



On 13 January 2011 09:32, Jörn Kottmann <[email protected]> wrote:

> On 1/7/11 7:51 AM, Paul Cowan wrote:
>
>> I've had a dig at the source to try and answer my own questions
>>
>> Could somebody confirm that this is the correct way to delimit multiple
>> documents for training?
>>
>> An example of which would be:
>>
>> <html><body><p>example</p><h2>  <START:organization>Orgainization One<END>
>> </h2>  ........</body></html>
>>
>> <html><body><p>example</p><h2>  <START:organization>Orgainization Two<END>
>> </h2>  ........</body></html>
>>
>> That is, I have a blank line between each document which will ensure that
>> clearAdaptiveData is called?
>>
> Yes, a blank line indicates a new document in the training data, which
> clears all adaptive data on the feature generators. The only adaptive
> feature generator is currently the previous map feature generator.
>
>> My code to train the model is:
>>
>>  @Test
>>     public void testHtmlOrganizationFind() throws Exception{
>>         InputStream in = getClass().getClassLoader().getResourceAsStream(
>>         "opennlp/tools/namefind/htmlbasic.train");
>>
>>         ObjectStream<NameSample>  sampleStream = new NameSampleDataStream(
>>                 new PlainTextByLineStream(new InputStreamReader(in))
>>         );
>>
>>         TokenNameFinderModel nameFinderModel;
>>
>>         nameFinderModel = NameFinderME.train("en", "organization",
>>                 sampleStream, Collections.<String, Object>emptyMap());
>>
>>         try{
>>             sampleStream.close();
>>         }
>>         catch (IOException ioe){
>>         }
>>
>>         File modelOutFile = new File(
>>                 "/Users/paulcowan/projects/leadcapturer/models/en-ner-organization.bin");
>>
>>         if(modelOutFile.exists()){
>>             try{
>>                 modelOutFile.delete();
>>             }
>>             catch (Exception ex){
>>             }
>>         }
>>
>>         OutputStream modelOut = null;
>>
>>         try{
>>             modelOut = new BufferedOutputStream(
>>                     new FileOutputStream(modelOutFile), IO_BUFFER_SIZE);
>>             nameFinderModel.serialize(modelOut);
>>         }catch (IOException ioe){
>>             System.err.println("failed");
>>             System.err.println("Error during writing model file: "
>>                     + ioe.getMessage());
>>         }finally {
>>             if(modelOut != null){
>>                 try{
>>                     modelOut.close();
>>                 }catch(IOException ioe){
>>                     System.err.println("Failed to properly close model file: "
>>                             + ioe.getMessage());
>>                 }
>>             }
>>         }
>>
>>         assert(modelOutFile.exists());
>>     }
>>
>>
>>
> Your training code looks good to me. Do you have an issue with
> the parser and your HTML training data? Maybe you can contribute
> a small sample of your training data to the project so we can
> add a junit test.
>
> Jörn
>
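
The blank-line convention confirmed in the thread above can be illustrated
with a small, self-contained Java sketch. This is not OpenNLP code: the
class name BlankLineDocuments and the method split are hypothetical, and
the sketch only groups lines into documents, marking in a comment the point
where, according to the reply above, adaptive data would be cleared.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics how blank lines separate documents in a
// name finder training file. Hypothetical helper, not part of OpenNLP.
class BlankLineDocuments {

    // Groups training lines into documents; a blank line ends the current document.
    static List<List<String>> split(List<String> lines) {
        List<List<String>> documents = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                // Document boundary: per the thread, this is where
                // adaptive data would be cleared during training.
                if (!current.isEmpty()) {
                    documents.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(line);
            }
        }
        // Flush the last document if the input does not end with a blank line.
        if (!current.isEmpty()) {
            documents.add(current);
        }
        return documents;
    }
}
```

Passing the lines of two HTML training documents separated by a blank line
to split should give back a list of two documents, one per page. Consecutive
blank lines are treated as a single boundary in this sketch, which is an
assumption of the example rather than a statement about the real parser.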
