>> Do you have an issue with the parser and your HTML training data?

Sorry, I misread this the first time; I do not have any issue with the parser.

Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/



On 13 January 2011 09:55, Paul Cowan <[email protected]> wrote:

> >> Maybe you can contribute
> >> a small sample of your training data to the project so we can
> >> add a JUnit test.
>
> I will gladly do that. What is the best way to do it? I believe the
> source control is moving.
>
> Is Git an option, or Mercurial? Pull requests are great for this type of
> thing through GitHub or the Mercurial equivalent. I will make the model
> available for HTML parsing when it is finished as well.
>
> Cheers
>
> Paul Cowan
>
> Cutting-Edge Solutions (Scotland)
>
> http://thesoftwaresimpleton.blogspot.com/
>
>
>
> On 13 January 2011 09:32, Jörn Kottmann <[email protected]> wrote:
>
>> On 1/7/11 7:51 AM, Paul Cowan wrote:
>>
>>> I've had a dig at the source to try and answer my own questions
>>>
>>> Could somebody confirm that this is the correct way to delimit multiple
>>> documents for training?
>>>
>>> An example of which would be:
>>>
>>> <html><body><p>example</p><h2>  <START:organization>Organization
>>> One<END>
>>> </h2>  ........</body></html>
>>>
>>> <html><body><p>example</p><h2>  <START:organization>Organization
>>> Two<END>
>>> </h2>  ........</body></html>
>>>
>>> That is, I have a blank line between each document which will ensure that
>>> clearAdaptiveData is called?
>>>
>> Yes, a blank line indicates a new document in the training data, which
>> clears all adaptive data on the feature generators. The only adaptive
>> feature generator is currently the previous map feature generator.
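
[Editorial aside: the blank-line convention described above can be illustrated with plain Java. This is only a sketch with names of my own choosing, not OpenNLP API; it groups lines into documents at empty lines, which is the boundary at which a name finder would clear its adaptive data.]

```java
import java.util.ArrayList;
import java.util.List;

public class DocumentSplitter {
    // Groups training lines into documents; an empty (or whitespace-only)
    // line ends the current document, mirroring where the sample stream
    // would trigger clearAdaptiveData.
    public static List<List<String>> split(List<String> lines) {
        List<List<String>> docs = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                if (!current.isEmpty()) {
                    docs.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(line);
            }
        }
        if (!current.isEmpty()) {
            docs.add(current);
        }
        return docs;
    }
}
```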
>>
>>> My code to train the model is:
>>>
>>>  @Test
>>>     public void testHtmlOrganizationFind() throws Exception{
>>>         InputStream in = getClass().getClassLoader().getResourceAsStream(
>>>         "opennlp/tools/namefind/htmlbasic.train");
>>>
>>>         ObjectStream<NameSample>  sampleStream = new
>>> NameSampleDataStream(
>>>                 new PlainTextByLineStream(new InputStreamReader(in))
>>>         );
>>>
>>>         TokenNameFinderModel nameFinderModel;
>>>
>>>         nameFinderModel = NameFinderME.train("en", "organization",
>>>                 sampleStream, Collections.<String, Object>emptyMap());
>>>
>>>         try{
>>>             sampleStream.close();
>>>         }
>>>         catch (IOException ioe){
>>>         }
>>>
>>>         File modelOutFile = new
>>>
>>> File("/Users/paulcowan/projects/leadcapturer/models/en-ner-organization.bin");
>>>
>>>         if(modelOutFile.exists()){
>>>             // File.delete() returns a boolean and does not throw,
>>>             // so no try/catch is needed here.
>>>             modelOutFile.delete();
>>>         }
>>>
>>>         OutputStream modelOut = null;
>>>
>>>         try{
>>>             modelOut = new BufferedOutputStream(new
>>> FileOutputStream(modelOutFile), IO_BUFFER_SIZE);
>>>             nameFinderModel.serialize(modelOut);
>>>         }catch (IOException ioe){
>>>             System.err.println("failed");
>>>             System.err.println("Error during writing model file: " +
>>> ioe.getMessage());
>>>         }finally {
>>>             if(modelOut != null){
>>>                 try{
>>>                     modelOut.close();
>>>                 }catch(IOException ioe){
>>>                   System.err.println("Failed to properly close model
>>> file: " +
>>>                       ioe.getMessage());
>>>                 }
>>>             }
>>>         }
>>>
>>>         assert(modelOutFile.exists());
>>>     }
>>>
>>>
>>>
>> Your training code looks good to me. Do you have an issue with
>> the parser and your HTML training data? Maybe you can contribute
>> a small sample of your training data to the project so we can
>> add a JUnit test.
>>
>> Jörn
>>
>
>
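
[Editorial aside: for readers unfamiliar with the `<START:organization> ... <END>` markup used in the training samples above, here is a minimal stdlib-only sketch that pulls the annotated names out of a sample. This is my own illustrative regex, not how OpenNLP's NameSampleDataStream actually parses samples (it works on whitespace-separated tokens), so treat it as a rough approximation.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpanExtractor {
    // Matches <START:type> ... <END> spans; DOTALL lets a span continue
    // across a line break, and \s+ is collapsed in the result.
    private static final Pattern SPAN =
        Pattern.compile("<START:(\\w+)>\\s*(.*?)\\s*<END>", Pattern.DOTALL);

    public static List<String> extract(String text, String type) {
        List<String> names = new ArrayList<>();
        Matcher m = SPAN.matcher(text);
        while (m.find()) {
            if (m.group(1).equals(type)) {
                names.add(m.group(2).replaceAll("\\s+", " "));
            }
        }
        return names;
    }
}
```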
