>> Maybe you can contribute a small sample of your training data to the project so we can add a junit test.
I will gladly do that. What is the best way to do it?

I believe the source control is moving. Is git an option, or Mercurial? Pull requests are great for this type of thing, through GitHub or the Mercurial equivalent.

I will also make the model available for HTML parsing when it is finished.

Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/


On 13 January 2011 09:32, Jörn Kottmann <[email protected]> wrote:

> On 1/7/11 7:51 AM, Paul Cowan wrote:
>
>> I've had a dig at the source to try and answer my own questions
>>
>> Could somebody confirm that this is the correct way to delimit multiple
>> documents for training?
>>
>> An example of which would be:
>>
>> <html><body><p>example</p><h2> <START:organization>Organization One<END>
>> </h2> ........</body></html>
>>
>> <html><body><p>example</p><h2> <START:organization>Organization Two<END>
>> </h2> ........</body></html>
>>
>> That is, I have a blank line between each document which will ensure that
>> clearAdaptiveData is called?
>>
> Yes, a blank line indicates a new document in the training data, which
> clears all adaptive data on the feature generators. The only adaptive
> feature generator is currently the previous map feature generator.
>
>> My code to train the model is:
>>
>> @Test
>> public void testHtmlOrganizationFind() throws Exception{
>>     InputStream in = getClass().getClassLoader().getResourceAsStream(
>>             "opennlp/tools/namefind/htmlbasic.train");
>>
>>     ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
>>             new PlainTextByLineStream(new InputStreamReader(in))
>>     );
>>
>>     TokenNameFinderModel nameFinderModel;
>>
>>     nameFinderModel = NameFinderME.train("en", "organization",
>>             sampleStream, Collections.<String, Object>emptyMap());
>>
>>     try{
>>         sampleStream.close();
>>     }
>>     catch (IOException ioe){
>>     }
>>
>>     File modelOutFile = new File(
>>             "/Users/paulcowan/projects/leadcapturer/models/en-ner-organization.bin");
>>
>>     if(modelOutFile.exists()){
>>         try{
>>             modelOutFile.delete();
>>         }
>>         catch (Exception ex){
>>         }
>>     }
>>
>>     OutputStream modelOut = null;
>>
>>     try{
>>         modelOut = new BufferedOutputStream(
>>                 new FileOutputStream(modelOutFile), IO_BUFFER_SIZE);
>>         nameFinderModel.serialize(modelOut);
>>     }catch (IOException ioe){
>>         System.err.println("failed");
>>         System.err.println("Error during writing model file: " +
>>                 ioe.getMessage());
>>     }finally {
>>         if(modelOut != null){
>>             try{
>>                 modelOut.close();
>>             }catch(IOException ioe){
>>                 System.err.println("Failed to properly close model file: " +
>>                         ioe.getMessage());
>>             }
>>         }
>>     }
>>
>>     assert(modelOutFile.exists());
>> }
>>
> Your training code looks good to me, do you have an issue with
> the parser and your html training data? Maybe you can contribute
> a small sample of your training data to the project so we can
> add a junit test.
>
> Jörn
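
Before sending a sample, it may be worth checking that the training file really splits into the documents you expect, since each blank line is where the adaptive data gets cleared. Below is a minimal sketch of such a check using only the JDK. The class name TrainingDocumentSplitter and its split method are hypothetical helpers made up for illustration; they are not part of OpenNLP and do not reproduce how NameSampleDataStream reads its input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an OpenNLP class): groups the lines of a name finder
// training file into documents, treating blank lines as document boundaries.
final class TrainingDocumentSplitter {

    private TrainingDocumentSplitter() {
    }

    // Returns one inner list per document; blank lines are dropped and
    // consecutive blank lines count as a single boundary.
    static List<List<String>> split(List<String> lines) {
        List<List<String>> documents = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                // Boundary: per the thread, this is where the trainer would
                // clear adaptive data before starting the next document.
                if (!current.isEmpty()) {
                    documents.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(line);
            }
        }
        // Keep the last document when the file does not end with a blank line.
        if (!current.isEmpty()) {
            documents.add(current);
        }
        return documents;
    }
}
```

Calling TrainingDocumentSplitter.split on the lines of the two-document HTML sample quoted above should give two documents of two lines each; if the count is off, a stray blank line inside the HTML is the likely cause.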
