>> Do you have an issue with the parser and your HTML training data?
Sorry, misread this the first time. I do not have any issue with the parser.

Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/


On 13 January 2011 09:55, Paul Cowan <[email protected]> wrote:

>> Maybe you can contribute a small sample of your training data to the
>> project so we can add a JUnit test.
>
> I will gladly do that. What is the best way to do that? I believe the
> source control is moving.
>
> Is git an option or mercurial? Pull requests are great for this type of
> thing through github or the mercurial equivalent. I will make the model
> available for HTML parsing when it is finished also.
>
> Cheers
>
> Paul Cowan
>
> Cutting-Edge Solutions (Scotland)
>
> http://thesoftwaresimpleton.blogspot.com/
>
>
> On 13 January 2011 09:32, Jörn Kottmann <[email protected]> wrote:
>
>> On 1/7/11 7:51 AM, Paul Cowan wrote:
>>
>>> I've had a dig at the source to try and answer my own questions.
>>>
>>> Could somebody confirm that this is the correct way to delimit multiple
>>> documents for training?
>>>
>>> An example of which would be:
>>>
>>> <html><body><p>example</p><h2> <START:organization>Organization One<END>
>>> </h2> ........</body></html>
>>>
>>> <html><body><p>example</p><h2> <START:organization>Organization Two<END>
>>> </h2> ........</body></html>
>>>
>>> That is, I have a blank line between each document, which will ensure
>>> that clearAdaptiveData is called?
>>
>> Yes, a blank line indicates a new document in the training data, which
>> clears all adaptive data on the feature generators. The only adaptive
>> feature generator is currently the previous map feature generator.
>>
>>> My code to train the model is:
>>>
>>> @Test
>>> public void testHtmlOrganizationFind() throws Exception {
>>>     InputStream in = getClass().getClassLoader().getResourceAsStream(
>>>             "opennlp/tools/namefind/htmlbasic.train");
>>>
>>>     ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
>>>             new PlainTextByLineStream(new InputStreamReader(in)));
>>>
>>>     TokenNameFinderModel nameFinderModel = NameFinderME.train("en",
>>>             "organization", sampleStream,
>>>             Collections.<String, Object>emptyMap());
>>>
>>>     try {
>>>         sampleStream.close();
>>>     } catch (IOException ioe) {
>>>         // stream is already exhausted at this point, ignore
>>>     }
>>>
>>>     File modelOutFile = new File(
>>>             "/Users/paulcowan/projects/leadcapturer/models/en-ner-organization.bin");
>>>
>>>     if (modelOutFile.exists()) {
>>>         modelOutFile.delete();
>>>     }
>>>
>>>     OutputStream modelOut = null;
>>>
>>>     try {
>>>         modelOut = new BufferedOutputStream(
>>>                 new FileOutputStream(modelOutFile), IO_BUFFER_SIZE);
>>>         nameFinderModel.serialize(modelOut);
>>>     } catch (IOException ioe) {
>>>         System.err.println("Error during writing model file: "
>>>                 + ioe.getMessage());
>>>     } finally {
>>>         if (modelOut != null) {
>>>             try {
>>>                 modelOut.close();
>>>             } catch (IOException ioe) {
>>>                 System.err.println("Failed to properly close model file: "
>>>                         + ioe.getMessage());
>>>             }
>>>         }
>>>     }
>>>
>>>     assert modelOutFile.exists();
>>> }
>>
>> Your training code looks good to me. Do you have an issue with the parser
>> and your HTML training data? Maybe you can contribute a small sample of
>> your training data to the project so we can add a JUnit test.
>>
>> Jörn
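
For readers who want to check the result, a minimal sketch of loading the
serialized model and running it over one document, using the same OpenNLP
1.5 API as the test above. The model path and the token array are
illustrative assumptions; in practice the tokens would come from one of
the OpenNLP tokenizers.

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class OrganizationFinderSketch {

    public static void main(String[] args) throws Exception {
        // Load the model that the test above serialized to disk
        // (placeholder path).
        InputStream modelIn = new FileInputStream("en-ner-organization.bin");
        TokenNameFinderModel model;
        try {
            model = new TokenNameFinderModel(modelIn);
        } finally {
            modelIn.close();
        }

        NameFinderME nameFinder = new NameFinderME(model);

        // Tokens of a single document (placeholder input).
        String[] tokens = { "Organization", "One", "is", "based",
                "in", "Scotland", "." };

        // find() returns the spans of detected organization names.
        Span[] spans = nameFinder.find(tokens);
        for (String name : Span.spansToStrings(spans, tokens)) {
            System.out.println("organization: " + name);
        }

        // Like the blank line between training documents, clear the
        // adaptive data before moving on to the next document.
        nameFinder.clearAdaptiveData();
    }
}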
