I've had a dig at the source to try and answer my own questions.
Could somebody confirm that this is the correct way to delimit multiple
documents for training?
An example would be:
<html><body><p>example</p><h2> <START:organization>Organization One <END>
</h2> ........ </body></html>

<html><body><p>example</p><h2> <START:organization>Organization Two <END>
</h2> ........ </body></html>
That is, I have a blank line between each pair of documents. Will that ensure
that clearAdaptiveData is called at the start of each document?
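A minimal sketch of how such a file might be assembled, assuming each document is first collapsed onto a single line and the documents are then separated by one blank line. The class and method names (TrainingFileBuilder, collapse, join) are hypothetical and not part of OpenNLP; only the JDK is used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: builds the text of a
// blank-line-delimited training file from already marked-up documents.
class TrainingFileBuilder {

    // Collapse a document onto one line by replacing every run of
    // whitespace (including line breaks) with a single space.
    static String collapse(String document) {
        return document.trim().replaceAll("\\s+", " ");
    }

    // Join the documents with an empty line between them. Documents that
    // are empty after collapsing are skipped, so no extra blank lines appear.
    static String join(List<String> documents) {
        List<String> lines = new ArrayList<String>();
        for (String document : documents) {
            String line = collapse(document);
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return String.join("\n\n", lines) + "\n";
    }
}
```

The returned string could be written to a file such as htmlbasic.train and then read line by line in the same way as in the test in this mail.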
My code to train the model is:
@Test
public void testHtmlOrganizationFind() throws Exception {
    InputStream in = getClass().getClassLoader().getResourceAsStream(
            "opennlp/tools/namefind/htmlbasic.train");

    ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
            new PlainTextByLineStream(new InputStreamReader(in)));

    TokenNameFinderModel nameFinderModel;
    try {
        nameFinderModel = NameFinderME.train("en", "organization",
                sampleStream, Collections.<String, Object>emptyMap());
    } finally {
        // Close the sample stream even if training fails.
        sampleStream.close();
    }

    File modelOutFile = new File(
            "/Users/paulcowan/projects/leadcapturer/models/en-ner-organization.bin");

    // Remove any model left over from a previous run; delete() reports
    // failure through its return value rather than by throwing.
    if (modelOutFile.exists() && !modelOutFile.delete()) {
        System.err.println("Could not delete old model file: " + modelOutFile);
    }

    // IO_BUFFER_SIZE is assumed to be a constant defined elsewhere in the
    // test class.
    OutputStream modelOut = null;
    try {
        modelOut = new BufferedOutputStream(
                new FileOutputStream(modelOutFile), IO_BUFFER_SIZE);
        nameFinderModel.serialize(modelOut);
    } catch (IOException ioe) {
        System.err.println("Error during writing model file: "
                + ioe.getMessage());
    } finally {
        if (modelOut != null) {
            try {
                modelOut.close();
            } catch (IOException ioe) {
                System.err.println("Failed to properly close model file: "
                        + ioe.getMessage());
            }
        }
    }

    // Use a JUnit assertion; the assert keyword is skipped unless the JVM
    // runs with -ea.
    assertTrue(modelOutFile.exists());
}
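An earlier reply in this thread says that a name cannot span more than one line, so a check along the following lines could be run over the training file before calling train. This is only a sketch: the class and method names (MarkerCheck, badLines) are hypothetical, and it assumes markers look like <START:type> or <START> and <END>, as in the sample above. It does not reproduce OpenNLP's own parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-training check, not part of OpenNLP.
class MarkerCheck {

    // Matches a start marker such as <START:organization> or <START>,
    // or the end marker <END>.
    private static final Pattern MARKER =
            Pattern.compile("<START(:[^>\\s]*)?>|<END>");

    // Returns the 1-based numbers of the lines on which the markers are
    // not balanced, i.e. a name is opened twice, closed without being
    // opened, or left open at the end of the line.
    static List<Integer> badLines(List<String> lines) {
        List<Integer> bad = new ArrayList<Integer>();
        for (int i = 0; i < lines.size(); i++) {
            boolean open = false;
            boolean ok = true;
            Matcher m = MARKER.matcher(lines.get(i));
            while (m.find()) {
                if (m.group().equals("<END>")) {
                    if (!open) {
                        ok = false;   // <END> with no name open
                    }
                    open = false;
                } else {
                    if (open) {
                        ok = false;   // nested start marker
                    }
                    open = true;
                }
            }
            if (open) {
                ok = false;           // name still open at end of line
            }
            if (!ok) {
                bad.add(i + 1);
            }
        }
        return bad;
    }
}
```

A name that starts on one line and ends on the next would be reported on both lines, which makes it easy to find and join them by hand.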
Cheers
Paul Cowan
Cutting-Edge Solutions (Scotland)
http://thesoftwaresimpleton.blogspot.com/
On 5 January 2011 20:32, Paul Cowan <[email protected]> wrote:
> Hi,
>
> I have created a sample file which for now is only marked up with
> <START:Organization>..<END> markers and I have the following test which is
> passing. Java is not a language I have spent an awful lot of time on so
> forgive any ignorance on my part:
>
> @Test
> public void testHtmlOrganizationFind() throws Exception {
>     InputStream in = getClass().getClassLoader().getResourceAsStream(
>             "opennlp/tools/namefind/htmlbasic.train");
>
>     ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
>             new PlainTextByLineStream(new InputStreamReader(in)));
>
>     TokenNameFinderModel nameFinderModel = NameFinderME.train("en",
>             "organization", sampleStream,
>             Collections.<String, Object>emptyMap(), 70, 1);
>
>     assertNotNull(nameFinderModel);
> }
>
> At the moment, I am preprocessing the htmlbasic.train file by stripping out
> all the newline characters so that it is just one line.
>
> I would be grateful if anyone could help me with the following questions:
>
> 1. Is the "type" argument passed into NameFinderME.train method the type
> of the model which in my case is organization (<START:organization>)? If
> so, would I need to call train for each tag I mark up the text with? I want
> to use <START:location> and others for example.
>
> 2. How do I feed multiple files into the training? Somebody said I could
> use the <HTML> tags as document delimiters. Or is another way to merge all
> the documents into one file, with the documents delimited by the newline
> character? I cannot find a test which shows how to do this.
>
> Thanks
>
> Paul
>
>
>
> On 23 December 2010 16:42, Benson Margulies <[email protected]> wrote:
>
>> If I were you, I'd keep HTML digestion separate from sentence bounding.
>>
>>
>> On Thu, Dec 23, 2010 at 11:31 AM, Paul Cowan <[email protected]> wrote:
>> > Hi,
>> >
>> > Am I right in saying that I will also need to create and train my own
>> > HTML sentence detector in order to parse the HTML into chunks that can be
>> > tokenised?
>> >
>> >
>> >
>> > On 17 December 2010 15:10, Jörn Kottmann <[email protected]> wrote:
>> >
>> >> On 12/17/10 2:19 PM, James Kosin wrote:
>> >>
>> >>> I have the following questions that I would appreciate an answer for:
>> >>> >
>> >>> > 1. Can I have the different name finding tags in the same data?
>> >>>
>> >>
>> >> Yes, but that means you train a model which can detect each of these
>> >> names. You should test both, multiple name types in one model,
>> >> and separate models for each name type. You can use the built
>> >> in evaluation to validate your results.
>> >>
>> >>> > 2. Does the <START:address> <END> make sense over multiple lines or
>> >>> > should I break this up further?
>> >>>
>> >> No, that is not possible. Names spanning multiple sentences (a line is
>> >> a sentence) are not supported.
>> >>
>> >>
>> >>> > 3. I want to use 200 or 300 different examples, do I need to create
>> >>> > separate files for each example or can I merge them all into 1 and
>> >>> > if it is only 1, do I need to mark up the start and end of a file?
>> >>>
>> >> If you want to use the command line training tool, they must all be in
>> >> one file. If you use the API, it's up to you to merge these different
>> >> sources into one name sample stream.
>> >>
>> >> Jörn
>> >>
>> >
>>
>
>