Re: [VOTE] Migrate our main repositories to GitHub

2017-06-27 Thread Mark G
+1

Sent from my iPhone

> On Jun 27, 2017, at 6:30 AM, Joern Kottmann  wrote:
> 
> +1
> 
> Jörn
> 
>> On Tue, Jun 27, 2017 at 12:30 PM, Joern Kottmann  wrote:
>> Hello all,
>> 
>> let's decide here if we want to move our main repository, currently
>> hosted at Apache, to GitHub instead. This will make our process a bit
>> easier because we can eliminate one remote from our workflow.
>> 
>> [ ] +1 Migrate all repositories to GitHub
>> [ ] -1 Do not migrate, because...
>> 
>> Thanks,
>> Jörn


Re: EntityLinker example

2016-12-06 Thread Mark G
There is a coreference resolution module in OpenNLP here:
http://svn.apache.org/viewvc/opennlp/sandbox/opennlp-coref/
I have never used it, but perhaps Joern might know how to get it working.

On Tue, Dec 6, 2016 at 7:07 AM, Damiano Porta 
wrote:

> Hmm, ok, so I must use an external tool for coreference.
>
> 2016-12-06 12:24 GMT+01:00 Mark G :
>
> > No, sorry, I don't think it would be any help for that.
> >
> > Sent from my iPhone
> >
> > > On Dec 6, 2016, at 5:57 AM, Damiano Porta 
> > wrote:
> > >
> > > Hello again Mark,
> > > pardon for the late reply.
> > >
> > > Basically I would like to link entities (different types) inside
> > > sentences.
> > > I do not need to link them to an external resource, I am referring to
> > > co-reference. Is EntityLinker good for that?
> > >
> > > Thanks
> > >
> > > 2016-11-27 0:18 GMT+01:00 Damiano Porta :
> > >
> > >> Thanks!
> > >>
> > >> Il 27/Nov/2016 00:07, "Mark G"  ha scritto:
> > >>
> > >>> The best example I think is the GeoEntityLinker addon, specifically the
> > >>> class GeoEntityLinker, which links a named entity to a gazetteer under the
> > >>> hood for geotagging. It's a bit tedious to set it up, but just looking at
> > >>> the code will probably make sense. The GeoEntityLinker is in the ADDONS
> > >>> repo.
> > >>>
> > >>> On Sat, Nov 26, 2016 at 5:51 PM, Damiano Porta <damianopo...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Hello,
> > >>>> do you have an example or a test to see how the EntityLinker works?
> > >>>> Thanks
> > >>>>
> > >>>> Damiano
> > >>>>
> > >>>
> > >>
> >
>


Re: EntityLinker example

2016-12-06 Thread Mark G
No, sorry, I don't think it would be any help for that.

Sent from my iPhone

> On Dec 6, 2016, at 5:57 AM, Damiano Porta  wrote:
> 
> Hello again Mark,
> pardon for the late reply.
> 
> Basically I would like to link entities (different types) inside sentences.
> I do not need to link them to an external resource, I am referring to
> co-reference. Is EntityLinker good for that?
> 
> Thanks
> 
> 2016-11-27 0:18 GMT+01:00 Damiano Porta :
> 
>> Thanks!
>> 
>> Il 27/Nov/2016 00:07, "Mark G"  ha scritto:
>> 
>>> The best example I think is the GeoEntityLinker addon, specifically the
>>> class GeoEntityLinker, which links a named entity to a gazetteer under the
>>> hood for geotagging. It's a bit tedious to set it up, but just looking at
>>> the code will probably make sense. The GeoEntityLinker is in the ADDONS
>>> repo.
>>> 
>>> On Sat, Nov 26, 2016 at 5:51 PM, Damiano Porta 
>>> wrote:
>>> 
>>>> Hello,
>>>> do you have an example or a test to see how the EntityLinker works?
>>>> Thanks
>>>> 
>>>> Damiano
>>>> 
>>> 
>> 


Re: EntityLinker example

2016-11-26 Thread Mark G
The best example I think is the GeoEntityLinker addon, specifically the
class GeoEntityLinker, which links a named entity to a gazetteer under the
hood for geotagging. It's a bit tedious to set it up, but just looking at
the code will probably make sense. The GeoEntityLinker is in the ADDONS
repo.

On Sat, Nov 26, 2016 at 5:51 PM, Damiano Porta 
wrote:

> Hello,
> do you have an example or a test to see how the EntityLinker works?
> Thanks
>
> Damiano
>
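
For a concrete feel of what the linking step does, here is a minimal
dictionary-based sketch in the spirit of the simple "map country names" idea
that comes up later on this list. The gazetteer map and its entries are
hypothetical toys; this is not the GeoEntityLinker's actual lookup logic,
which queries a full gazetteer index:

import java.util.HashMap;
import java.util.Map;
import opennlp.tools.util.Span;

public class SimpleGazetteerLinker {

  // Hypothetical toy gazetteer: lowercased place name -> ISO country code.
  private final Map<String, String> gazetteer = new HashMap<String, String>();

  public SimpleGazetteerLinker() {
    gazetteer.put("paris", "FR");
    gazetteer.put("berlin", "DE");
  }

  /** Looks up each location span from the name finder in the gazetteer. */
  public Map<Span, String> link(String[] tokens, Span[] locationSpans) {
    Map<Span, String> links = new HashMap<Span, String>();
    for (Span span : locationSpans) {
      // Rebuild the surface form covered by the span from the tokens.
      StringBuilder name = new StringBuilder();
      for (int i = span.getStart(); i < span.getEnd(); i++) {
        if (name.length() > 0) name.append(' ');
        name.append(tokens[i]);
      }
      String countryCode = gazetteer.get(name.toString().toLowerCase());
      if (countryCode != null) {
        links.put(span, countryCode);
      }
    }
    return links;
  }
}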


Re: Word Sense Disambiguation

2015-01-19 Thread Mark G
+1

On Mon, Jan 19, 2015 at 1:49 PM, Tommaso Teofili 
wrote:

> +1
>
> Tommaso
>
> 2015-01-19 19:10 GMT+01:00 Joern Kottmann :
>
> > Hello,
> >
> > +1 from me to just go ahead and implement the proposed approach. One
> > goal of this implementation will be to figure out the interface we want
> > to have in OpenNLP for WSD.
> >
> > We can later extend OpenNLP with more implementations which are taking
> > different approaches.
> >
> > Jörn
> >
> > On Thu, 2015-01-15 at 16:50 +0900, Anthony Beylerian wrote:
> > > Hello,
> > >
> > > I'm new here. I previously mentioned to Jörn that my colleagues and
> > > I are interested in helping to implement this component. We were
> > > thinking of starting with simple knowledge-based approaches; although
> > > they do not yield high accuracy, as a first step they are relatively
> > > simple. We would like your opinion.
> > >
> > > Pei also mentioned "cTAKES (
> > > http://svn.apache.org/repos/asf/ctakes/sandbox/ctakes-wsd/ currently in
> > > very exploratory stages here) and YTEX (
> > > https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08) is also
> > > just exploring WSD for the healthcare domain. It's also currently
> > > knowledge/ontology based for now... It would be great to see if OpenNLP
> > > supports a general domain WSD"
> > >
> > > Best,
> > >
> > > Anthony
> > >
> >
> >
> >
>
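
As a concrete starting point for the knowledge-based route Anthony mentions,
here is a minimal sketch of the classic simplified Lesk overlap heuristic.
The gloss map is a hypothetical stand-in for a dictionary such as WordNet,
and none of this is an existing OpenNLP API:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SimplifiedLesk {

  /**
   * Picks the sense whose gloss shares the most words with the context.
   * Glosses would come from a dictionary such as WordNet; here the map
   * (sense id -> gloss text) is supplied by the caller.
   */
  public static String disambiguate(String[] contextTokens,
                                    Map<String, String> senseGlosses) {
    Set<String> context = new HashSet<String>();
    for (String t : contextTokens) {
      context.add(t.toLowerCase());
    }
    String bestSense = null;
    int bestOverlap = -1;
    for (Map.Entry<String, String> sense : senseGlosses.entrySet()) {
      int overlap = 0;
      // Count gloss words that also occur in the context window.
      for (String glossWord : sense.getValue().toLowerCase().split("\\s+")) {
        if (context.contains(glossWord)) {
          overlap++;
        }
      }
      if (overlap > bestOverlap) {
        bestOverlap = overlap;
        bestSense = sense.getKey();
      }
    }
    return bestSense;
  }
}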


Re: [opennlp-dev] TokenNameFinderFactory new features and extension

2014-10-08 Thread Mark G
Rodrigo, thanks for this fix, let me know if you want me to test, or just
put a second set of eyes on any particular parts.
MG

On Wed, Oct 8, 2014 at 2:32 PM, Jörn Kottmann  wrote:

> Well done, this was a serious regression.
>
> You observed this earlier:
> > Only one issue remains: The requirement to add -factory parameter for
> > the -featuregen parameter to work and its backing-off to default
> > features without warning if the -factory param is not used.
>
> Is that still the case with the fix you applied?
>
> We need to add a test case to be able to detect bugs in the setup of the
> feature generation. I wouldn't be surprised if something similar happens
> again at some point.
>
> Jörn
>
> On Wed, 2014-10-08 at 17:27 +0100, Rodrigo Agerri wrote:
> > Hello,
> >
> > On Wed, Oct 8, 2014 at 8:17 AM, Jörn Kottmann 
> wrote:
> > >
> > > +1 for the first option.
> >
> > Great, I will commit and close the issue.
> >
> > Thanks!
> >
> > Rodrigo
>
>
>


Re: How to use DefaultModelBuilderUtil

2014-05-20 Thread Mark G
That is correct: the sentence file does not need annotations, and the other files
are one name per line.
It uses the names file to annotate the sentences, and won't annotate anything
that's in the blacklist file.



Let me know how it goes!

Sent from my iPhone

> On May 20, 2014, at 6:08 AM, Carlos Scheidecker  wrote:
> 
> Hello all,
> 
> I am putting this question on its own thread not to get lost.
> 
> Question is about the proper usage of DefaultModelBuilderUtil.
> 
> I have not figured out the proper format of the files. Here's what I think
> from what I have been reading. Tell me if I am right.
> 
> From class DefaultModelBuilderUtil method generateModel:
> 
> @param sentences  a file that contains one sentence per line.
>                   There should be at least 15K sentences
>                   consisting of a representative sample
>                   from user data
> 
> This seems to be a text file where each sentence is on one line.
> I wonder if it has to be annotated, for instance:
> 
> <START:person> Archimedes <END> used the method of exhaustion to
> approximate the value of π. Archimedes ( 287–212 BC ) was the first
> to estimate π rigorously .
> 
> Or just:
> 
> Archimedes used the method of exhaustion to approximate the value of
> π. Archimedes ( 287–212 BC ) was the first to estimate π rigorously .
> 
> 
> @param knownEntities  a file consisting of a simple list of
>                       unambiguous entities, one entry per line.
>                       For instance, if one was trying to build a
>                       person NER model then this file would be a
>                       list of person names that are unambiguous
>                       and are known to exist in the sentences
> 
> This would be a text file list?
> 
> Something like one name per line?
> 
> Archimedes
> Socrates
> 
> 
> 
> @param knownEntitiesBlacklist  This file contains a list of known bad hits
>                                that the NER phase of this processing might
>                                catch early on before the model iterates
>                                to maturity
> 
> Same as the knownEntities but a list of what NOT to mark as an entity?
> 
> 
> The rest seemed quite straightforward.
> 
> Thanks,
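
Putting the three files described above together, a hedged sketch of the
call. The package and the parameter order of generateModel are assumed from
the javadoc quoted above and not verified against the addon's actual
signature:

import java.io.File;
import opennlp.addons.modelbuilder.DefaultModelBuilderUtil; // assumed package

public class ModelBuilderExample {
  public static void main(String[] args) {
    File sentences = new File("sentences.txt");   // one sentence per line, no annotations
    File knownEntities = new File("names.txt");   // one unambiguous name per line
    File blacklist = new File("blacklist.txt");   // one known-bad hit per line
    File modelOut = new File("person.model");
    File annotatedSentencesOut = new File("annotated-sentences.txt");

    // Assumed parameter order; check the addon's javadoc before using.
    DefaultModelBuilderUtil.generateModel(sentences, knownEntities, blacklist,
        modelOut, annotatedSentencesOut, "person", 3);
  }
}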


Re: MarkableFileInputStreamFactory

2014-05-20 Thread Mark G
That is correct: the sentence file does not need annotations, and the other files
are one name per line.
It uses the names file to annotate the sentences, and won't annotate anything
that's in the blacklist file.

Let me know how it goes!



> On May 20, 2014, at 4:16 AM, Carlos Scheidecker  wrote:
> 
> I have not move forward on it, but yes Mark, want to use it.
> 
> I have seen one of your examples.
> 
> But I have not figured out the proper format of the files. Here's what I
> think from what I have been reading. Tell me if I am right.
> 
> From class DefaultModelBuilderUtil method generateModel:
> 
> @param sentences  a file that contains one sentence per line.
>                   There should be at least 15K sentences
>                   consisting of a representative sample
>                   from user data
> 
> This seems to be a text file where each sentence is on one line.
> I wonder if it has to be annotated, for instance:
> 
> <START:person> Archimedes <END> used the method of exhaustion to
> approximate the value of π. Archimedes ( 287–212 BC ) was the first
> to estimate π rigorously .
> 
> Or just:
> 
> Archimedes used the method of exhaustion to approximate the value of
> π. Archimedes ( 287–212 BC ) was the first to estimate π rigorously .
> 
> 
> @param knownEntities  a file consisting of a simple list of
>                       unambiguous entities, one entry per line.
>                       For instance, if one was trying to build a
>                       person NER model then this file would be a
>                       list of person names that are unambiguous
>                       and are known to exist in the sentences
> 
> This would be a text file list?
> 
> Something like one name per line?
> 
> Archimedes
> Socrates
> 
> 
> 
> @param knownEntitiesBlacklist  This file contains a list of known bad hits
>                                that the NER phase of this processing might
>                                catch early on before the model iterates
>                                to maturity
> 
> Same as the knownEntities but a list of what NOT to mark as an entity?
> 
> 
> The rest seemed quite straightforward.
> 
> Thanks,
> 
> Carlos.
> 
> 
> 
> 
>> On Mon, May 19, 2014 at 5:34 PM, Mark G  wrote:
>> 
>> No problem, Carlos are you using the model builder add on ?
>> 
>> 
>> Mg
>> 
>>>> On May 19, 2014, at 6:29 PM, Carlos Scheidecker 
>>> wrote:
>>> 
>>> Thanks mate! Saw you updated the code. Cheers.
>>> 
>>> 
>>>> On Mon, May 19, 2014 at 3:24 PM, Mark G  wrote:
>>>> 
>>>> OK, thanks Carlos, I think I will commit the change, seems like it
>> wouldn't
>>>> hurt. Anybody else?
>>>> 
>>>> 
>>>> On Mon, May 19, 2014 at 5:07 PM, Carlos Scheidecker <nando@gmail.com> wrote:
>>>> 
>>>>> I am having the same issue Mark.
>>>>> 
>>>>> The class is not public so it has no visibility
>>>>> inside opennlp.addons.modelbuilder.impls.GenericModelableImpl therefore
>>>> it
>>>>> cannot be built with Maven or resolved inside Eclipse.
>>>>> 
>>>>> I have also been looking at new commits to fix that and there were
>> none.
>>>>> 
>>>>> 
>>>>>> On Mon, May 12, 2014 at 1:03 PM, Mark G  wrote:
>>>>>> 
>>>>>> Does MarkableFileInputStreamFactory need to be package private? I am
>>>>> using
>>>>>> it in an addon (modelbuilder-addon), I would like to either move it or
>>>>> make
>>>>>> it a public class. Perhaps I should be using a different class
>>>>> altogether?
>>>>>> 
>>>>>> I am using it like this
>>>>>> 
>>>>>> ObjectStream<String> lineStream =
>>>>>>     new PlainTextByLineStream(new
>>>>>>     MarkableFileInputStreamFactory(params.getAnnotatedTrainingDataFile()),
>>>>>>     charset);
>>>>>> ObjectStream<NameSample> sampleStream = new
>>>>>>     NameSampleDataStream(lineStream);
>>>>>> 
>>>>>> where getAnnotatedTrainingDataFile returns a java File object.
>>>>>> 
>>>>>> thanks
>> 


Re: MarkableFileInputStreamFactory

2014-05-20 Thread Mark G
 +1 to move it


> On May 20, 2014, at 4:11 AM, Jörn Kottmann  wrote:
> 
>> On 05/19/2014 11:24 PM, Mark G wrote:
>> OK, thanks Carlos, I think I will commit the change, seems like it wouldn't
>> hurt. Anybody else?
> 
> We can't do it like that, the cmdline package is marked as internal usage 
> only. Therefore the classes in there should not be
> used outside of it. Inside cmdline pacakge the code is calling this factory 
> method:
> CmdLineUtil.createInputStreamFactory
> 
> Anyway, since user will need to create InputStreamFactory instances here and 
> there it might be a good
> idea to move this class to the util package, or introduce an 
> InputStreamFactoryUtil class which offers a couple
> of factory methods for various inputs (e.g. File, URL, Classpath, etc.).
> 
> Any thoughts?
> 
> Jörn
> 
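
A minimal sketch of the InputStreamFactoryUtil idea floated above, built on
the single createInputStream() method of opennlp.tools.util.InputStreamFactory;
the util class itself is hypothetical:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import opennlp.tools.util.InputStreamFactory;

public final class InputStreamFactoryUtil {

  private InputStreamFactoryUtil() {
  }

  /** Factory backed by a file; each call opens a fresh stream. */
  public static InputStreamFactory createFactory(final File file) {
    return new InputStreamFactory() {
      public InputStream createInputStream() throws IOException {
        return new FileInputStream(file);
      }
    };
  }

  /** Factory backed by a URL, e.g. a classpath or remote resource. */
  public static InputStreamFactory createFactory(final URL url) {
    return new InputStreamFactory() {
      public InputStream createInputStream() throws IOException {
        return url.openStream();
      }
    };
  }
}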


Re: MarkableFileInputStreamFactory

2014-05-19 Thread Mark G
No problem, Carlos are you using the model builder add on ?


Mg

> On May 19, 2014, at 6:29 PM, Carlos Scheidecker  wrote:
> 
> Thanks mate! Saw you updated the code. Cheers.
> 
> 
>> On Mon, May 19, 2014 at 3:24 PM, Mark G  wrote:
>> 
>> OK, thanks Carlos, I think I will commit the change, seems like it wouldn't
>> hurt. Anybody else?
>> 
>> 
>> On Mon, May 19, 2014 at 5:07 PM, Carlos Scheidecker wrote:
>> 
>>> I am having the same issue Mark.
>>> 
>>> The class is not public so it has no visibility
>>> inside opennlp.addons.modelbuilder.impls.GenericModelableImpl therefore
>> it
>>> cannot be built with Maven or resolved inside Eclipse.
>>> 
>>> I have also been looking at new commits to fix that and there were none.
>>> 
>>> 
>>>> On Mon, May 12, 2014 at 1:03 PM, Mark G  wrote:
>>>> 
>>>> Does MarkableFileInputStreamFactory need to be package private? I am
>>> using
>>>> it in an addon (modelbuilder-addon), I would like to either move it or
>>> make
>>>> it a public class. Perhaps I should be using a different class
>>> altogether?
>>>> 
>>>> I am using it like this
>>>> 
>>>> ObjectStream<String> lineStream =
>>>>     new PlainTextByLineStream(new
>>>>     MarkableFileInputStreamFactory(params.getAnnotatedTrainingDataFile()),
>>>>     charset);
>>>> ObjectStream<NameSample> sampleStream = new
>>>>     NameSampleDataStream(lineStream);
>>>> 
>>>> where getAnnotatedTrainingDataFile returns a java File object.
>>>> 
>>>> thanks
>> 


Re: MarkableFileInputStreamFactory

2014-05-19 Thread Mark G
OK, thanks Carlos, I think I will commit the change, seems like it wouldn't
hurt. Anybody else?


On Mon, May 19, 2014 at 5:07 PM, Carlos Scheidecker wrote:

> I am having the same issue Mark.
>
> The class is not public so it has no visibility
> inside opennlp.addons.modelbuilder.impls.GenericModelableImpl therefore it
> cannot be built with Maven or resolved inside Eclipse.
>
> I have also been looking at new commits to fix that and there were none.
>
>
> On Mon, May 12, 2014 at 1:03 PM, Mark G  wrote:
>
> > Does MarkableFileInputStreamFactory need to be package private? I am
> using
> > it in an addon (modelbuilder-addon), I would like to either move it or
> make
> > it a public class. Perhaps I should be using a different class
> altogether?
> >
> > I am using it like this
> >
> >  ObjectStream<String> lineStream =
> >      new PlainTextByLineStream(new
> >      MarkableFileInputStreamFactory(params.getAnnotatedTrainingDataFile()),
> >      charset);
> >  ObjectStream<NameSample> sampleStream = new
> >      NameSampleDataStream(lineStream);
> >
> > where getAnnotatedTrainingDataFile returns a java File object.
> >
> > thanks
> >
>


Re: TokenNameFinder and Span probs

2014-05-12 Thread Mark G
I'll be working on this the next few days, I'll put in different tickets to
cover the changes to NameFinder and SentenceDetector.


On Wed, May 7, 2014 at 3:22 AM, Joern Kottmann  wrote:

> Hello Mark,
>
> +1 for your second solution. I believe that is much more intuitive than
> calling a method afterwards to retrieve the prob for a Span.
> it is easier to use because the prob is delivered as part of the result and
> no user action is required to obtain it.
>
> We could use this solution everywhere where a span gets returned.
>
> Jörn
>
>
>
> On Wed, May 7, 2014 at 2:18 AM, Mark G  wrote:
>
> > I am currently working on a project in which we are using NER to pass
> > toponyms into the GeoEntityLinker addon for geotagging and I am passing
> on
> > the locations, entities, and other info into SOLR for indexing. Over the
> > years I have noticed that the TokenNameFinder interface does not include
> > all the probs() methods that the NameFinderME has, and furthermore the
> Span
> > object does not have a double field for storing a prob for itself.  Also
> > the sentenceDetector has a method called getSentenceProbabilities rather
> > than probs().
> > When I pass the Spans into the GeoEntityLinker/EntityLinker I can't get
> the
> > probs anymore because they are not in the Span objects. I can always
> extend
> > Span and add the field, or keep a 2D array of the probs for each
> sentence,
> > but wanted to see what everyone thinks about
> > 1. adding the probs methods to the TokenNameFinder interface
> > 2. adding a prob field to Span (a double)
> > 3. Having the NameFinder return the prob with each Span so it doesn't
> have
> > to be set after the call to find() using the double[] of probs
> > 4. Have the sentencedetectorME return its spans with a prob, add probs()
> > method to the SentenceDetector interface, and deprecate the
> > getSentenceProbabilities...
> >
> > Thoughts?
> >
>


MarkableFileInputStreamFactory

2014-05-12 Thread Mark G
Does MarkableFileInputStreamFactory need to be package private? I am using
it in an addon (modelbuilder-addon), I would like to either move it or make
it a public class. Perhaps I should be using a different class altogether?

I am using it like this

 ObjectStream<String> lineStream =
     new PlainTextByLineStream(new
     MarkableFileInputStreamFactory(params.getAnnotatedTrainingDataFile()),
     charset);
 ObjectStream<NameSample> sampleStream = new
     NameSampleDataStream(lineStream);

where getAnnotatedTrainingDataFile returns a java File object.

thanks


TokenNameFinder and Span probs

2014-05-06 Thread Mark G
I am currently working on a project in which we are using NER to pass
toponyms into the GeoEntityLinker addon for geotagging and I am passing on
the locations, entities, and other info into SOLR for indexing. Over the
years I have noticed that the TokenNameFinder interface does not include
all the probs() methods that the NameFinderME has, and furthermore the Span
object does not have a double field for storing a prob for itself.  Also
the sentenceDetector has a method called getSentenceProbabilities rather
than probs().
When I pass the Spans into the GeoEntityLinker/EntityLinker I can't get the
probs anymore because they are not in the Span objects. I can always extend
Span and add the field, or keep a 2D array of the probs for each sentence,
but wanted to see what everyone thinks about
1. adding the probs methods to the TokenNameFinder interface
2. adding a prob field to Span (a double)
3. Having the NameFinder return the prob with each Span so it doesn't have
to be set after the call to find() using the double[] of probs
4. Have the sentencedetectorME return its spans with a prob, add probs()
method to the SentenceDetector interface, and deprecate the
getSentenceProbabilities...

Thoughts?
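
To make options 2 and 3 concrete, a minimal sketch of a Span subclass that
carries its probability, populated from the existing NameFinderME.probs(Span[])
method. The ProbSpan class is hypothetical, not an existing OpenNLP type:

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.util.Span;

/** Hypothetical Span subclass carrying the model probability (option 2). */
class ProbSpan extends Span {

  private final double prob;

  ProbSpan(Span span, double prob) {
    super(span.getStart(), span.getEnd(), span.getType());
    this.prob = prob;
  }

  double getProb() {
    return prob;
  }
}

class ProbSpanExample {

  /** Attaches the probs returned by the name finder to the spans it found (option 3). */
  static ProbSpan[] findWithProbs(NameFinderME nameFinder, String[] tokens) {
    Span[] spans = nameFinder.find(tokens);
    double[] probs = nameFinder.probs(spans);  // existing NameFinderME method
    ProbSpan[] result = new ProbSpan[spans.length];
    for (int i = 0; i < spans.length; i++) {
      result[i] = new ProbSpan(spans[i], probs[i]);
    }
    return result;
  }
}

With something like this, downstream consumers such as the EntityLinker no
longer need the 2D prob arrays to travel alongside the spans.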


Re: OPENNLP-683

2014-05-06 Thread Mark G
+1 for me as well. Looks useful.


On Tue, May 6, 2014 at 4:48 PM, Tommaso Teofili
wrote:

> +1 from me, I think it would be an interesting and useful contribution.
>
> Tommaso
>
>
> 2014-05-06 20:50 GMT+02:00 Jörn Kottmann :
>
> > Hello,
> >
> > we got a question in
> >
> > https://issues.apache.org/jira/browse/OPENNLP-683
> >
> > if it would be interesting to implement a rule based
> > lemmatizer as explained in the issue.
> >
> > Any opinions about it?
> >
> > Jörn
> >
> >
>


Re: DocumentSample in Doccat

2014-04-27 Thread Mark G
In my local copy I have these methods in the interface:
 Map<String, Double> scoreMap(String text);
 SortedMap<Double, Set<String>> sortedScoreMap(String text);

and these impls of them in the ME impl


  public Map<String, Double> scoreMap(String text) {
    Map<String, Double> probDist = new HashMap<String, Double>();

    double[] categorize = categorize(text);
    int catSize = getNumberOfCategories();
    for (int i = 0; i < catSize; i++) {
      String category = getCategory(i);
      probDist.put(category, categorize[getIndex(category)]);
    }
    return probDist;
  }

  public SortedMap<Double, Set<String>> sortedScoreMap(String text) {
    SortedMap<Double, Set<String>> descendingMap =
        new TreeMap<Double, Set<String>>().descendingMap();
    double[] categorize = categorize(text);
    int catSize = getNumberOfCategories();
    for (int i = 0; i < catSize; i++) {
      String category = getCategory(i);
      double score = categorize[getIndex(category)];
      if (descendingMap.containsKey(score)) {
        descendingMap.get(score).add(category);
      } else {
        Set<String> newset = new HashSet<String>();
        newset.add(category);
        descendingMap.put(score, newset);
      }
    }
    return descendingMap;
  }


They are pretty simple, but if everyone agrees I can commit them (with some
javadocs).





On Sat, Apr 26, 2014 at 8:39 AM, Jörn Kottmann  wrote:

> On Thu, 2014-04-24 at 19:54 -0300, William Colen wrote:
> > Yes, it looks nice. Maybe we should redo all the DocumentCategorizer
> > interface. It is different from other tools, for example, we can't get
> the
> > best category of one document with only one call, we need to use two
> > methods.
>
> Yes that is right. +1 to change it. Can we deprecate the old methods and
> just add new ones to not break backward compatibility?
>
> Jörn
>
>
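
Assuming the two methods above get added as written, usage would look roughly
like this (a sketch, not the released API at the time of this thread):

import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;

class ScoreMapExample {

  static void categorize(DoccatModel model, String text) {
    DocumentCategorizerME doccat = new DocumentCategorizerME(model);

    // Category -> probability for every category in the model.
    Map<String, Double> scores = doccat.scoreMap(text);

    // Probability -> categories, best score first (ties share one entry).
    SortedMap<Double, Set<String>> sorted = doccat.sortedScoreMap(text);
    Set<String> bestCategories = sorted.get(sorted.firstKey());

    System.out.println(scores + " best: " + bestCategories);
  }
}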


Re: DocumentSample in Doccat

2014-04-24 Thread Mark G
William, here is another thought: we could include something like this to
return a map sorted descending with the best score on top... so you can
call categoriesAsSortedMap("").firstEntry() to get the best score (which
can be the same for more than one category, hence the Set as value)

  public NavigableMap<Double, Set<String>> categoriesAsSortedMap(String text) {
    NavigableMap<Double, Set<String>> descendingMap =
        new TreeMap<Double, Set<String>>().descendingMap();
    double[] categorize = categorize(text);
    int catSize = getNumberOfCategories();
    for (int i = 0; i < catSize; i++) {
      String category = getCategory(i);
      double score = categorize[getIndex(category)];
      if (descendingMap.containsKey(score)) {
        descendingMap.get(score).add(category);
      } else {
        Set<String> newset = new HashSet<String>();
        newset.add(category);
        descendingMap.put(score, newset);
      }
    }
    return descendingMap;
  }


On Thu, Apr 24, 2014 at 7:04 PM, Tech mail  wrote:

> I think it might also be true that the featuregenerator interface in
> doccat is different than the others, also I don't think the tokennamefinder
> interface has a probs() method, which has always made me use the ME impl
> directly.
>
> Sent from my iPhone
>
> > On Apr 24, 2014, at 6:54 PM, William Colen 
> wrote:
> >
> > Yes, it looks nice. Maybe we should redo all the DocumentCategorizer
> > interface. It is different from other tools, for example, we can't get
> the
> > best category of one document with only one call, we need to use two
> > methods.
> >
> >
> >
> > 2014-04-24 18:43 GMT-03:00 Mark G :
> >
> >> William, that map looks good to me.
> >> In my current project I find this method convenient for getting back the
> >> probs over the categories in the model as a Map<String, Double>... let me
> >> know if there's anything wrong with it :)
> >>
> >> public Map<String, Double> categoriesAsMap(String text) {
> >>   Map<String, Double> probDist = new HashMap<String, Double>();
> >>
> >>   double[] categorize = categorize(text);
> >>   int catSize = getNumberOfCategories();
> >>   for (int i = 0; i < catSize; i++) {
> >>     String category = getCategory(i);
> >>     probDist.put(category, categorize[getIndex(category)]);
> >>   }
> >>   return probDist;
> >> }
> >>
> >> perhaps we should consider adding this method to abstract some
> >> details... just a thought
> >>
> >>
> >>
> >>
> >>
> >> On Thu, Apr 24, 2014 at 3:56 PM, William Colen wrote:
> >>
> >>> What do you think of adding the following field to the DocumentSample?
> >>>
> >>> Map extraInformation
> >>>
> >>>
> >>> Also, we could add the following methods to the DocumentCategorizer
> >>> interface:
> >>>
> >>> public double[] categorize(String text[], Map
> >>> extraInformation);
> >>> public double[] categorize(String documentText, Map
> >>> extraInformation);
> >>>
> >>> Any opinion?
> >>>
> >>> Thank you,
> >>> William
> >>>
> >>>
> >>> 2014-04-17 10:39 GMT-03:00 Mark G :
> >>>
> >>>> Another general doccat thought I had is this. in my projects that use
> >>>> Doccat, I created a class called a samplecollection, which simply
> >>> wrapped a
> >>>> list but then provided  a method that returned the
> >>> samples
> >>>> as a DoccatModel (using a properly formatted ByteArrayInputStream of
> >> the
> >>>> doccat training format of all the samples). This worked out well
> >> because
> >>> I
> >>>> stored all the samples in a database, and users could CRUD samples for
> >>>> different categories. There was a map reduce job that at job startup
> >> read
> >>>> in the samples from the database into the samplecollection,
> dynamically
> >>>> generated the model, and then used the model to classify all the texts
> >>>> across the cluster; so every MR job ran the latest and greatest model
> >>> based
> >>>> on current samples. Not sure if we're interested in something like
> >> that,
> >>>> but I see several questions on stack overflow asking about iterative
> >>> model
> >>>> building, and a SampleCollection that returns a Model has worked for
> >> me.
> >>> I
> >>>> also created a Sam

Re: End of line whitespaces in Eclipse

2014-04-24 Thread Mark G
agreed



On Thu, Apr 24, 2014 at 3:50 PM, William Colen wrote:

> I think we should do it.
>
> 2014-04-22 8:50 GMT-03:00 Jörn Kottmann :
>
> > We should maybe once remove all these white spaces
> > at the end of lines. And maybe repeat that process
> > for every release.
> >
> > Now days there are tools which can diff the files
> > ignoring white space only changes.
> >
> > Any opinions?
> >
> > Jörn
> >
> > On Thu, 2014-04-10 at 19:58 -0300, William Colen wrote:
> > > When I save a .java file in Eclipse, it is removing the end of line
> > > whitespaces. I am using the
> > > http://opennlp.apache.org/code-formatter/OpenNLP-Eclipse-Formatter.xml
> > >
> > > This is causing lots of changes in files I actually needed to change
> only
> > > one line. Do anybody know how to I avoid it?
> > >
> > > Thank you,
> > > William
> >
> >
> >
>


Re: DocumentSample in Doccat

2014-04-24 Thread Mark G
William, that map looks good to me.
In my current project I find this method convenient for getting back the
probs over the categories in the model as a Map<String, Double>... let me
know if there's anything wrong with it :)

public Map<String, Double> categoriesAsMap(String text) {
  Map<String, Double> probDist = new HashMap<String, Double>();

  double[] categorize = categorize(text);
  int catSize = getNumberOfCategories();
  for (int i = 0; i < catSize; i++) {
    String category = getCategory(i);
    probDist.put(category, categorize[getIndex(category)]);
  }
  return probDist;
}

perhaps we should consider adding this method to abstract some
details... just a thought





On Thu, Apr 24, 2014 at 3:56 PM, William Colen wrote:

> What do you think of adding the following field to the DocumentSample?
>
> Map extraInformation
>
>
> Also, we could add the following methods to the DocumentCategorizer
> interface:
>
> public double[] categorize(String text[], Map
> extraInformation);
> public double[] categorize(String documentText, Map
> extraInformation);
>
> Any opinion?
>
> Thank you,
> William
>
>
> 2014-04-17 10:39 GMT-03:00 Mark G :
>
> > Another general doccat thought I had is this. in my projects that use
> > Doccat, I created a class called a samplecollection, which simply
> wrapped a
> > list but then provided  a method that returned the
> samples
> > as a DoccatModel (using a properly formatted ByteArrayInputStream of the
> > doccat training format of all the samples). This worked out well because
> I
> > stored all the samples in a database, and users could CRUD samples for
> > different categories. There was a map reduce job that at job startup read
> > in the samples from the database into the samplecollection, dynamically
> > generated the model, and then used the model to classify all the texts
> > across the cluster; so every MR job ran the latest and greatest model
> based
> > on current samples. Not sure if we're interested in something like that,
> > but I see several questions on stack overflow asking about iterative
> model
> > building, and a SampleCollection that returns a Model has worked for me.
>  I
> > also created a SampleCRUD interface that abstracts storage and retrieval
> of
> > the samples I had a Postgres and Accumulo impl for sample storage.
> > just a thought, I know this can get very specific and complicated,
> thought
> > we may be able to find a middle ground by providing a framework and some
> > generic impls.
> > MG
> >
> >
> > On Thu, Apr 17, 2014 at 8:28 AM, William Colen  > >wrote:
> >
> > > Yes, I don't see how to represent the sentences and paragraphs.
> > >
> > > +1 for the generic Map as suggested by Mark. We already have such
> things
> > in
> > > other sample classes, like NameSample and the POSSample.
> > >
> > > A use case: the 20news corpus is a collection of articles, and each
> > article
> > > contains fields like "From", "Subject", "Organization". Mahout, which
> > > includes a formatter for this corpus, concatenate it all to the text
> > field,
> > > but I think we could improve accuracy by handling this metadata in a
> > > separated feature generator.
> > >
> > >
> > > 2014-04-17 8:37 GMT-03:00 Tech mail :
> > >
> > > > I agree, this goes back to the concept of having a "document"
> model...
> > > > I know in the prod systems I've used doccat, storing sentences and
> > > > paragraphs wouldn't make sense, people usually have their own domain
> > > model
> > > > for that. I still feel like if we augment the documentsample object
> > with
> > > a
> > > > generic Map it would be helpful in some cases and not constraining
> > > >
> > > > Sent from my iPhone
> > > >
> > > > > On Apr 17, 2014, at 6:35 AM, Jörn Kottmann 
> > wrote:
> > > > >
> > > > >> On 04/15/2014 07:45 PM, William Colen wrote:
> > > > >> Hello,
> > > > >>
> > > > >> I've been working with the Doccat module and I am wondering if we
> > > could
> > > > >> improve its data structure for the 1.6.0 release.
> > > > >>
> > > > >> Today the DocumentSample has the following attributes:
> > > > >>
> > > > >> - String category
> > > > >> - List text
> > > > >>
> > > > >> I would suggest adding an attribute to hold metadata, or
> additional
> > > > >> contexts information. What do you think?
> > > > >
> > > > > Right now the training format contains these two fields per line.
> > > > > Do you want to change the format as well?
> > > > >
> > > > >> Also, what do you think of including sentences and paragraph
> > > > information? I
> > > > >> don't know if there is anything a feature generator can extract
> from
> > > it
> > > > to
> > > > >> improve the classification.
> > > > >
> > > > > I guess we only want to do that if there is a use case for it. It
> > will
> > > > make the processing for the clients
> > > > > more complex, since they then would have to provide sentences and
> > > > paragraphs compared to just
> > > > > a piece of text.
> > > > >
> > > > > Jörn
> > > >
> > >
> >
>


Re: DocumentSample in Doccat

2014-04-17 Thread Mark G
Another general doccat thought I had is this: in my projects that use
Doccat, I created a class called a samplecollection, which simply wrapped a
list but then provided a method that returned the samples
as a DoccatModel (using a properly formatted ByteArrayInputStream of the
doccat training format of all the samples). This worked out well because I
stored all the samples in a database, and users could CRUD samples for
different categories. There was a map reduce job that at job startup read
in the samples from the database into the samplecollection, dynamically
generated the model, and then used the model to classify all the texts
across the cluster; so every MR job ran the latest and greatest model based
on current samples. Not sure if we're interested in something like that,
but I see several questions on stack overflow asking about iterative model
building, and a SampleCollection that returns a Model has worked for me. I
also created a SampleCRUD interface that abstracts storage and retrieval of
the samples; I had a Postgres and Accumulo impl for sample storage.
just a thought, I know this can get very specific and complicated, thought
we may be able to find a middle ground by providing a framework and some
generic impls.
MG
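
A minimal sketch of the SampleCollection idea described above: turning an
in-memory string of doccat training data (one "category text..." sample per
line) into a model. The train signature is the 1.6-era one, and the inline
InputStreamFactory is an assumption of this sketch:

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

class DynamicDoccatModel {

  /** Trains a doccat model from samples held in memory (e.g. read from a database). */
  static DoccatModel train(String trainingText) throws Exception {
    // One "category text..." sample per line, the normal doccat training format.
    final byte[] bytes = trainingText.getBytes(StandardCharsets.UTF_8);

    InputStreamFactory isf = new InputStreamFactory() {
      public InputStream createInputStream() {
        return new ByteArrayInputStream(bytes);
      }
    };

    ObjectStream<String> lines = new PlainTextByLineStream(isf, StandardCharsets.UTF_8);
    ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

    // Signature assumed from the 1.6-era API.
    return DocumentCategorizerME.train("en", samples,
        TrainingParameters.defaultParams(), new DoccatFactory());
  }
}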


On Thu, Apr 17, 2014 at 8:28 AM, William Colen wrote:

> Yes, I don't see how to represent the sentences and paragraphs.
>
> +1 for the generic Map as suggested by Mark. We already have such things in
> other sample classes, like NameSample and the POSSample.
>
> A use case: the 20news corpus is a collection of articles, and each article
> contains fields like "From", "Subject", "Organization". Mahout, which
> includes a formatter for this corpus, concatenate it all to the text field,
> but I think we could improve accuracy by handling this metadata in a
> separated feature generator.
>
>
> 2014-04-17 8:37 GMT-03:00 Tech mail :
>
> > I agree, this goes back to the concept of having a "document" model...
> > I know in the prod systems I've used doccat, storing sentences and
> > paragraphs wouldn't make sense, people usually have their own domain
> model
> > for that. I still feel like if we augment the documentsample object with
> a
> > generic Map it would be helpful in some cases and not constraining
> >
> > Sent from my iPhone
> >
> > > On Apr 17, 2014, at 6:35 AM, Jörn Kottmann  wrote:
> > >
> > >> On 04/15/2014 07:45 PM, William Colen wrote:
> > >> Hello,
> > >>
> > >> I've been working with the Doccat module and I am wondering if we
> could
> > >> improve its data structure for the 1.6.0 release.
> > >>
> > >> Today the DocumentSample has the following attributes:
> > >>
> > >> - String category
> > >> - List text
> > >>
> > >> I would suggest adding an attribute to hold metadata, or additional
> > >> contexts information. What do you think?
> > >
> > > Right now the training format contains these two fields per line.
> > > Do you want to change the format as well?
> > >
> > >> Also, what do you think of including sentences and paragraph
> > information? I
> > >> don't know if there is anything a feature generator can extract from
> it
> > to
> > >> improve the classification.
> > >
> > > I guess we only want to do that if there is a use case for it. It will
> > make the processing for the clients
> > > more complex, since they then would have to provide sentences and
> > paragraphs compared to just
> > > a piece of text.
> > >
> > > Jörn
> >
>


Temporal Extractor Proposal

2014-03-07 Thread Mark G
Hello all, I would like to propose the development of a Temporal Extraction
addon. In the industry I work in, there is a need to support search of
documents/entities by location and date mentions within the document text.
I feel pretty good about the GeoEntityLinker addon for providing geocoding,
but now I need to do date extraction.

This addon I propose would take text, and return a real java.util.Date,
with a precision, likely stored in an extended Span object. Initially, I
would like it to deal with year, seasonal, month, and day level references,
and return a real Date and a precision. Don't care so much about days of
week mentions and such, this is geared more towards supporting search and
other datetime related analytics.

I have done this before to some degree a while back, and I have done
research that leads to a couple different approaches:
1. All regex based extraction, and then a series of rules for cleaning the
results.
pros: no training, simple configuration, predictable output
cons: regexes are confusing as they mature, regexes are not context specific
2. Machine learning (like the current opennlp model/NER can do pretty well)
pros: based on user data (if trained on it), adaptive etc
cons: unpredictable strings as a result, hard to deal with.
3. A combination of Regex extraction and ML, in which the regex results are
highly specific and used for sentence annotation for building a model.
pros: model based on regex results on user data, adaptive, more recall than
option 1, more predictable results than option 2
cons: laborious processing (run regex extraction, produce annotations,
build a model, etc.), still deal with unpredictable results

My recommendation is option 3. I would like to write a regex based
extractor that stands alone, but also write an impl for the
modelbuilder-addon that would use the regex based extractor to create
annotations for the model building process that occurs in the
modelbuilder-addon (which automates annotation and model building based on
user defined "known entities" and sentences). Option three would also
provide "simple" and "advanced" versions of temporal extraction.

This is a complex process, so let us know if you see utility in this, and
please provide any insights.

sorry for the long email

thanks
Mark G
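
To make option 1 concrete, a minimal sketch of a regex-based extractor that
returns a real java.util.Date plus a precision; the pattern here is a toy
covering only month-year mentions:

import java.util.ArrayList;
import java.util.Calendar;
import java.util.Date;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of option 1: regex extraction that yields a real Date plus a precision. */
class RegexDateExtractor {

  enum Precision { YEAR, MONTH, DAY }

  static class DateMention {
    final Date date;
    final Precision precision;
    final int start, end; // character offsets into the input text

    DateMention(Date date, Precision precision, int start, int end) {
      this.date = date;
      this.precision = precision;
      this.start = start;
      this.end = end;
    }
  }

  private static final String[] MONTHS = {"January", "February", "March", "April",
      "May", "June", "July", "August", "September", "October", "November", "December"};

  // "March 2014"-style mentions; a real extractor needs many more patterns
  // (day level, seasons, numeric formats) plus cleanup rules.
  private static final Pattern MONTH_YEAR = Pattern.compile(
      "\\b(January|February|March|April|May|June|July|August|September|October"
          + "|November|December)\\s+((?:19|20)\\d{2})\\b",
      Pattern.CASE_INSENSITIVE);

  static List<DateMention> extract(String text) {
    List<DateMention> mentions = new ArrayList<DateMention>();
    Matcher m = MONTH_YEAR.matcher(text);
    while (m.find()) {
      Calendar cal = Calendar.getInstance();
      cal.clear();
      // Day defaults to 1; the precision field records that only the month is known.
      cal.set(Integer.parseInt(m.group(2)), monthIndex(m.group(1)), 1);
      mentions.add(new DateMention(cal.getTime(), Precision.MONTH, m.start(), m.end()));
    }
    return mentions;
  }

  private static int monthIndex(String name) {
    for (int i = 0; i < MONTHS.length; i++) {
      if (MONTHS[i].equalsIgnoreCase(name)) {
        return i; // Calendar months are zero-based
      }
    }
    return 0;
  }
}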


Re: Namefinder Changes

2014-03-05 Thread Mark G
thanks Joern, I'll take a closer look.


On Wed, Mar 5, 2014 at 2:30 PM, Jörn Kottmann  wrote:

> Have a look at the Sequence Coding thread here on the list.
>
> The name finder has always used IOB2 coding by default; we made this now
> configurable and it can be replaced by other codecs such as BILOU, or, when
> the work is done, by a user-implemented codec.
>
> To detect names in a sentence the name finder uses a learnable
> classifier. The classifier
> has to decide if a token is part of a name or not. The logic on which labels
> are used to encode/
> decode name spans is now the responsibility of the SequenceCodec object.
>
> In the IOB2 codec (see the BioCodec class) the tokens are labeled as Begin,
> Inside, Other.
> Each new name span has to start with the Begin label.
>
> The BILOU codec uses the following labels: Begin, Inside, Last, Unit and
> Other.
>
> There might be advantages to switching the codec depending on the data you are
> using;
> on the German CONLL03 data the evaluation results are slightly better with
> BILOU
> instead of IOB2.
>
> The BILOU codec uses more labels, and will be more resource intensive
> compared to IOB2.
>
> Also have a look at the wikipedia article about IOB:
> http://en.wikipedia.org/wiki/Inside_Outside_Beginning
>
> HTH,
> Jörn
>
>
> On 03/05/2014 02:18 PM, Mark G wrote:
>
>> Hello, I updated the tools trunk two days ago and stopped getting NER
>> results. I chatted with Joern and he made a change to the seq codec that
>> brought everything back to normal. For the benefit of everyone on the dev
>> list, would it be possible for someone to explain the changes regarding
>> the
>> sequence codec: its benefits, the differences, and where in the code to
>> look to see what it is actually doing. Don't need anything elaborate, just
>> a point of departure for inquiry.
>> MG
>>
>>
>
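
To make the codec difference concrete, here is how the two schemes would
label the tokens of "Pierre Vinken lives in New York" for person and
location names (illustrative labels, not tool output):

  Token    IOB2         BILOU
  Pierre   B-person     B-person
  Vinken   I-person     L-person
  lives    O            O
  in       O            O
  New      B-location   B-location
  York     I-location   L-location

A single-token name such as "Paris" stays B-location in IOB2 but becomes
U-location (Unit) in BILOU, which is where the extra labels, and the extra
resource cost Jörn mentions, come from.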


Namefinder Changes

2014-03-05 Thread Mark G
Hello, I updated the tools trunk two days ago and stopped getting NER
results. I chatted with Joern and he made a change to the seq codec that
brought everything back to normal. For the benefit of everyone on the dev
list, would it be possible for someone to explain the changes regarding the
sequence codec: its benefits, the differences, and where in the code to
look to see what it is actually doing. Don't need anything elaborate, just
a point of departure for inquiry.
MG


Re: svn commit: r1564395 - in /opennlp/trunk/opennlp-tools/src: main/java/opennlp/tools/namefind/RegexNameFinder.java main/java/opennlp/tools/namefind/RegexNameFinderFactory.java test/java/opennlp/too

2014-02-06 Thread Mark G
ok, sounds good


On Thu, Feb 6, 2014 at 5:16 AM, Jörn Kottmann  wrote:

> Hi Mark,
>
> we should not remove the constructors, because that will break backward
> compatibility,
>
> This one: public RegexNameFinder(Pattern patterns[]) should be marked as
> deprecated,
> and the other one we should probably keep, because it makes it easy to
> just run it with one
> type.
>
> When you deprecate it, please add a comment to point the user to the other
> constructor, otherwise
> they don't know where to look.
>
> Jörn
>
>
> On 02/04/2014 06:33 PM, ma...@apache.org wrote:
>
>> Author: markg
>> Date: Tue Feb  4 17:33:59 2014
>> New Revision: 1564395
>>
>> URL: http://svn.apache.org/r1564395
>> Log:
>> OPENNLP-643
>> Removed old constructors in lieu of Map constructor and
>> changed find methods appropriately. Updated unit tests for RegexNameFinder.
>>
>> Modified:
>>  opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/
>> namefind/RegexNameFinder.java
>>  opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/namefind/
>> RegexNameFinderFactory.java
>>  opennlp/trunk/opennlp-tools/src/test/java/opennlp/tools/
>> namefind/RegexNameFinderTest.java
>>
>> Modified: opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/
>> namefind/RegexNameFinder.java
>> URL: http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/
>> src/main/java/opennlp/tools/namefind/RegexNameFinder.java?
>> rev=1564395&r1=1564394&r2=1564395&view=diff
>> 
>> ==
>> --- 
>> opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/namefind/RegexNameFinder.java
>> (original)
>> +++ 
>> opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/namefind/RegexNameFinder.java
>> Tue Feb  4 17:33:59 2014
>> @@ -30,8 +30,6 @@ import opennlp.tools.util.Span;
>>*/
>>   public final class RegexNameFinder implements TokenNameFinder {
>>   -  private Pattern mPatterns[];
>> -  private String sType;
>> private Map regexMap;
>>   public RegexNameFinder(Map regexMap) {
>> @@ -42,24 +40,6 @@ public final class RegexNameFinder imple
>>   }
>>   -  public RegexNameFinder(Pattern patterns[], String type) {
>> -if (patterns == null || patterns.length == 0) {
>> -  throw new IllegalArgumentException("patterns must not be null or
>> empty!");
>> -}
>> -
>> -mPatterns = patterns;
>> -sType = type;
>> -  }
>> -
>> -  public RegexNameFinder(Pattern patterns[]) {
>> -if (patterns == null || patterns.length == 0) {
>> -  throw new IllegalArgumentException("patterns must not be null or
>> empty!");
>> -}
>> -
>> -mPatterns = patterns;
>> -sType = null;
>> -  }
>> -
>> @Override
>> public Span[] find(String tokens[]) {
>>   Map sentencePosTokenMap = new HashMap<>();
>> @@ -83,7 +63,7 @@ public final class RegexNameFinder imple
>> Collection annotations = new LinkedList<>();
>>   -if (mPatterns == null && regexMap != null) {
>> +if (regexMap != null) {
>> for (Map.Entry entry : regexMap.entrySet()) {
>>   for (Pattern mPattern : entry.getValue()) {
>> Matcher matcher = mPattern.matcher(sentenceString);
>> @@ -101,25 +81,10 @@ public final class RegexNameFinder imple
>> }
>>   }
>> }
>> -} else {
>> -  for (Pattern mPattern : mPatterns) {
>> -Matcher matcher = mPattern.matcher(sentenceString);
>> -
>> -while (matcher.find()) {
>> -  Integer tokenStartIndex =
>> -  sentencePosTokenMap.get(matcher.start());
>> -  Integer tokenEndIndex =
>> -  sentencePosTokenMap.get(matcher.end());
>> -
>> -  if (tokenStartIndex != null && tokenEndIndex != null) {
>> -Span annotation = new Span(tokenStartIndex, tokenEndIndex,
>> sType);
>> -annotations.add(annotation);
>> -  }
>> -}
>> -  }
>>   }
>> +
>>   return annotations.toArray(
>>   new Span[annotations.size()]);
>> }
>> @@ -138,7 +103,7 @@ public final class RegexNameFinder imple
>>   private Span[] getAnnotations(String text) {
>>   Collection annotations = new LinkedList<>();
>> -if (mPatterns == null && regexMap != null) {
>> +if (regexMap != null) {
>> for (Map.Entry entry : regexMap.entrySet()) {
>>   for (Pattern mPattern : entry.getValue()) {
>> Matcher matcher = mPattern.matcher(text);
>> @@ -152,20 +117,7 @@ public final class RegexNameFinder imple
>> }
>>   }
>> }
>> -} else {
>> -  for (Pattern mPattern : mPatterns) {
>> -Matcher matcher = mPattern.matcher(text);
>> -
>> -while (matcher.find()) {
>> -  Integer tokenStartIndex = matcher.start();
>> -  Integer tokenEndIndex = matcher.end();
>> -  Span annotation = new Span(tokenStartIndex, tokenEndIndex,
>> sType);
>> -  annotations.add(annotation);
>> -
>> -}
>> -  

Re: Calling all regexes

2014-02-04 Thread Mark G
thanks, glad you think so too.  you all can also post regexes to this jira
ticket if you want.

https://issues.apache.org/jira/browse/OPENNLP-643



On Tue, Feb 4, 2014 at 12:45 PM, Darrell Berry <
darrell.be...@significancesystems.com> wrote:

> What an excellent idea. Best wishes for success! My current holy grail is a
> date parser that can actually deal with real world publications
> cross-region, and make sensible guesses at resolving ambiguities!
> On 4 Feb 2014 17:42, "Mark G"  wrote:
>
> > If anyone would like to (and is allowed to) donate their favorite
> > entity-extracting regexes (phone numbers for your country or any country
> > for that matter, dates and times, you name it) we are working on a regex
> > namefinder factory that will process a bunch of default types... your type
> > could become one of those defaults. Just reply to this thread and I will
> > try to work them in.
> >
>


Calling all regexes

2014-02-04 Thread Mark G
If anyone would like to (and is allowed to) donate their favorite
entity-extracting regexes (phone numbers for your country or any country
for that matter, dates and times, you name it) we are working on a regex
namefinder factory that will process a bunch of default types... your type
could become one of those defaults. Just reply to this thread and I will
try to work them in.
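
A short usage sketch against the map-based constructor from OPENNLP-643; the
map type Map<String, Pattern[]> is inferred from the committed diff, and the
phone-number pattern is a toy, not one of the proposed defaults:

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;
import opennlp.tools.namefind.RegexNameFinder;
import opennlp.tools.util.Span;

class RegexNameFinderExample {

  public static void main(String[] args) {
    Map<String, Pattern[]> regexMap = new HashMap<String, Pattern[]>();
    // Toy US-style phone number pattern for illustration only.
    regexMap.put("phone number", new Pattern[] {
        Pattern.compile("\\d{3}-\\d{3}-\\d{4}")
    });

    RegexNameFinder finder = new RegexNameFinder(regexMap);
    String[] tokens = {"call", "me", "at", "555-123-4567", "tomorrow"};
    for (Span span : finder.find(tokens)) {
      // The span's type comes from the map key it matched under.
      System.out.println(span.getType() + ": " + tokens[span.getStart()]);
    }
  }
}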


Re: How to add POS tag feature in Namefinder

2014-01-30 Thread Mark G
Alternatively, could he pass in String[] tokens with each token having the
POS appended to it? For instance String[]{"bob_nn","went_vv","home_nn"}.
Inside the createFeatures method he would then just take the contents of
the array and populate the List...



On Jan 30, 2014 8:16 AM, "Jörn Kottmann"  wrote:

> Hello,
>
> the current interface does not support passing in pos tags.
>
> One option would be to just run the pos tagger inside the feature
> generation
> of the name finder.
>
> Would that help you?
>
> Jörn
>
> On 01/30/2014 12:47 AM, Zoljargal Munkhjargal wrote:
>
>> Dear members,
>>
>> I am trying to set up the OpenNLP NameFinder in a project with a
>> part-of-speech tag feature. I extended my feature class from the
>> FeatureGeneratorAdapter class,
>> and overrode the following method. Unfortunately this method takes just raw
>> tokens as its parameter. The problem is how to pass POS tag information
>> into this method?
>>
>> public void createFeatures(List<String> features, String[] tokens, int index,
>> String[] previousOutcomes)
>>
>>
>
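
A minimal sketch of the token-suffix workaround Mark describes above, using
the createFeatures signature quoted in the thread; the class itself is
hypothetical:

import java.util.List;
import opennlp.tools.util.featuregen.FeatureGeneratorAdapter;

/**
 * Hypothetical generator: expects each token in the form "word_pos",
 * e.g. {"bob_nn", "went_vv", "home_nn"}, and emits both parts as features.
 */
public class PosSuffixFeatureGenerator extends FeatureGeneratorAdapter {

  @Override
  public void createFeatures(List<String> features, String[] tokens, int index,
      String[] previousOutcomes) {
    String token = tokens[index];
    int sep = token.lastIndexOf('_');
    if (sep > 0) {
      features.add("w=" + token.substring(0, sep));    // the word itself
      features.add("pos=" + token.substring(sep + 1)); // the appended POS tag
    } else {
      features.add("w=" + token);
    }
  }
}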


Re: Support for sequence models

2014-01-20 Thread Mark G
Sounds like a good change, but I have two questions: will this affect the
current APIs? Will people's maxent models still work if they are using a
Maxent model now in a component that will soon require a Seq tagging model
after the change?


On Mon, Jan 20, 2014 at 8:44 AM, Richard Eckart de Castilho <
richard.eck...@gmail.com> wrote:

> Would it still be possible to use the current 1.5.x models with OpenNLP
> after the change?
>
> -- Richard
>
> On 20.01.2014, at 07:48, Jörn Kottmann  wrote:
>
> > Hi all,
> >
> > in OpenNLP we have a couple of components which rely on sequence tagging.
> > Right now they are using a normal classifier and search for a good
> sequence via beam search.
> >
> > I would like to propose that we change that a bit, all components which
> are based on sequence
> > tagging should use a Sequence Classification Model instead of directly
> using an
> > Event Classification Model (currently named MaxentModel).
> >
> > The change will have two advantages: it will be possible to integrate ML
> > algorithms which operate on a sequence
> > level (e.g. CRF), and it would be easy to exchange beam search for a
> > similar (maybe enhanced) algorithm.
> >
> > On the training side we already have support for training on sequences.
> Anyway the current implementation is a bit
> > unlucky because the sequence training class can only return an Event
> Classification Model. I will change that so that
> > a Sequence Classification Model has to be returned, and the Perceptron
> Sequence Model will be returned as a
> > Sequence Classification Model instead.
> >
> > Any thoughts?
> >
> > Jörn
>


Re: Addons

2014-01-10 Thread Mark G
Coincidentally, I was going to move it today after a commit to the sandbox.
I am also going to put the modelbuilder tool I wrote in there as well.
Thanks


On Fri, Jan 10, 2014 at 5:42 AM, Jörn Kottmann  wrote:

> On 12/04/2013 01:16 PM, Mark G wrote:
>
>> should I move the GeoEntityLinker into this new addons dir? it is
>> currently
>> in the sandbox in a mvn module called addons.
>>
>
> Do you mind if I go ahead and move it? You should check in all changes
> before.
>
> Jörn
>


Re: Addons

2013-12-04 Thread Mark G
should I move the GeoEntityLinker into this new addons dir? it is currently
in the sandbox in a mvn module called addons.


On Fri, Nov 15, 2013 at 5:26 AM, Jörn Kottmann  wrote:

> On 11/15/2013 10:34 AM, William Colen wrote:
>
>> Nice! How will we proceed with addons which depend on incompatible
>> licenses, such as GPL? For example, an extension to use Morfologik as a
>> POS
>> Dictionary, or Weka as an ML engine?
>>
>
> We could create another addon repository outside of Apache to integrate
> components which use
> dependencies which are not Apache compatible.
>
> Anyway, Morfologik is BSD licensed and I created a Morfologik addon for
> the lemmatizer contribution
> we received a while back. It would be nice if you could add your POS
> dictionary there.
>
> Jörn
>


Re: svn commit: r1544904 - in /opennlp/sandbox/opennlp-coref: ./ src/main/java/opennlp/tools/coref/ src/main/java/opennlp/tools/coref/resolver/ src/main/java/opennlp/tools/coref/sim/

2013-12-04 Thread Mark G
I have a lot of data lying around. How do I train it?


On Mon, Nov 25, 2013 at 3:02 PM, Jörn Kottmann  wrote:

> Actually that code should have compiled just fine against maxent 3.0.3.
>
> Anyway, the reason for the separation from opennlp-tools is that we need
> to first build/finish the tooling
> to train the coref component. In my opinion this will be easier if we just
> let the code continue to use the old
> maxent library. After that is accomplished we could start updating and
> refactoring it and re-integrate it into opennlp-tools.
>
> Do you have some data sets you could train it on? I am happy to provide
> assistance and point out issues I encountered.
>
> Jörn
>
>
> On 11/24/2013 04:08 AM, ma...@apache.org wrote:
>
>> Author: markg
>> Date: Sun Nov 24 03:08:54 2013
>> New Revision: 1544904
>>
>> URL: http://svn.apache.org/r1544904
>> Log:
>> OPENNLP-621
>> Fixed errors and changed all approprate imports to opennlp.tools.ml.
>> Builds but no testing done yet.
>>
>> Modified:
>>  opennlp/sandbox/opennlp-coref/   (props changed)
>>  opennlp/sandbox/opennlp-coref/src/main/java/opennlp/tools/
>> coref/CorefModel.java
>>  opennlp/sandbox/opennlp-coref/src/main/java/opennlp/tools/
>> coref/resolver/DefaultNonReferentialResolver.java
>>  opennlp/sandbox/opennlp-coref/src/main/java/opennlp/tools/
>> coref/resolver/MaxentResolver.java
>>  opennlp/sandbox/opennlp-coref/src/main/java/opennlp/tools/
>> coref/sim/GenderModel.java
>>  opennlp/sandbox/opennlp-coref/src/main/java/opennlp/tools/
>> coref/sim/NumberModel.java
>>  opennlp/sandbox/opennlp-coref/src/main/java/opennlp/tools/
>> coref/sim/SimilarityModel.java
>>
>> Propchange: opennlp/sandbox/opennlp-coref/
>> 
>> --
>> --- svn:ignore (added)
>> +++ svn:ignore Sun Nov 24 03:08:54 2013
>> @@ -0,0 +1 @@
>> +target
>>
>> Modified: opennlp/sandbox/opennlp-coref/src/main/java/opennlp/tools/
>> coref/CorefModel.java
>> URL: http://svn.apache.org/viewvc/opennlp/sandbox/opennlp-coref/
>> src/main/java/opennlp/tools/coref/CorefModel.java?rev=
>> 1544904&r1=1544903&r2=1544904&view=diff
>> 
>> ==
>> --- 
>> opennlp/sandbox/opennlp-coref/src/main/java/opennlp/tools/coref/CorefModel.java
>> (original)
>> +++ 
>> opennlp/sandbox/opennlp-coref/src/main/java/opennlp/tools/coref/CorefModel.java
>> Sun Nov 24 03:08:54 2013
>> @@ -26,9 +26,10 @@ import java.io.FileOutputStream;
>>   import java.io.FileReader;
>>   import java.io.IOException;
>>   import java.util.zip.GZIPInputStream;
>> -
>> -import opennlp.maxent.io.BinaryGISModelReader;
>> -import opennlp.model.AbstractModel;
>> +import opennlp.tools.ml.maxent.io.BinaryGISModelReader;
>> +//import opennlp.maxent.io.BinaryGISModelReader;
>> +//import opennlp.model.AbstractModel;
>> +import opennlp.tools.ml.model.AbstractModel;
>>   import opennlp.tools.dictionary.Dictionary;
>>   import opennlp.tools.util.StringList;
>>   import opennlp.tools.util.model.BaseModel;
>>
>> Modified: opennlp/sandbox/opennlp-coref/src/main/java/opennlp/tools/
>> coref/resolver/DefaultNonReferentialResolver.java
>> URL: http://svn.apache.org/viewvc/opennlp/sandbox/opennlp-coref/
>> src/main/java/opennlp/tools/coref/resolver/DefaultNonReferentialResolver.
>> java?rev=1544904&r1=1544903&r2=1544904&view=diff
>> 
>> ==
>> --- opennlp/sandbox/opennlp-coref/src/main/java/opennlp/tools/
>> coref/resolver/DefaultNonReferentialResolver.java (original)
>> +++ opennlp/sandbox/opennlp-coref/src/main/java/opennlp/tools/
>> coref/resolver/DefaultNonReferentialResolver.java Sun Nov 24 03:08:54
>> 2013
>> @@ -25,14 +25,26 @@ import java.util.ArrayList;
>>   import java.util.Iterator;
>>   import java.util.List;
>>   -import opennlp.maxent.GIS;
>> -import opennlp.maxent.io.BinaryGISModelReader;
>> -import opennlp.maxent.io.SuffixSensitiveGISModelReader;
>> -import opennlp.maxent.io.SuffixSensitiveGISModelWriter;
>> -import opennlp.model.Event;
>> -import opennlp.model.MaxentModel;
>> +//import opennlp.maxent.GIS;
>> +//import opennlp.maxent.io.BinaryGISModelReader;
>> +//import opennlp.maxent.io.SuffixSensitiveGISModelReader;
>> +//import opennlp.maxent.io.SuffixSensitiveGISModelWriter;
>> +//import opennlp.maxent.GIS;
>> +import opennlp.tools.ml.maxent.io.BinaryGISModelReader;
>> +import opennlp.tools.ml.maxent.GIS;
>> +import opennlp.tools.ml.maxent.io.SuffixSensitiveGISModelWriter;
>> +import opennlp.tools.ml.maxent.io.SuffixSensitiveGISModelReader;
>> +//import opennlp.maxent.io.SuffixSensitiveGISModelReader;
>> +//import opennlp.maxent.io.SuffixSensitiveGISModelWriter;
>> +//import opennlp.model.Event;
>> +import opennlp.tools.ml.model.MaxentModel;
>> +//import opennlp.model.MaxentModel;
>> +
>> +import opennlp.tools.ml.model.EventStream;
>> +//import opennlp.model.MaxentModel;
>>   import opennlp.tools.

Re: request for Input or ideas.... EntityLinker tickets

2013-11-05 Thread Mark G
sounds good, I will get it in the sandbox...



On Tue, Nov 5, 2013 at 8:43 AM, Jörn Kottmann  wrote:

> On 11/05/2013 01:23 PM, Mark G wrote:
>
>> Joern, I am using Lucene inside the GeoEntityLinker impl, so do you think
>> I
>> should move the entire GeoEntityLinker impl and all its classes to a new
>> module in the sandbox and leave only the entitylinker framework in opennlp
>> tools?
>>
>
> +1, it might be nice to provide a simple dictionary based implementation
> as a sample for simple
> tasks, e.g. just map country names.
> If the user wants to use the Lucene-based solution he should just depend
> on the addon module, which
> we should release together with the core components.
>
>
>  A second thought/option is to make the lucene pom entries optional in the
>> opennlptools pom, so users will have to add lucene to their pom to run the
>> geoentitylinker and the jars will not be included in the tools build
>>
>
> I really prefer the other solution because then a user needs to explicitly
> deal with it once
> to make his project work; if you do this, people will probably start using the
> classes and then discover
> only by trial and error that there must be something missing on their
> classpath.
>
> The lemmatizer will also be part of the addon solution, if there are no
> concerns I suggest we get our
> addons started.
>
> Jörn
>


Re: request for Input or ideas.... EntityLinker tickets

2013-11-05 Thread Mark G
Joern, I am using Lucene inside the GeoEntityLinker impl, so do you think I
should move the entire GeoEntityLinker impl and all its classes to a new
module in the sandbox and leave only the EntityLinker framework in
opennlp-tools?
A second thought/option is to make the Lucene pom entries optional in the
opennlp-tools pom, so users will have to add Lucene to their pom to run the
GeoEntityLinker and the jars will not be included in the tools build.


On Tue, Nov 5, 2013 at 3:39 AM, Jörn Kottmann  wrote:

> On 11/03/2013 02:22 AM, Mark G wrote:
>
>> I finished with the Lucene indexing of the Gazateers, just need to get
>> them
>> tied into the gaz lookups, which is fairly simple. Do you all think I
>> should disregard all the MySQL dependency and just have Lucene? The lucene
>> index files are only about 2.5 gigs total, so very manageable to
>> distribute
>> the files across a cluster. I could keep the MySQL classes as an option,
>> but at this point the Lucene based approach is really growing on me.
>> If I don't here from anyone I am going to remove the MySQL implementation.
>>
>
> +1 I believe a Lucene based solution is easier to handle for most people,
> because it can
> be fully integrated via API (no need to install anything) and therefor
> hides most of
> the complexity.
>
> Please avoid adding a dependency for lucene to the opennlp-tools project,
> I suggest that we
> add this code to the sandbox, or a new addon area. If people want to use a
> Lucene based dictionary
> they can depend on that module explicitly.
>
> Jörn
>


Re: Move to Java 7

2013-11-05 Thread Mark G
My vote is move to 7. Java 7 has been out for over 2 years.


On Tue, Nov 5, 2013 at 4:18 AM, Richard Eckart de Castilho <
richard.eck...@gmail.com> wrote:

> On 05.11.2013, at 09:35, Jörn Kottmann  wrote:
>
> > We are on Java5 currently, that makes the build for a release more and
> more inconvenient because you have to install
> > a Java5 VM to get it done.
>
> It is not necessary to have a Java5 VM to do a release for Java 5. Just
> set the source and target versions in the
> compile plugin and you are fine. I don't have a Java 5 VM since quite some
> time, but still do releases (e.g. uimaFIT)
> targeting Java 5.
>
> However, there is a risk that non-Java 5 API is being used without being
> noticed (e.g. new IOException(Throwable)).
> This can be caught by setting up a corresponding Jenkins job or eventually
> by users after a release, which then may
> require a bug-fix release.
>
> -- Richard


Re: request for Input or ideas.... EntityLinker tickets

2013-11-02 Thread Mark G
I finished with the Lucene indexing of the gazetteers, just need to get them
tied into the gaz lookups, which is fairly simple. Do you all think I
should disregard the MySQL dependency and just have Lucene? The lucene
index files are only about 2.5 gigs total, so very manageable to distribute
the files across a cluster. I could keep the MySQL classes as an option,
but at this point the Lucene based approach is really growing on me.
If I don't hear from anyone I am going to remove the MySQL implementation.
Thanks
MG


On Wed, Oct 30, 2013 at 7:34 PM, Lance Norskog  wrote:

> Just to elaborate- The RAMDirectory storage is in Java GC. This makes Java
> GC work very very hard. A memory-mapped file is a write-through cache for
> file contents. The memory in the cache is outside of Java garbage
> collection. A memory-mapped index will take a little less time to create at
> these volumes. Loading a pre-built memory-mapped index will be under 5
> seconds.
>
>
> On 10/29/2013 03:43 PM, Mark G wrote:
>
>> thanks, that was my next option with lucene. Build the indexes from the
>> gaz
>> files and keep them up to date in one place, and make sure something like
>> puppet will distribute them to each node in a cluster on some interval,
>> then each task (map reduce or whatever) can use that file resource. I'll
>> let everyone know how it goes
>> MG
>>
>>
>> On Tue, Oct 29, 2013 at 6:06 PM, Lance Norskog  wrote:
>>
>>  This is what memory-mapped file indexes are for! RAMDirectory is for very
>>> small projects.
>>>
>>>
>>> On 10/29/2013 04:00 AM, Mark G wrote:
>>>
>>>  FYI, I implemented an in mem lucene index of the NGA Geonames. It was
>>>> almost 7 GB ram and took about 40 minutes to load.
>>>> Still looking at other DBs/Indexes. So one would need at least 10G ram
>>>> to
>>>> hold the USGS and NGA gazateers.
>>>>
>>>>
>>>> On Fri, Oct 25, 2013 at 6:21 AM, Mark G  wrote:
>>>>
>>>>   I wrote a quick lucene RAMDirectory in memory index, it looks like a
>>>>
>>>>> valid
>>>>> option to hold the gazateers and it provides good text search of
>>>>> course.
>>>>> The idea is that at runtime the geoentitylinker would pull three files
>>>>> off
>>>>> disk, the NGAGeonames file, the USGS FIle, and the CountryContext
>>>>> indicator
>>>>> file and lucene index them in memory,. initially this will take a
>>>>> while.
>>>>> So, deployment wise, you would have to use your tool of choice (ie
>>>>> Puppet)
>>>>> to distribute the files to each node, or mount a share to each node. My
>>>>> concern with this approach is that each MR Task runs in it's own JVM,
>>>>> so
>>>>> each task on each node will consume this much memory unless you do
>>>>> something interesting with memory mapping. The EntityLinkerProperties
>>>>> file
>>>>> will support the config of the file locations and whether to use DB or
>>>>> in
>>>>> mem Lucene...
>>>>>
>>>>> I am also working on a Postgres version of the gazateer structures and
>>>>> stored procs.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>>
>>>>> On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann 
>>>>> wrote:
>>>>>
>>>>>   On 10/23/2013 01:14 PM, Mark G wrote:
>>>>>
>>>>>>   All that being said, it is totally possible to run an in memory
>>>>>> version
>>>>>>
>>>>>>> of
>>>>>>> the gazateer. Personally, I like the DB approach, it provides a lot
>>>>>>> of
>>>>>>> flexibility and power.
>>>>>>>
>>>>>>>   Yes, and you can even use a DB to run in-memory which works with
>>>>>>> the
>>>>>>>
>>>>>> current implementation,
>>>>>> I think I will experiment with that.
>>>>>>
>>>>>> I don't really mind using 3 GB memory for it, since my Hadoop servers
>>>>>> have more than enough anyway,
>>>>>> and it makes the deployment easier (don't have to deal with installing
>>>>>> MySQL
>>>>>> databases and keeping them in sync).
>>>>>>
>>>>>> Jörn
>>>>>>
>>>>>>
>>>>>>
>


Re: request for Input or ideas.... EntityLinker tickets

2013-10-29 Thread Mark G
thanks, that was my next option with lucene. Build the indexes from the gaz
files and keep them up to date in one place, and make sure something like
puppet will distribute them to each node in a cluster on some interval,
then each task (map reduce or whatever) can use that file resource. I'll
let everyone know how it goes
MG


On Tue, Oct 29, 2013 at 6:06 PM, Lance Norskog  wrote:

> This is what memory-mapped file indexes are for! RAMDirectory is for very
> small projects.
>
>
> On 10/29/2013 04:00 AM, Mark G wrote:
>
>> FYI, I implemented an in mem lucene index of the NGA Geonames. It was
>> almost 7 GB ram and took about 40 minutes to load.
>> Still looking at other DBs/Indexes. So one would need at least 10G ram to
>> hold the USGS and NGA gazateers.
>>
>>
>> On Fri, Oct 25, 2013 at 6:21 AM, Mark G  wrote:
>>
>>  I wrote a quick lucene RAMDirectory in memory index, it looks like a
>>> valid
>>> option to hold the gazateers and it provides good text search of course.
>>> The idea is that at runtime the geoentitylinker would pull three files
>>> off
>>> disk, the NGAGeonames file, the USGS FIle, and the CountryContext
>>> indicator
>>> file and lucene index them in memory,. initially this will take a while.
>>> So, deployment wise, you would have to use your tool of choice (ie
>>> Puppet)
>>> to distribute the files to each node, or mount a share to each node. My
>>> concern with this approach is that each MR Task runs in it's own JVM, so
>>> each task on each node will consume this much memory unless you do
>>> something interesting with memory mapping. The EntityLinkerProperties
>>> file
>>> will support the config of the file locations and whether to use DB or in
>>> mem Lucene...
>>>
>>> I am also working on a Postgres version of the gazateer structures and
>>> stored procs.
>>>
>>> Thoughts?
>>>
>>>
>>> On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann 
>>> wrote:
>>>
>>>  On 10/23/2013 01:14 PM, Mark G wrote:
>>>>
>>>>  All that being said, it is totally possible to run an in memory version
>>>>> of
>>>>> the gazateer. Personally, I like the DB approach, it provides a lot of
>>>>> flexibility and power.
>>>>>
>>>>>  Yes, and you can even use a DB to run in-memory which works with the
>>>> current implementation,
>>>> I think I will experiment with that.
>>>>
>>>> I don't really mind using 3 GB memory for it, since my Hadoop servers
>>>> have more than enough anyway,
>>>> and it makes the deployment easier (don't have to deal with installing
>>>> MySQL
>>>> databases and keeping them in sync).
>>>>
>>>> Jörn
>>>>
>>>>
>>>
>


Re: request for Input or ideas.... EntityLinker tickets

2013-10-29 Thread Mark G
FYI, I implemented an in-mem lucene index of the NGA Geonames. It was
almost 7 GB of RAM and took about 40 minutes to load.
Still looking at other DBs/indexes. So one would need at least 10G of RAM to
hold the USGS and NGA gazetteers.


On Fri, Oct 25, 2013 at 6:21 AM, Mark G  wrote:

> I wrote a quick lucene RAMDirectory in memory index, it looks like a valid
> option to hold the gazateers and it provides good text search of course.
> The idea is that at runtime the geoentitylinker would pull three files off
> disk, the NGAGeonames file, the USGS FIle, and the CountryContext indicator
> file and lucene index them in memory,. initially this will take a while.
> So, deployment wise, you would have to use your tool of choice (ie Puppet)
> to distribute the files to each node, or mount a share to each node. My
> concern with this approach is that each MR Task runs in it's own JVM, so
> each task on each node will consume this much memory unless you do
> something interesting with memory mapping. The EntityLinkerProperties file
> will support the config of the file locations and whether to use DB or in
> mem Lucene...
>
> I am also working on a Postgres version of the gazateer structures and
> stored procs.
>
> Thoughts?
>
>
> On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann  wrote:
>
>> On 10/23/2013 01:14 PM, Mark G wrote:
>>
>>> All that being said, it is totally possible to run an in memory version
>>> of
>>> the gazateer. Personally, I like the DB approach, it provides a lot of
>>> flexibility and power.
>>>
>>
>> Yes, and you can even use a DB to run in-memory which works with the
>> current implementation,
>> I think I will experiment with that.
>>
>> I don't really mind using 3 GB memory for it, since my Hadoop servers
>> have more than enough anyway,
>> and it makes the deployment easier (don't have to deal with installing
>> MySQL
>> databases and keeping them in sync).
>>
>> Jörn
>>
>
>


Re: svn commit: r1536630 - /opennlp/trunk/opennlp-tools/pom.xml

2013-10-29 Thread Mark G
Ok, thanks! I didn't know which server was the build server... same as SVN?
I will try to commit some java 7 code now
MG


On Tue, Oct 29, 2013 at 6:25 AM, Jörn Kottmann  wrote:

> On 10/29/2013 11:11 AM, Mark G wrote:
>
>> OK, I committed the POM only just to see if the CI server has Java 7 in
>> it's path. Since we haven't gotten a build error yet, it may be fine.
>> Haven't tried committing  any code with J7 objects.
>>
>
> I updated the configuration on the build server is now to Java 7, to do
> this you have to log
> in there with your apache id and then go to "Configuration" of the OpenNLP
> project.
>
> Jörn
>


Re: svn commit: r1536630 - /opennlp/trunk/opennlp-tools/pom.xml

2013-10-29 Thread Mark G
OK, I committed the POM only just to see if the CI server has Java 7 in
its path. Since we haven't gotten a build error yet, it may be fine.
Haven't tried committing any code with J7 objects.
thanks
MG


On Tue, Oct 29, 2013 at 6:05 AM, Jörn Kottmann  wrote:

> The README file in opennlp-distr must be updated as well,
> I will have a look at the build server.
>
> Jörn
>
>
>
> On 10/29/2013 10:49 AM, ma...@apache.org wrote:
>
>> Author: markg
>> Date: Tue Oct 29 09:49:58 2013
>> New Revision: 1536630
>>
>> URL: http://svn.apache.org/r1536630
>> Log:
>> OPENNLP-611
>> Commiting POM with 1.7 build tags.
>>
>> Modified:
>>  opennlp/trunk/opennlp-tools/pom.xml
>>
>> Modified: opennlp/trunk/opennlp-tools/pom.xml
>> URL: http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/pom.xml?rev=1536630&r1=1536629&r2=1536630&view=diff
>> ==============================================================
>> --- opennlp/trunk/opennlp-tools/pom.xml (original)
>> +++ opennlp/trunk/opennlp-tools/pom.xml Tue Oct 29 09:49:58 2013
>> @@ -139,7 +139,16 @@
>> 
>>   
>> 
>> - 
>> + 
>> + <plugin>
>> +  <groupId>org.apache.maven.plugins</groupId>
>> +  <artifactId>maven-compiler-plugin</artifactId>
>> +  <version>2.3.2</version>
>> +  <configuration>
>> +   <source>1.7</source>
>> +   <target>1.7</target>
>> +  </configuration>
>> + </plugin>
>> 
>> 
>>   
>>
>>
>>
>


Re: request for Input or ideas.... EntityLinker tickets

2013-10-26 Thread Mark G
I am looking at the EntityLinker interface, and I would like to add this
method (one which I think was proposed very early on). This allows an
entire doc's worth of NEs to be processed. Currently, if a scoring routine
needs all the results from the entire document, the scorer cannot be called
from within the EntityLinker impl. The below method allows a user to
perform all NER as normal for an entire doc, then pass all that info into
this method. I realized this when writing the scoring algorithms for the
GeoEntityLinker... some require all the hits for the doc, some don't, so I
was using some scorers internally and some after; it got messy and
confusing. This would also allow for better pipeline integration, so no
scorers would have to be chained after the EntityLinking; it would all
happen within.

Thoughts?

like this:
  public List<LinkedSpan> find(String doctext, Span[] sentences, String[][]
tokens, Span[][] names) {
    ArrayList<LinkedSpan> spans = new ArrayList<LinkedSpan>();
    for (int s = 0; s < sentences.length; s++) {
      for (String name : Span.spansToStrings(names[s], tokens[s])) {
        //do something with each named entity in the sentence
      }
    }
    return spans;
  }
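
And here is a rough sketch of how a caller would feed it (the detector,
tokenizer and name finder are the standard components; the linker variable
is just illustrative):

import java.util.List;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.util.Span;

// hypothetical driver showing the shape of the call
public List<LinkedSpan> linkDocument(String doctext, SentenceDetectorME detector,
    TokenizerME tokenizer, NameFinderME nameFinder, EntityLinker linker) {
  Span[] sentences = detector.sentPosDetect(doctext);
  String[][] tokens = new String[sentences.length][];
  Span[][] names = new Span[sentences.length][];
  for (int i = 0; i < sentences.length; i++) {
    // token and name spans stay relative to their sentence
    tokens[i] = tokenizer.tokenize(sentences[i].getCoveredText(doctext).toString());
    names[i] = nameFinder.find(tokens[i]);
  }
  // one call per document, so the scorers see every hit at once
  return linker.find(doctext, sentences, tokens, names);
}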


On Wed, Oct 23, 2013 at 11:36 AM, Mark G  wrote:

> not sure if the in mem approach will provide the equivalent to full text
> indexingbut worth a try. Another design pattern is to just install one
> DB and have all the nodes connect. I have done this with Postgres on a
> 40ish node hadoop cluster. The queries against the db's full text index are
> not that expensive for mysql, it's not a complex query, just a seek on the
> full text index.  But, of course, it depends on how much concurrency it
> will get, which depends on how much data, nodes, and tasks you have
> Generically I think the right answer is to be able to configure the
> connection behind the GeoEntityLinker... in mem || remote db || locahost db
>
>
>
> On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann  wrote:
>
>> On 10/23/2013 01:14 PM, Mark G wrote:
>>
>>> All that being said, it is totally possible to run an in memory version
>>> of
>>> the gazateer. Personally, I like the DB approach, it provides a lot of
>>> flexibility and power.
>>>
>>
>> Yes, and you can even use a DB to run in-memory which works with the
>> current implementation,
>> I think I will experiment with that.
>>
>> I don't really mind using 3 GB memory for it, since my Hadoop servers
>> have more than enough anyway,
>> and it makes the deployment easier (don't have to deal with installing
>> MySQL
>> databases and keeping them in sync).
>>
>> Jörn
>>
>
>


Re: svn commit: r1534864 - /opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/GeoHashBinScorer.java

2013-10-25 Thread Mark G
I did some research... never configured jenkins/hudson before, but it
looks like if we change the reactor POM from

<plugin>
 <groupId>org.apache.maven.plugins</groupId>
 <artifactId>maven-compiler-plugin</artifactId>
 <configuration>
  <source>1.5</source>
  <target>1.5</target>
  <compilerArgument>-Xlint</compilerArgument>
 </configuration>
</plugin>

to
  <source>1.7</source>
  <target>1.7</target>
and make sure 1.7 is actually on the server, JAVA_HOME is set and java 1.7
is on the PATH, it should be fairly straightforward.
If you have any experience/advice let me know if this makes sense or not.

thanks
MG


On Fri, Oct 25, 2013 at 12:27 PM, Mark G  wrote:

> OK, is there anything we need to do to the CI server? If so, I will need
> access to it of course.
> Thanks
> MG
>
>
> On Fri, Oct 25, 2013 at 11:39 AM, Jörn Kottmann wrote:
>
>> +1 to do it for the 1.6.0 release. Lets reach consensus to which version
>> we switch and then it can be done.
>>
>> Its not a big deal, we need to update the pom.xml files and maybe some
>> requirement
>> section in our documentation.
>>
>> It would be nice if you could take this over Mark.
>>
>> Jörn
>>
>>
>> On 10/25/2013 12:09 PM, Mark G wrote:
>>
>>> When is the best time to do it? Let me know how I can help.
>>> MG
>>>
>>>
>>> On Wed, Oct 23, 2013 at 8:02 PM, James Kosin 
>>> wrote:
>>>
>>>  +1, the good news is we don't need any changes for 7 support.  It works
>>>> as
>>>> is now.
>>>> I use 7 currently for development.
>>>>
>>>> - James Kosin
>>>>
>>>>
>>>> On 10/23/2013 2:35 PM, William Colen wrote:
>>>>
>>>>  +1 move do 6 or 7 for the next major release. We can ask what our users
>>>>> think of it.
>>>>>
>>>>> Em quarta-feira, 23 de outubro de 2013, Ioan Barbulescu escreveu:
>>>>>
>>>>>   Hi guys
>>>>>
>>>>>> I would vote for java 7, as well.
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> BR,
>>>>>> Ioan
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 23, 2013 at 6:24 PM, Mark G >>>>>
>>>>>> javascript:;>>
>>>>>> wrote:
>>>>>>
>>>>>>   agree, straight to 7 makes sense to me... "try with resources",
>>>>>> better
>>>>>>
>>>>>>> collections support, switch on strings etc all new in 7
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Oct 23, 2013 at 8:36 AM, Jörn Kottmann >>>>>> <**
>>>>>>> javascript:;>>
>>>>>>>
>>>>>>>  wrote:
>>>>>>
>>>>>>  On 10/23/2013 01:21 PM, Mark G wrote:
>>>>>>>
>>>>>>>>   When will we move to 6?
>>>>>>>>
>>>>>>>>>   I don't have any strong opinions about moving forward. Some say
>>>>>>>>> it
>>>>>>>>>
>>>>>>>> might
>>>>>>> be better
>>>>>>>
>>>>>>>> to move directly to Java 7 or even wait for Java 8.
>>>>>>>>
>>>>>>>> There are not that many interesting new features in Java 6, thats
>>>>>>>> why I
>>>>>>>> believe it might
>>>>>>>> be worth to make a bigger step to avoid one or two versions.
>>>>>>>>
>>>>>>>> Any opinions? Do we still have a Java 5 user here?
>>>>>>>>
>>>>>>>> Jörn
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>
>


Re: svn commit: r1534864 - /opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/GeoHashBinScorer.java

2013-10-25 Thread Mark G
OK, is there anything we need to do to the CI server? If so, I will need
access to it of course.
Thanks
MG


On Fri, Oct 25, 2013 at 11:39 AM, Jörn Kottmann  wrote:

> +1 to do it for the 1.6.0 release. Lets reach consensus to which version
> we switch and then it can be done.
>
> Its not a big deal, we need to update the pom.xml files and maybe some
> requirement
> section in our documentation.
>
> It would be nice if you could take this over Mark.
>
> Jörn
>
>
> On 10/25/2013 12:09 PM, Mark G wrote:
>
>> When is the best time to do it? Let me know how I can help.
>> MG
>>
>>
>> On Wed, Oct 23, 2013 at 8:02 PM, James Kosin 
>> wrote:
>>
>>  +1, the good news is we don't need any changes for 7 support.  It works
>>> as
>>> is now.
>>> I use 7 currently for development.
>>>
>>> - James Kosin
>>>
>>>
>>> On 10/23/2013 2:35 PM, William Colen wrote:
>>>
>>>  +1 move do 6 or 7 for the next major release. We can ask what our users
>>>> think of it.
>>>>
>>>> Em quarta-feira, 23 de outubro de 2013, Ioan Barbulescu escreveu:
>>>>
>>>>   Hi guys
>>>>
>>>>> I would vote for java 7, as well.
>>>>>
>>>>> Thank you.
>>>>>
>>>>> BR,
>>>>> Ioan
>>>>>
>>>>>
>>>>> On Wed, Oct 23, 2013 at 6:24 PM, Mark G >>>>
>>>>> javascript:;>>
>>>>> wrote:
>>>>>
>>>>>   agree, straight to 7 makes sense to me... "try with resources",
>>>>> better
>>>>>
>>>>>> collections support, switch on strings etc all new in 7
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 23, 2013 at 8:36 AM, Jörn Kottmann >>>>> javascript:;>>
>>>>>>
>>>>>>  wrote:
>>>>>
>>>>>  On 10/23/2013 01:21 PM, Mark G wrote:
>>>>>>
>>>>>>>   When will we move to 6?
>>>>>>>
>>>>>>>>   I don't have any strong opinions about moving forward. Some say it
>>>>>>>>
>>>>>>> might
>>>>>> be better
>>>>>>
>>>>>>> to move directly to Java 7 or even wait for Java 8.
>>>>>>>
>>>>>>> There are not that many interesting new features in Java 6, thats
>>>>>>> why I
>>>>>>> believe it might
>>>>>>> be worth to make a bigger step to avoid one or two versions.
>>>>>>>
>>>>>>> Any opinions? Do we still have a Java 5 user here?
>>>>>>>
>>>>>>> Jörn
>>>>>>>
>>>>>>>
>>>>>>>
>


Re: request for Input or ideas.... EntityLinker tickets

2013-10-25 Thread Mark G
I wrote a quick lucene RAMDirectory in-memory index, and it looks like a valid
option to hold the gazetteers, and it provides good text search of course.
The idea is that at runtime the geoentitylinker would pull three files off
disk, the NGA Geonames file, the USGS file, and the CountryContext indicator
file, and lucene index them in memory. Initially this will take a while.
So, deployment-wise, you would have to use your tool of choice (ie Puppet)
to distribute the files to each node, or mount a share to each node. My
concern with this approach is that each MR Task runs in its own JVM, so
each task on each node will consume this much memory unless you do
something interesting with memory mapping. The EntityLinkerProperties file
will support the config of the file locations and whether to use DB or in
mem Lucene...

I am also working on a Postgres version of the gazetteer structures and
stored procs.

Thoughts?


On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann  wrote:

> On 10/23/2013 01:14 PM, Mark G wrote:
>
>> All that being said, it is totally possible to run an in memory version of
>> the gazateer. Personally, I like the DB approach, it provides a lot of
>> flexibility and power.
>>
>
> Yes, and you can even use a DB to run in-memory which works with the
> current implementation,
> I think I will experiment with that.
>
> I don't really mind using 3 GB memory for it, since my Hadoop servers have
> more than enough anyway,
> and it makes the deployment easier (don't have to deal with installing
> MySQL
> databases and keeping them in sync).
>
> Jörn
>


Re: svn commit: r1534864 - /opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/GeoHashBinScorer.java

2013-10-25 Thread Mark G
When is the best time to do it? Let me know how I can help.
MG


On Wed, Oct 23, 2013 at 8:02 PM, James Kosin  wrote:

> +1, the good news is we don't need any changes for 7 support.  It works as
> is now.
> I use 7 currently for development.
>
> - James Kosin
>
>
> On 10/23/2013 2:35 PM, William Colen wrote:
>
>> +1 move to 6 or 7 for the next major release. We can ask what our users
>> think of it.
>>
>> Em quarta-feira, 23 de outubro de 2013, Ioan Barbulescu escreveu:
>>
>>  Hi guys
>>>
>>> I would vote for java 7, as well.
>>>
>>> Thank you.
>>>
>>> BR,
>>> Ioan
>>>
>>>
>>> On Wed, Oct 23, 2013 at 6:24 PM, Mark G >> javascript:;>>
>>> wrote:
>>>
>>>  agree, straight to 7 makes sense to me... "try with resources", better
>>>> collections support, switch on strings etc all new in 7
>>>>
>>>>
>>>> On Wed, Oct 23, 2013 at 8:36 AM, Jörn Kottmann >>> javascript:;>>
>>>>
>>> wrote:
>>>
>>>> On 10/23/2013 01:21 PM, Mark G wrote:
>>>>>
>>>>>  When will we move to 6?
>>>>>>
>>>>>>  I don't have any strong opinions about moving forward. Some say it
>>>>>
>>>> might
>>>
>>>> be better
>>>>> to move directly to Java 7 or even wait for Java 8.
>>>>>
>>>>> There are not that many interesting new features in Java 6, thats why I
>>>>> believe it might
>>>>> be worth to make a bigger step to avoid one or two versions.
>>>>>
>>>>> Any opinions? Do we still have a Java 5 user here?
>>>>>
>>>>> Jörn
>>>>>
>>>>>
>>
>


Re: svn commit: r1535339 - /opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/EntityLinkerProperties.java

2013-10-24 Thread Mark G
sounds good, I'll remove
thanks
MG


On Thu, Oct 24, 2013 at 7:47 AM, Jörn Kottmann  wrote:

> On 10/24/2013 12:58 PM, ma...@apache.org wrote:
>
>> +  public EntityLinkerProperties(String propertiesfile) throws
>> IOException, FileNotFoundException {
>> +this.propertyFileLocation = propertiesfile;
>>   stream = new FileInputStream(propertiesfile);
>>   props.load(stream);
>> +stream.close();
>> }
>>
>
> In other parts of OpenNLP we removed all these constructors and methods
> which take a String for a file,
> because it makes it clear from the method signature what is expected.
>
> Jörn
>


Re: request for Input or ideas.... EntityLinker tickets

2013-10-23 Thread Mark G
not sure if the in mem approach will provide the equivalent to full text
indexing... but worth a try. Another design pattern is to just install one
DB and have all the nodes connect. I have done this with Postgres on a
40ish node hadoop cluster. The queries against the db's full text index are
not that expensive for mysql, it's not a complex query, just a seek on the
full text index. But, of course, it depends on how much concurrency it
will get, which depends on how much data, nodes, and tasks you have...
Generically I think the right answer is to be able to configure the
connection behind the GeoEntityLinker... in mem || remote db || localhost db



On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann  wrote:

> On 10/23/2013 01:14 PM, Mark G wrote:
>
>> All that being said, it is totally possible to run an in memory version of
>> the gazateer. Personally, I like the DB approach, it provides a lot of
>> flexibility and power.
>>
>
> Yes, and you can even use a DB to run in-memory which works with the
> current implementation,
> I think I will experiment with that.
>
> I don't really mind using 3 GB memory for it, since my Hadoop servers have
> more than enough anyway,
> and it makes the deployment easier (don't have to deal with installing
> MySQL
> databases and keeping them in sync).
>
> Jörn
>


Re: svn commit: r1534864 - /opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/GeoHashBinScorer.java

2013-10-23 Thread Mark G
agree, straight to 7 makes sense to me... "try with resources", better
collections support, switch on strings, etc., all new in 7.
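
A quick illustration of the first two, for anyone who hasn't tried them:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class Java7Demo {
  public static void main(String[] args) throws IOException {
    // try-with-resources: the reader is closed automatically, even on error
    try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
      String line = in.readLine();
      // switch on strings, also new in 7
      switch (line == null ? "empty" : "full") {
        case "empty":
          System.out.println("no lines");
          break;
        default:
          System.out.println("first line: " + line);
      }
    }
  }
}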


On Wed, Oct 23, 2013 at 8:36 AM, Jörn Kottmann  wrote:

> On 10/23/2013 01:21 PM, Mark G wrote:
>
>> When will we move to 6?
>>
>
> I don't have any strong opinions about moving forward. Some say it might
> be better
> to move directly to Java 7 or even wait for Java 8.
>
> There are not that many interesting new features in Java 6, that's why I
> believe it might
> be worth to make a bigger step to avoid one or two versions.
>
> Any opinions? Do we still have a Java 5 user here?
>
> Jörn
>


Re: svn commit: r1534864 - /opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/GeoHashBinScorer.java

2013-10-23 Thread Mark G
I investigated and you are right, even though I am building with Java 5
locally, the TreeSet.higher() and TreeSet.subSet() methods are not there in
Java 5.
When will we move to 6?
MG


On Wed, Oct 23, 2013 at 3:14 AM, Jörn Kottmann  wrote:

> Probably because the build server is using a Java 5.
>
> Jörn
>
>
> On 10/23/2013 02:00 AM, ma...@apache.org wrote:
>
>> Author: markg
>> Date: Wed Oct 23 00:00:57 2013
>> New Revision: 1534864
>>
>> URL: http://svn.apache.org/r1534864
>> Log:
>> OPENNLP-608
>> Deleted GeoHashBinScorer so the build will become stable. Hudson claiming
>> methods missing from core java objects (TreeSet)... not sure why
>>
>> Removed:
>>  opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/GeoHashBinScorer.java
>>
>>
>


Re: request for Input or ideas.... EntityLinker tickets

2013-10-23 Thread Mark G
The database is only about 3GB of storage right now. Since I used pure JDBC
and JDBC style stored proc calls, it can run with any JDBC driver, and all
the connection props are in the EntityLinkerProperties file, so it can run
on other database engines. Currently it is optional to use the MySQL fuzzy
string matching, all one has to do is change the stored proc to boolean
mode rather than natural language mode. If you really mean, do we have to
use mysql FULL TEXT *INDEXING*, then no, but with around 10 million
toponyms it provides super fast lookups without consuming a lot of memory.
If I was running the OpenNLP GeoEntityLinker in, say, MapReduce, and I am
running multiple tasks on each node, I would not want to pull 3GB into
memory for each task. The way it is now one could distribute MySQL to each
node via something like Puppet and it would serve requests from the tasks
on that node. Or if they have a beefy server they could make one large
instance of MySQL and have each node connect from the cluster.
All that being said, it is totally possible to run an in memory version of
the gazetteer. Personally, I like the DB approach, it provides a lot of
flexibility and power.
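
To make the boolean vs. natural language distinction concrete, here is
roughly the shape of the lookup if it were inlined as a plain query instead
of the stored proc (table and column names here are made up):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// boolean mode gives exact-ish matching; natural language mode returns
// ranked, fuzzier candidates that a scorer could threshold on
static ResultSet lookupToponym(Connection conn, String mention, boolean fuzzy)
    throws SQLException {
  String sql = fuzzy
      ? "SELECT name, latitude, longitude, MATCH(name) AGAINST(?) AS score "
          + "FROM gazetteer WHERE MATCH(name) AGAINST(?) LIMIT 50"
      : "SELECT name, latitude, longitude FROM gazetteer "
          + "WHERE MATCH(name) AGAINST(? IN BOOLEAN MODE)";
  PreparedStatement ps = conn.prepareStatement(sql);
  ps.setString(1, mention);
  if (fuzzy) {
    ps.setString(2, mention);
  }
  return ps.executeQuery();
}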


On Tue, Oct 22, 2013 at 2:39 PM, Jörn Kottmann  wrote:

> On 10/05/2013 11:58 PM, Mark G wrote:
>
>> 3. fuzzy string matching should be part of the scoring, this would allow
>> mysql fuzzy search to return more candidate toponyms.
>>
>> Currently, the search into the MySQL gazateers is using "boolean mode" and
>> each NER result is passed in as a literal string. If I implement a fuzzy
>> string matching based score (do we have one?) the user could turn on
>> "natural language" mode in MySQL then we can generate a score and thresh
>> to
>> allow for more recall on transliterated names etc
>> I would also like to use proximity to the majority of points in the
>> document as a disambiguation criteria as well.
>>
>
> It would probably be nice if this would work with other databases too,
> e.g. Apache Derby,
> or some in-memory database, maybe even Lucene.
>
> Would it be possible to not use the MySQL fuzzy string matching feature
> for this?
>
> I would like to run your code, but its difficult to scale the MySQL
> database in my scenario,
> but I have lots of RAM and believe the geonames dataset could fit into it
> to provide
> super fast lookups for me on my worker servers.
>
> Jörn
>


Re: request for Input or ideas.... EntityLinker tickets

2013-10-23 Thread Mark G
I have never used UIMA, but I have heard good things. All the analytics
processes I run are in Hadoop MapReduce and there are cascading jobs that
do many different things. However, this sounds like a good idea for a
"solution wrapper," and I understand and agree with your concern about
creating classes which combine components.
I would like to try it in UIMA, sounds great, where in the UIMA project do
I start?


On Tue, Oct 22, 2013 at 2:29 PM, Jörn Kottmann  wrote:

> On 10/05/2013 11:58 PM, Mark G wrote:
>
>> 4. provide a "solution wrapper" for the Geotagging capability
>>
>> In order to make the GeoTagging a bit more "out of the box" functional, I
>> was thinking of creating a class that one calls find(MaxentModel, doc,
>> sentencedetector, EntityLinkerProperties) to abstract the current impl. I
>> know this is not standard practice, just want to see what you all think.
>> This would make it "easier" to get this thing running.
>>
>
>
> What do you think about using a solution like UIMA to do this? I am not
> sure how you
> are intending to run your NLP pipelines but in my experiences that has
> worked out
> really well. UIMA can help to solve some production problems like
> scalability, error handling,
> etc.
>
> If you are interested in this you could write an Analysis Engine for the
> Entity Linker and add
> it to opennlp-uima.
>
> I still believe it is not a good idea to make classes which combine
> components to use them out of
> the box, because that never really suits all of our users, and it is easy
> to implement inside a user project.
>
> Anyway we should add command line support and implement a class which can
> demonstrate how the entity linker
> works in a similar fashion as our other command line tools.
>
> Jörn
>


Re: request for Input or ideas.... EntityLinker tickets

2013-10-23 Thread Mark G
Currently this regex finding of country context is done in a CountryContext
class which is behind the GeoEntityLinker impl itself. This class's
regexfind method takes the full doc text as a param and returns a hashmap
of each country code to a set of mentions in the doc:
 public Map<String, Set<Integer>> regexfind(String docText,
EntityLinkerProperties properties)
This could be done as a NameFinder impl extension, but since it was
specific to the GeoEntityLinker impl I didn't bother, though initially I did
think of this.
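
If I do move to regex, the method would look something like this (a rough
sketch; the indicator patterns would come from the country context file):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// keep every mention of every indicator, keyed by country code
public Map<String, Set<Integer>> regexfind(String docText,
    Map<String, Pattern> indicators) {
  Map<String, Set<Integer>> mentions = new HashMap<String, Set<Integer>>();
  for (Map.Entry<String, Pattern> entry : indicators.entrySet()) {
    Matcher m = entry.getValue().matcher(docText);
    while (m.find()) { // Matcher.find yields every start offset, not just the first
      Set<Integer> starts = mentions.get(entry.getKey());
      if (starts == null) {
        starts = new HashSet<Integer>();
        mentions.put(entry.getKey(), starts);
      }
      starts.add(m.start());
    }
  }
  return mentions;
}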



On Tue, Oct 22, 2013 at 2:45 PM, Jörn Kottmann  wrote:

> On 10/05/2013 11:58 PM, Mark G wrote:
>
>> 2. Discovery of indicators for "country context" should be regex based, in
>> order to provide a more robust ability to discover context
>>
>> Currenty I use a String.indexOf(term) to discover the country hit list.
>> Regex would allow users to configure interesting ways to indicate
>> countries. Regex will also provide the array of start/end I need for issue
>> 1 from its Matcher.find
>>
>
> Can we reuse the name finder for this? The user could simply provide a
> name finder which
> can do this depending on what is possible for him, e.g. trained on his
> data, regex based,
> dictionary based, etc.
>
> Jörn
>


Re: Instantiation of an EntityLinker

2013-10-22 Thread Mark G
Thanks for the feedback Joern, here are my thoughts:
Totally agree about the inputstream and file overloads for EntityLinker
properties, I should update that immediately.

As for the multiple linkers, initially I thought people may want to link to
multiple external datasets for the same named entity with different linker
impls. But then, like I did in the GeoEntityLinker, the linker impl itself
can actually orchestrate the different connectors, so multiple linkers are
unnecessary for one type. So I agree that the factory should be simplified
to return one linker. As for the other param (entitytype), currently
the type param drives which property entry to use to instantiate the
appropriate linker.
For instance, the props file may look like this:

linker.location=org.apache.opennlp.tools.entitylinker.GeoEntityLinker
linker.person=my.project.class.MyPersonLinker
linker.organization=my.project.class.MyOrgLinker

The factory will return the appropriate linker for the entity type passed
in. Without that parameter we would need a separate ELprops file for each
type... and currently the entitylinkerprops object is used for other
properties. Essentially, if we took away the type param, then each
entitylinker impl would need its own properties file, and you would know
what file to load based on what type of entity you have. Part of the reason
for one file was to make the BaseEntityLinker simple to extend, and I am
used to working on large clusters where properties files can get out of
control, so I was shooting for requiring only one from the beginning. One
file per linker is totally doable though.
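
So resolution stays about this simple; a rough sketch of the reflection
lookup (not the exact factory code):

// "linker.<entitytype>" in the shared props file names the impl class
public static EntityLinker getLinker(String entityType,
    EntityLinkerProperties props) throws Exception {
  String className = props.getProperty("linker." + entityType, "");
  return (EntityLinker) Class.forName(className).newInstance();
}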

MG



On Tue, Oct 22, 2013 at 3:04 PM, Jörn Kottmann  wrote:

> Hi all,
>
> the EntityLinker is created by the EntityLinkerFactory, and that one
> requires an
> EntityLinkerProperties object, which defines the EntityLinker instance
> that is supposed
> to be created.
>
> The EntityLinkerProperties object can only be created from a file, I
> suggest that we
> extend this so it can be created from an InputStream as well, similar how
> it is possible
> with our other models (can be created form InputStream, File and URL).
>
> Additionally I propose that we only have one method to create one
> EntityLinker at a time which as the only
> parameter takes the properties file, all the settings can be stored inside
> it. That would make it
> easier for me to integrate the EntityLinker, because all that is needed to
> create it is the properties
> file and no further configuration parameters.
>
> For example:
> EntityLinker createEntityLinker(EntityLinkerProperties) throws
> IOException
>
> Is there a good reason to create multiple EntityLinkers with the same call?
>
> Any opinions?
>
> Jörn
>


Re: svn commit: r1533881 - /opennlp/sandbox/modelbuilder-prototype/

2013-10-20 Thread Mark G
All, I loaded the ModelBuilder-prototype project I mentioned earlier into
the sandbox. Please take a look when you get a chance. I have built a few
decent models with it already for location and person entities. The Example
class will walk through how it works and you can work it from there. The
impls used are file based, so you should be able to create a file of
sentences, a file of known entities, and a blacklist file to run the examples.
A good use case is something like this:
I have a corpus of data that can be broken into sentences, and I know my data,
so I can sample some of it to create lists of entities of different types
based on random searches (a list of people's names, for example). From here
the model builder will take the list of sentences, search for all the known
entities, and if it finds them it annotates the sentence and writes the
annotated sentences to a file. The file is then used to create a model, the
model is used to extract NEs, and the results (if they pass validation) are
added to the list of known entities and the loop starts over:
1: read sentences
2: extract knowns
3: annotate sentences based on knowns
4: build a model from the annotations
5: extract NEs with the model
6: add the names to the known entities
7: goto 1
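
In code the loop is roughly this (the abstract methods stand in for the
provider/validator/training plumbing in the sandbox project; the method
names here are illustrative, not the exact ones):

import java.util.List;
import java.util.Set;
import opennlp.tools.namefind.TokenNameFinderModel;

abstract class BootstrapLoop {
  abstract List<String> sentences();                  // the sentence provider
  abstract List<String> annotate(List<String> sentences, Set<String> known);
  abstract TokenNameFinderModel train(List<String> annotated); // training API
  abstract List<String> extract(TokenNameFinderModel model, List<String> sentences);
  abstract boolean isValidEntity(String name);        // blacklist/validation

  void run(Set<String> known, int iterations) {
    for (int i = 0; i < iterations; i++) {
      List<String> annotated = annotate(sentences(), known); // tag the knowns
      TokenNameFinderModel model = train(annotated);         // build a model
      for (String name : extract(model, sentences())) {      // run NER with it
        if (isValidEntity(name)) {
          known.add(name);                                   // grow the seed list
        }
      }
    }
  }
}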

let me know what you think

MG



On Sun, Oct 20, 2013 at 8:59 AM,  wrote:

> Author: markg
> Date: Sun Oct 20 12:59:13 2013
> New Revision: 1533881
>
> URL: http://svn.apache.org/r1533881
> Log:
> Prototype of a tool to allow users to create models from a set of
> known entities based on their own data in the form of sentences.
> See the Example class in the .v2 package.
>
> Added:
> opennlp/sandbox/modelbuilder-prototype/
>
>


Re: request for Input or ideas.... EntityLinker tickets

2013-10-11 Thread Mark G
I'll take a look at the Leipzig project; I'm not familiar with it. But the idea
is to allow users to wire up whatever data they have and not tie the tool
to any particular format; right now it just produces the opennlp format...
however, I can write a LeipzigSentenceProvider or LeipzigKnownEntityProvider
impl and it would work with the framework as is.
thanks


On Fri, Oct 11, 2013 at 6:13 AM, Jörn Kottmann  wrote:

> On 10/11/2013 11:51 AM, Mark G wrote:
>
>> Thanks Joern. Good question about license I wrote a web crawler and it
>> polls a bunch of RSS news feeds (google news and BBC mainly) as well as
>> wikipedia and then recursively scrapes to N depth on them. So It's
>> hard
>> to say what the license would be, I will look deeper, and maybe only use
>> the wiki data.
>>
>
> The Leipzig project is doing something similar for many languages, maybe
> it would be good
> solution to just make it work with their data format.
>
> What do you think?
>
> Jörn
>


Re: request for Input or ideas.... EntityLinker tickets

2013-10-11 Thread Mark G
Thanks Joern. Good question about the license... I wrote a web crawler and it
polls a bunch of RSS news feeds (Google News and BBC mainly) as well as
Wikipedia, and then recursively scrapes to depth N on them. So it's hard
to say what the license would be; I will look deeper, and maybe only use
the wiki data.
thanks


On Fri, Oct 11, 2013 at 3:17 AM, Jörn Kottmann  wrote:

> On 10/10/2013 06:54 PM, Mark G wrote:
>
>> thanks, I am also working on a rapid model builder framework that I would
>> like you to look at. I posted a description earlier but no feedback yet, I
>> was thinking I could check it into the sandbox so everyone can run it,
>> along with a filebased implementation that includes a file of ~200K
>> sentences.
>> This tool would allow users to specify a file of sentences from their
>> data,
>> a file (dictionary) of known named entities, and a blacklist file (for
>> false positive reduction) in order to build a model for a specific entity
>> type.
>>
>
> +1 I posted feedback to this on the user list.
>
> Just go ahead and open a Jira issue for it, and then add it to the sandbox.
>
> What is the license of the sentence file?
>
> Jörn
>


Re: request for Input or ideas.... EntityLinker tickets

2013-10-10 Thread Mark G
thanks, I am also working on a rapid model builder framework that I would
like you to look at. I posted a description earlier but no feedback yet, I
was thinking I could check it into the sandbox so everyone can run it,
along with a filebased implementation that includes a file of ~200K
sentences.
This tool would allow users to specify a file of sentences from their data,
a file (dictionary) of known named entities, and a blacklist file (for
false positive reduction) in order to build a model for a specific entity
type.


On Thu, Oct 10, 2013 at 12:00 PM, Jörn Kottmann  wrote:

> I will have a look at it tomorrow, we are planning on using the
> entitylinker in on of
> our systems.
>
> Jörn
>
>
> On 10/05/2013 11:58 PM, Mark G wrote:
>
>> All,
>> Before I plug some tickets into Jira, I wanted to get some feedback from
>> the team on some changes I would like to make to the EntityLinker
>> GeoEntityLinkerImpl
>> Below are what I consider improvement tickets
>>
>> 1. Only the first start and end are populated in CountryContext object
>> when
>> returned from CountryContext.find, it should return all instances of each
>> country mention in a map so the proximity of other toponyms to the found
>> country indicators can be included as a factor in the scoring
>>
>> Currently the user only gets the first indexOf for each country mention.
>> The country mentions are an attempt to better gauge ambiguous names( Paris
>> Texas rather than Paris France). Because of this, I am not able to do a
>> proximity analysis thoroughly to assist in scoring. Basically I need every
>> mention of every country indicator in the doc, which I will correlate with
>> every Named Entity span to produce a score. I am also not passing the list
>> of country codes into the database query as a where predicate, which would
>> improve performance tremendously (I will index the column).
>>
>> 2. Discovery of indicators for "country context" should be regex based, in
>> order to provide a more robust ability to discover context
>>
>> Currenty I use a String.indexOf(term) to discover the country hit list.
>> Regex would allow users to configure interesting ways to indicate
>> countries. Regex will also provide the array of start/end I need for issue
>> 1 from its Matcher.find
>>
>> 3. fuzzy string matching should be part of the scoring, this would allow
>> mysql fuzzy search to return more candidate toponyms.
>>
>> Currently, the search into the MySQL gazateers is using "boolean mode" and
>> each NER result is passed in as a literal string. If I implement a fuzzy
>> string matching based score (do we have one?) the user could turn on
>> "natural language" mode in MySQL then we can generate a score and thresh
>> to
>> allow for more recall on transliterated names etc
>> I would also like to use proximity to the majority of points in the
>> document as a disambiguation criteria as well.
>>
>> 4. provide a "solution wrapper" for the Geotagging capability
>>
>> In order to make the GeoTagging a bit more "out of the box" functional, I
>> was thinking of creating a class that one calls find(MaxentModel, doc,
>> sentencedetector, EntityLinkerProperties) to abstract the current impl. I
>> know this is not standard practice, just want to see what you all think.
>> This would make it "easier" to get this thing running.
>>
>> thanks!
>> MG
>>
>>
>


Re: svn commit: r1529520 - /opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/MySQLGeoNamesGazLinkable.java

2013-10-07 Thread Mark G
got it


On Mon, Oct 7, 2013 at 10:11 AM, Jörn Kottmann  wrote:

> Please always reference the issue number as part of the commit log message.
>
> Jörn
>
>
> On 10/05/2013 11:20 PM, ma...@apache.org wrote:
>
>> Author: markg
>> Date: Sat Oct  5 21:20:53 2013
>> New Revision: 1529520
>>
>> URL: http://svn.apache.org/r1529520
>> Log:
>> Was not using the EntityLinkerProperties passed in. Now fixed.
>>
>> Modified:
>>  opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/MySQLGeoNamesGazLinkable.java
>>
>> Modified: opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/MySQLGeoNamesGazLinkable.java
>> URL: http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/MySQLGeoNamesGazLinkable.java?rev=1529520&r1=1529519&r2=1529520&view=diff
>> ==============================================================
>> --- opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/MySQLGeoNamesGazLinkable.java (original)
>> +++ opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/MySQLGeoNamesGazLinkable.java Sat Oct  5 21:20:53 2013
>> @@ -4,7 +4,6 @@ package opennlp.tools.entitylinker;
>>   *
>>   * @author Owner
>>   */
>> -import java.io.File;
>>   import java.sql.CallableStatement;
>>   import java.sql.Connection;
>>   import java.sql.DriverManager;
>> @@ -56,8 +55,8 @@ public final class MySQLGeoNamesGazLinka
>>   return returnlocs;
>>   }
>> -  protected Connection getMySqlConnection(EntityLinkerProperties
>> properties) throws Exception {
>> -EntityLinkerProperties property = new EntityLinkerProperties(new
>> File("c:\\temp\\opennlpmodels\\entitylinker.properties"));
>> +  protected Connection getMySqlConnection(EntityLinkerProperties
>> property) throws Exception {
>> +   // EntityLinkerProperties property = new EntityLinkerProperties(new
>> File("c:\\temp\\opennlpmodels\\entitylinker.properties"));
>>   String driver = property.getProperty("mysql.driver",
>> "org.gjt.mm.mysql.Driver");
>>   String url = property.getProperty("mysql.url",
>> "jdbc:mysql://localhost:3306/world");
>>   String username = property.getProperty("mysql.username", "root");
>>
>>
>>
>


request for Input or ideas.... EntityLinker tickets

2013-10-05 Thread Mark G
All,
Before I plug some tickets into Jira, I wanted to get some feedback from
the team on some changes I would like to make to the EntityLinker
GeoEntityLinkerImpl
Below are what I consider improvement tickets

1. Only the first start and end are populated in CountryContext object when
returned from CountryContext.find, it should return all instances of each
country mention in a map so the proximity of other toponyms to the found
country indicators can be included as a factor in the scoring

Currently the user only gets the first indexOf for each country mention.
The country mentions are an attempt to better gauge ambiguous names (Paris,
Texas rather than Paris, France). Because of this, I am not able to do a
proximity analysis thoroughly to assist in scoring. Basically I need every
mention of every country indicator in the doc, which I will correlate with
every Named Entity span to produce a score. I am also not passing the list
of country codes into the database query as a where predicate, which would
improve performance tremendously (I will index the column).

2. Discovery of indicators for "country context" should be regex based, in
order to provide a more robust ability to discover context

Currently I use a String.indexOf(term) to discover the country hit list.
Regex would allow users to configure interesting ways to indicate
countries. Regex will also provide the array of start/end I need for issue
1 from its Matcher.find

3. fuzzy string matching should be part of the scoring, this would allow
mysql fuzzy search to return more candidate toponyms.

Currently, the search into the MySQL gazetteers is using "boolean mode" and
each NER result is passed in as a literal string. If I implement a fuzzy
string matching based score (do we have one? see the sketch after item 4),
the user could turn on "natural language" mode in MySQL, then we can
generate a score and threshold to allow for more recall on transliterated
names etc.
I would also like to use proximity to the majority of points in the
document as a disambiguation criteria as well.

4. provide a "solution wrapper" for the Geotagging capability

In order to make the GeoTagging a bit more "out of the box" functional, I
was thinking of creating a class that one calls find(MaxentModel, doc,
sentencedetector, EntityLinkerProperties) to abstract the current impl. I
know this is not standard practice, just want to see what you all think.
This would make it "easier" to get this thing running.
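
For item 3, if we don't already have a fuzzy score, a normalized
Levenshtein similarity would be a reasonable starting point (rough sketch):

// 1.0 for identical strings, 0.0 for nothing in common
static double similarity(String a, String b) {
  int[][] d = new int[a.length() + 1][b.length() + 1];
  for (int i = 0; i <= a.length(); i++) d[i][0] = i;
  for (int j = 0; j <= b.length(); j++) d[0][j] = j;
  for (int i = 1; i <= a.length(); i++) {
    for (int j = 1; j <= b.length(); j++) {
      int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
      d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
          d[i - 1][j - 1] + cost);
    }
  }
  int max = Math.max(a.length(), b.length());
  return max == 0 ? 1.0 : 1.0 - (double) d[a.length()][b.length()] / max;
}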

thanks!
MG


Re: New OpenNLP committer

2013-10-05 Thread Mark G
thanks!


On Wed, Oct 2, 2013 at 6:17 AM, William Colen wrote:

> Welcome Mark! Congratulations!
>
>
> 2013/10/2 Jörn Kottmann 
>
> > Hi,
> >
> > Please welcome Mark Giaconia as the latest new OpenNLP committer!
> >
> > Jörn
> >
>


Re: Next Steps for OpenNLP

2013-10-03 Thread Mark G
;around");
badentities.add("has");
badentities.add("turn");
badentities.add("surrounding");
badentities.add("\" No");
badentities.add("aug.");
badentities.add("or");
badentities.add("quips");
badentities.add("september");
badentities.add("[mr");
badentities.add("diseases");
badentities.add("when");
badentities.add("bbc");
badentities.add(":\"");
badentities.add("dr");
badentities.add("baby");
badentities.add("on");
badentities.add("route");
badentities.add("'");
badentities.add("\"");
badentities.add("a");
badentities.add("her");
badentities.add("'");
badentities.add("\"");
badentities.add("two");
badentities.add("that");
badentities.add(":");
badentities.add("one");
return badentities;
  }

  @Override
  public Boolean isValidEntity(String token) {
    if (badentities.isEmpty()) {
      getBlacklist(); // lazily populate the blacklist on first use
    }
    // multi-word names are rejected if any word is blacklisted;
    // single-word names are always rejected
    String[] tokens = token.toLowerCase().split(" ");
    if (tokens.length >= 2) {
      for (String t : tokens) {
        if (badentities.contains(t.trim())) {
          System.out.println("bad token : " + token);
          return false;
        }
      }
    } else {
      System.out.println("bad token : " + token);
      return false;
    }

    // reject names containing non-alpha chars, unless it is an inner hyphen
    Pattern p = Pattern.compile("[^a-z ]", Pattern.CASE_INSENSITIVE |
        Pattern.MULTILINE);
    if (p.matcher(token).find()) {
      System.out.println("hit on [^a-z\\- ]  :  " + token);
      if (!token.toLowerCase().matches(".*[a-z]\\-[a-z].*")) {
        System.out.println("bad token : " + token);
        return false;
      } else {
        System.out.println("false pos : " + token);
      }
    }
    Boolean b = true;
    if (badentities.contains(token.toLowerCase())) {
      System.out.println("bad token : " + token);
      b = false;
    }
    return b;
  }

  @Override
  public Boolean isValidEntity(String token, double prob) {
    // gate on the model's confidence before running the string checks
    Boolean b = false;
    if (prob < MIN_SCORE_FOR_TRAINING) {
      b = false;
    } else {
      b = isValidEntity(token);
    }
    return b;
  }

  @Override
  public Boolean isValidEntity(String token, Span namedEntity, String[]
      words, String[] posWhiteList, String[] pos) {

    boolean b = isValidEntity(token);
    if (!b) {
      return b;
    }
    // POS-based check over the entity span, compared against the whitelist
    for (int start = namedEntity.getStart(); start < namedEntity.getEnd();
        start++) {
      for (String ps : pos) {
        if (!ps.equals(posWhiteList[start])) {
          return false;
        }
      }
    }
    return b;
  }
}

The AnnotatedSentenceWriter dictates where to write the output sentences.
This is great if someone is doing this in a distributed way (like in
Hadoop); this could write out to HBase or HDFS or something, where it could
be crowdsourced or whatever...


public class GenericAnnotatedSentenceWriter implements
AnnotatedSentenceWriter {

  private String path = "c:\\temp\\opennlpmodels\\en-ner-person.train";

  @Override
  public void write(List<String> sentences) {
try {
  FileWriter writer = new FileWriter(this.getFilePath(), false);

  for (String s : sentences) {
writer.write(s.trim() + "\n");
  }
  writer.close();
    } catch (IOException ex) {
      // swallowed in the prototype
    }
  }

  @Override
  public void setFilePath(String path) {
this.path = path;
  }

  @Override
  public String getFilePath() {
return path;
  }
}

if you made it this far down the email please let me know what you think. I
believe it has potential.

thanks
MG



On Thu, Oct 3, 2013 at 4:02 AM, Jörn Kottmann  wrote:

> On 10/02/2013 02:06 AM, Mark G wrote:
>
>> I've been using OpenNLP for a few years and I find the best results occur
>> when the models are generated using samples of the data they will be run
>> against, one of the reasons I like the Maxent approach. I am not sure
>> attempting to provide models will bear much fruit other than users will no
>> longer be afraid of the licensing issues associated with using them in
>> commercial systems. I do strongly think we should provide a modelbuilding
>> framework (that calls the training api) and a default impl.
>> CoincidentallyI have been building a framework and impl over the last
>> few months that creates models based on seeding an iterative process with
>> known entities and iterating through a set of supplied sentences to
>> recursively create annotations, write them, create a maxentmodel, load the
>> model, create more annotations based on the results (there is a validation
>> object involved), and so on With this method I was able to create an
>> NER model for people's names against a 200K sentence corpus that returns
>> acceptable results just by starting with a list of five highly unambiguous
>> names. I will propose the framework in more detail in the coming days and
>> supply my impl if everyone is interested.
>> As for the initial question, I would like to see OpenNLP provide a
>> framework for rapidly/semi-automatically building models out of user data,
>> and also performing entity resolution across documents, in order to assign
>> a probability to whether the "Bob" in one document is the same as "Bob" in
>> another.
>>
>>
> Sounds very interesting. The sentence-wise training data which is produced
> this way could
> also be combined with existing training data, or just be used to bootstrap
> a model to more
> efficiently label data with a document-level annotation tool.
>
> Another aspect is that this tool might be good at detecting mistakes in
> existing training data.
>
> Jörn
>
>
>


Re: Next Steps for OpenNLP

2013-10-01 Thread Mark G
I've been using OpenNLP for a few years and I find the best results occur
when the models are generated using samples of the data they will be run
against, one of the reasons I like the Maxent approach. I am not sure
attempting to provide models will bear much fruit other than users will no
longer be afraid of the licensing issues associated with using them in
commercial systems. I do strongly think we should provide a modelbuilding
framework (that calls the training api) and a default impl.
Coincidentally... I have been building a framework and impl over the last
few months that creates models based on seeding an iterative process with
known entities and iterating through a set of supplied sentences to
recursively create annotations, write them, create a maxentmodel, load the
model, create more annotations based on the results (there is a validation
object involved), and so on... With this method I was able to create an
NER model for people's names against a 200K sentence corpus that returns
acceptable results just by starting with a list of five highly unambiguous
names. I will propose the framework in more detail in the coming days and
supply my impl if everyone is interested.
As for the initial question, I would like to see OpenNLP provide a
framework for rapidly/semi-automatically building models out of user data,
and also performing entity resolution across documents, in order to assign
a probability to whether the "Bob" in one document is the same as "Bob" in
another.
MG


On Tue, Oct 1, 2013 at 11:01 AM, Michael Schmitz
wrote:

> Hi, I've used OpenNLP for a few years--in particular the chunker, POS
> tagger, and tokenizer.  We're grateful for a high performance library
> with an Apache license, but one of our greatest complaints is the
> quality of the models.  Yes--we're aware we can train our own--but
> most people are looking for something that is good enough out of the
> box (we aim for this with our products).  I'm not surprised that
> volunteer engineers don't want to spend their time annotating data ;-)
>
> I'm curious what other people see as the biggest shortcomings for Open
> NLP or the most important next steps for OpenNlp.  I may have an
> opportunity to contribute to the project and I'm trying to figure out
> where the community thinks the biggest impact could be made.
>
> Peace.
> Michael Schmitz
>


Re: Triplet Extraction with OpenNLP

2013-09-27 Thread Mark G
Internally to the Parse class, I think the showCodeTree() method is doing
something similar to what you might want (as a start); it is a recursive
method for traversing through the children of the top parse object. If you
have the source code, look at the Parse object and the showCodeTree method.
I was thinking you could construct a sorted map (TreeMap) with part of
speech or chunk as the key, and a TreeSet of parses as the value for each
key, ordered by where they are mentioned, so you would be able to get the
first or last from the value set depending on the position and type of the
key. Just a rough thought though...
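As a rough cut using the public Parse accessors (a sketch, untested):

import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import opennlp.tools.parser.Parse;

// walk the tree and index every constituent by its type (NP, VP, NN, ...),
// each bucket ordered by where it starts in the sentence
static void collect(Parse p, Map<String, TreeSet<Parse>> index) {
  TreeSet<Parse> bucket = index.get(p.getType());
  if (bucket == null) {
    bucket = new TreeSet<Parse>(new Comparator<Parse>() {
      public int compare(Parse a, Parse b) {
        int c = Integer.compare(a.getSpan().getStart(), b.getSpan().getStart());
        return c != 0 ? c : Integer.compare(a.getSpan().getEnd(), b.getSpan().getEnd());
      }
    });
    index.put(p.getType(), bucket);
  }
  bucket.add(p);
  for (Parse child : p.getChildren()) {
    collect(child, index);
  }
}
// usage: index = new TreeMap<String, TreeSet<Parse>>(); collect(topParses[0], index);
// then index.get("NP").first() is the first NP in the sentence
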
Mark G


On Fri, Sep 27, 2013 at 3:09 AM, Carlos Scheidecker wrote:

> This is awesome Mark, thanks!
>
> This will be quite useful for everybody else as well.
>
> I ended up writing my own and went further with the other part of the
> extraction.
>
> What I found interesting is the time it takes to load the
> model en-parser-chunking.bin, which is about 36 MB.
>
> So I am not loading it every time, just once at object creation.
>
> Does anyone have a better suggestion?
>
> cheers.
>
>
> On Thu, Sep 26, 2013 at 4:59 PM, Mark G  wrote:
>
> > Carlos.. I threw this together to show how to get a Parser running.
> > Look at what this prints, I think you may be able to iterate through
> > topParses[] and traverse the tree. If there is a more efficient way I am
> > sure the other OpenNLPers will chime in.
> >
> >
> >   public static void main(String[] args) throws InvalidFormatException,
> > IOException {
> >
> > InputStream is = new
> > FileInputStream("c:\\temp\\opennlpmodels\\en-parser-chunking.bin");
> >
> > ParserModel model = new ParserModel(is);
> > is.close();
> > Parser parser = ParserFactory.create(model);
> >
> > String sentence = "The countries broke off peace talks following the
> > Mumbai attacks but have begun discussions again, focusing on increasing
> > trade.";
> > Parse topParses[] = ParserTool.parseLine(sentence, parser, 1);
> >
> > Parse p = topParses[0];
> > p.showCodeTree();
> > p.show();
> > p.getParent();
> > p.getChildren();
> >
> >
> > System.out.println(p.getText());
> >   }
> >
> > It should print all this...
> >
> > [0] S 2092924121 -> 2092924121 TOP The countries broke off peace talks
> > following the Mumbai attacks but have begun discussions again, focusing
> on
> > increasing trade.
> > [0.0] NP 2092766686 -> 2092924121 S The countries
> > [0.0.0] DT 2092752996 -> 2092766686 NP The
> > [0.0.0.0] TK 2092752996 -> 2092752996 DT The
> > [0.0.1] NNS 2092969298 -> 2092766686 NP countries
> > [0.0.1.0] TK 2092969298 -> 2092969298 NNS countries
> > [0.1] VP 2093633263 -> 2092924121 S broke off peace talks following the
> > Mumbai attacks but have begun discussions again, focusing on increasing
> > trade.
> > [0.1.0] VP 2093545647 -> 2093633263 VP broke off peace talks following
> the
> > Mumbai attacks
> > [0.1.0.0] VBD 2093484042 -> 2093545647 VP broke
> > [0.1.0.0.0] TK 2093484042 -> 2093484042 VBD broke
> > [0.1.0.1] PRT 2093793436 -> 2093545647 VP off
> > [0.1.0.1.0] RP 2093793436 -> 2093793436 PRT off
> > [0.1.0.1.0.0] TK 2093793436 -> 2093793436 RP off
> > [0.1.0.2] NP 2094012476 -> 2093545647 VP peace talks
> > [0.1.0.2.0] NN 2094004262 -> 2094012476 NP peace
> > [0.1.0.2.0.0] TK 2094004262 -> 2094004262 NN peace
> > [0.1.0.2.1] NNS 2094316394 -> 2094012476 NP talks
> > [0.1.0.2.1.0] TK 2094316394 -> 2094316394 NNS talks
> > [0.1.0.3] PP 2094660013 -> 2093545647 VP following the Mumbai attacks
> > [0.1.0.3.0] VBG 2094634002 -> 2094660013 PP following
> > [0.1.0.3.0.0] TK 2094634002 -> 2094634002 VBG following
> > [0.1.0.3.1] NP 2095166543 -> 2094660013 PP the Mumbai attacks
> > [0.1.0.3.1.0] DT 2095146008 -> 2095166543 NP the
> > [0.1.0.3.1.0.0] TK 2095146008 -> 2095146008 DT the
> > [0.1.0.3.1.1] NNP 2095358203 -> 2095166543 NP Mumbai
> > [0.1.0.3.1.1.0] TK 2095358203 -> 2095358203 NNP Mumbai
> > [0.1.0.3.1.2] NNS 2095723726 -> 2095166543 NP attacks
> > [0.1.0.3.1.2.0] TK 2095723726 -> 2095723726 NNS attacks
> > [0.1.1] CC 2096134426 -> 2093633263 VP but
> > [0.1.1.0] TK 2096134426 -> 2096134426 CC but
> > [0.1.2] VP 2096419178 -> 2093633263 VP have begun discussions again,
> > focusing on increasing trade.
> > [0.1.2.0] VBP 2096343883 -> 2096419178 VP have
> > [0.1.2.0.0] TK 2096343883 -> 2096343883 VBP have
> > [

Re: Triplet Extraction with OpenNLP

2013-09-26 Thread Mark G
Carlos, I threw this together to show how to get a Parser running. Look at
what it prints; I think you may be able to iterate through topParses[] and
traverse the tree. If there is a more efficient way, I am sure the other
OpenNLPers will chime in.


import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;
import opennlp.tools.util.InvalidFormatException;

public static void main(String[] args) throws InvalidFormatException, IOException {

  // load the pre-trained parsing model; this takes a few seconds,
  // so do it once and reuse the parser
  InputStream is = new FileInputStream("c:\\temp\\opennlpmodels\\en-parser-chunking.bin");
  ParserModel model = new ParserModel(is);
  is.close();
  Parser parser = ParserFactory.create(model);

  String sentence = "The countries broke off peace talks following the Mumbai"
      + " attacks but have begun discussions again, focusing on increasing trade.";

  // ask for the single best parse; raise the last argument for n-best parses
  Parse[] topParses = ParserTool.parseLine(sentence, parser, 1);

  Parse p = topParses[0];
  p.showCodeTree();  // prints the indexed tree dump shown below
  p.show();          // prints the treebank-style bracketing at the end
  p.getParent();     // return values unused here; getParent()/getChildren()
  p.getChildren();   // are the methods to use to walk the tree yourself

  System.out.println(p.getText());
}

It should print all this...

[0] S 2092924121 -> 2092924121 TOP The countries broke off peace talks
following the Mumbai attacks but have begun discussions again, focusing on
increasing trade.
[0.0] NP 2092766686 -> 2092924121 S The countries
[0.0.0] DT 2092752996 -> 2092766686 NP The
[0.0.0.0] TK 2092752996 -> 2092752996 DT The
[0.0.1] NNS 2092969298 -> 2092766686 NP countries
[0.0.1.0] TK 2092969298 -> 2092969298 NNS countries
[0.1] VP 2093633263 -> 2092924121 S broke off peace talks following the
Mumbai attacks but have begun discussions again, focusing on increasing
trade.
[0.1.0] VP 2093545647 -> 2093633263 VP broke off peace talks following the
Mumbai attacks
[0.1.0.0] VBD 2093484042 -> 2093545647 VP broke
[0.1.0.0.0] TK 2093484042 -> 2093484042 VBD broke
[0.1.0.1] PRT 2093793436 -> 2093545647 VP off
[0.1.0.1.0] RP 2093793436 -> 2093793436 PRT off
[0.1.0.1.0.0] TK 2093793436 -> 2093793436 RP off
[0.1.0.2] NP 2094012476 -> 2093545647 VP peace talks
[0.1.0.2.0] NN 2094004262 -> 2094012476 NP peace
[0.1.0.2.0.0] TK 2094004262 -> 2094004262 NN peace
[0.1.0.2.1] NNS 2094316394 -> 2094012476 NP talks
[0.1.0.2.1.0] TK 2094316394 -> 2094316394 NNS talks
[0.1.0.3] PP 2094660013 -> 2093545647 VP following the Mumbai attacks
[0.1.0.3.0] VBG 2094634002 -> 2094660013 PP following
[0.1.0.3.0.0] TK 2094634002 -> 2094634002 VBG following
[0.1.0.3.1] NP 2095166543 -> 2094660013 PP the Mumbai attacks
[0.1.0.3.1.0] DT 2095146008 -> 2095166543 NP the
[0.1.0.3.1.0.0] TK 2095146008 -> 2095146008 DT the
[0.1.0.3.1.1] NNP 2095358203 -> 2095166543 NP Mumbai
[0.1.0.3.1.1.0] TK 2095358203 -> 2095358203 NNP Mumbai
[0.1.0.3.1.2] NNS 2095723726 -> 2095166543 NP attacks
[0.1.0.3.1.2.0] TK 2095723726 -> 2095723726 NNS attacks
[0.1.1] CC 2096134426 -> 2093633263 VP but
[0.1.1.0] TK 2096134426 -> 2096134426 CC but
[0.1.2] VP 2096419178 -> 2093633263 VP have begun discussions again,
focusing on increasing trade.
[0.1.2.0] VBP 2096343883 -> 2096419178 VP have
[0.1.2.0.0] TK 2096343883 -> 2096343883 VBP have
[0.1.2.1] VP 2096672443 -> 2096419178 VP begun discussions again, focusing
on increasing trade.
[0.1.2.1.0] VBN 2096605362 -> 2096672443 VP begun
[0.1.2.1.0.0] TK 2096605362 -> 2096605362 VBN begun
[0.1.2.1.1] NP 2096925708 -> 2096672443 VP discussions
[0.1.2.1.1.0] NNS 2096925708 -> 2096925708 NP discussions
[0.1.2.1.1.0.0] TK 2096925708 -> 2096925708 NNS discussions
[0.1.2.1.2] PP 2097584197 -> 2096672443 VP again, focusing on increasing
trade.
[0.1.2.1.2.0] IN 2097543127 -> 2097584197 PP again,
[0.1.2.1.2.0.0] TK 2097543127 -> 2097543127 IN again,
[0.1.2.1.2.1] S 2097938768 -> 2097584197 PP focusing on increasing trade.
[0.1.2.1.2.1.0] VP 2097938768 -> 2097938768 S focusing on increasing trade.
[0.1.2.1.2.1.0.0] VBG 2097910019 -> 2097938768 VP focusing
[0.1.2.1.2.1.0.0.0] TK 2097910019 -> 2097910019 VBG focusing
[0.1.2.1.2.1.0.1] PP 2098394645 -> 2097938768 VP on increasing trade.
[0.1.2.1.2.1.0.1.0] IN 2098370003 -> 2098394645 PP on
[0.1.2.1.2.1.0.1.0.0] TK 2098370003 -> 2098370003 IN on
[0.1.2.1.2.1.0.1.1] NP 2098546604 -> 2098394645 PP increasing trade.
[0.1.2.1.2.1.0.1.1.0] VBG 2098537021 -> 2098546604 NP increasing
[0.1.2.1.2.1.0.1.1.0.0] TK 2098537021 -> 2098537021 VBG increasing
[0.1.2.1.2.1.0.1.1.1] NN 2099103787 -> 2098546604 NP trade.
[0.1.2.1.2.1.0.1.1.1.0] TK 2099103787 -> 2099103787 NN trade.
(TOP (S (NP (DT The) (NNS countries)) (VP (VP (VBD broke) (PRT (RP off))
(NP (NN peace) (NNS talks)) (PP (VBG following) (NP (DT the) (NNP Mumbai)
(NNS attacks)))) (CC but) (VP (VBP have) (VP (VBN begun) (NP (NNS
discussions)) (PP (IN again,) (S (VP (VBG focusing) (PP (IN on) (NP (VBG
increasing) (NN trade.)))))))))))
The countries broke off peace talks following the Mumbai attacks but have
begun discussions again, focusing on increasing trade

let me know how it works

happy coding!

Mark G



On Thu, Sep 26, 2013 at 4:14 PM, Carlos Scheidecker