New lines at end of source files

2017-02-13 Thread Jeffrey Zemerick
On a recent pull request there was a comment that some new source files did
not have newlines at the ends of the files. When I added a rule to
checkstyle for newlines at the ends of files, there were a good number of
files in violation of that rule, so I added the newlines to those files.
Because it touches a lot of files, I wanted to discuss it here before
submitting it as a pull request.

The changes:
https://github.com/apache/opennlp/compare/master...jzonthemtn:checkstyle-nl?expand=1
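
For reference, Checkstyle ships a standard check for this. A minimal sketch of the configuration (the property value shown is illustrative; the check also works with its defaults):

```xml
<!-- Checker-level module: flags files that do not end with a newline. -->
<module name="Checker">
  <module name="NewlineAtEndOfFile">
    <!-- Illustrative: accept only LF line endings at end of file. -->
    <property name="lineSeparator" value="lf"/>
  </module>
</module>
```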

I'd be happy to open it as a pull request if it's OK; if not, no problem!

Thanks,
Jeff


Hardcoded length in prefix and suffix feature generators

2017-02-09 Thread Jeffrey Zemerick
Hi,

I noticed that the length is hardcoded to 4 in the PrefixFeatureGenerator
and the SuffixFeatureGenerator. I made this value configurable in the XML
for each feature generator. I also added a check on the length to keep
duplicate prefixes or suffixes from being returned. (If the token is "yes"
with a length of 4, there would otherwise be two "yes" features returned.)
If a value is not provided in the XML, the default value of 4 is used.

You can preview the changes here:
https://github.com/apache/opennlp/compare/master...jzonthemtn:prefixsuffix?expand=1
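
As a standalone illustration of the capping logic described above (method and feature names are illustrative, not the actual OpenNLP code): capping the loop at the token length prevents the duplicate feature for short tokens.

```java
import java.util.ArrayList;
import java.util.List;

public class SuffixFeatures {

    // Sketch: generate suffix features up to a configurable maximum length,
    // capped at the token length so short tokens don't yield duplicates.
    static List<String> suffixes(String token, int maxLength) {
        List<String> feats = new ArrayList<>();
        int upper = Math.min(maxLength, token.length());
        for (int i = 1; i <= upper; i++) {
            feats.add("suf=" + token.substring(token.length() - i));
        }
        return feats;
    }

    public static void main(String[] args) {
        // "yes" with maxLength 4: the cap yields exactly three features.
        System.out.println(suffixes("yes", 4));
    }
}
```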

If this is a change that's desired by the group I can make a JIRA and a
pull request.

Thanks,
Jeff


Re: Multiple models and String.intern

2017-02-08 Thread Jeffrey Zemerick
I did not know that about StringTableSize. I thought it was more of a hard
limit. That's good to know. Thanks

On Wed, Feb 8, 2017 at 2:16 PM, Joern Kottmann  wrote:

> The StringTableSize doesn't limit the amount of Strings that can be stored
> in the pool, if the size is too small it just gets slower.
> This would only be done for loading models, querying the model wouldn't be
> affected. The predicate / feature strings would be interned.
>
> Jörn
>
>
>
> On Wed, Feb 8, 2017 at 6:37 PM, Jeffrey Zemerick 
> wrote:
>
> > Would it be possible to have an option or setting somewhere that
> > determines whether string pooling is used? The option would provide
> > backward compatibility in case someone has to adjust the
> > -XX:StringTableSize because their existing models exceed the default
> > JVM limit, and an option would also be useful for cases when the models
> > were made from different data sources. (I'm assuming in that case using
> > string pooling would be detrimental to performance.)
> >
> > Jeff
> >
> >
> > On Wed, Feb 8, 2017 at 5:50 AM, Joern Kottmann 
> wrote:
> >
> > > Hello all,
> > >
> > > I often run multiple models in production, often trained on the same
> > > data but with different types (typical name finder scenario). There
> > > could be one model to detect person names and another to detect
> > > locations. The predicate Strings inside those models are always the
> > > same, but the models can't share the same String instance.
> > >
> > > I would like to propose that we use String.intern in the model reader
> > > to ensure one string is only loaded once.
> > >
> > > We tried that in the past and this caused lots of issues with PermGen
> > > space, but this was improved over time in Java. In Java 8 (on which we
> > > depend now) this should work properly.
> > >
> > > Here is an interesting article about it:
> > > http://java-performance.info/string-intern-in-java-6-7-8/
> > >
> > > Using String.intern will make the model loading a bit slower (we can
> > > benchmark that).
> > >
> > > Jörn
> > >
> >
>


Re: Multiple models and String.intern

2017-02-08 Thread Jeffrey Zemerick
Would it be possible to have an option or setting somewhere that determines
whether string pooling is used? The option would provide backward compatibility
in case someone has to adjust the -XX:StringTableSize because their
existing models exceed the default JVM limit, and an option would also be
useful for cases when the models were made from different data sources.
(I'm assuming in that case using string pooling would be detrimental to
performance.)

Jeff


On Wed, Feb 8, 2017 at 5:50 AM, Joern Kottmann  wrote:

> Hello all,
>
> I often run multiple models in production, often trained on the same data
> but with different types (typical name finder scenario). There could be one
> model to detect person names, and another to detect locations. The
> predicate Strings inside those models are always the same but the models
> can't share the same String instance.
>
> I would like to propose that we use String.intern in the model reader to
> ensure one string is only loaded once.
>
> We tried that in the past and this caused lots of issues with PermGen
> space, but this was improved over time in Java. In Java 8 (on which we
> depend now) this should work properly.
>
> Here is an interesting article about it:
> http://java-performance.info/string-intern-in-java-6-7-8/
>
> Using String.intern will make the model loading a bit slower (we can
> benchmark that).
>
> Jörn
>
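
A small sketch of the behavior the proposal above relies on: String.intern maps equal strings to a single canonical instance, so predicate strings loaded by separate model readers could end up shared. (Standalone illustration, not the OpenNLP model-reader code.)

```java
public class InternDemo {

    // Sketch: after intern(), two equal strings resolve to one pooled
    // instance, which is what would let separately loaded models share
    // their predicate strings.
    static boolean sameInstanceAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        String a = new String("person=Barack");
        String b = new String("person=Barack");
        System.out.println(a == b);                        // false: two heap objects
        System.out.println(sameInstanceAfterIntern(a, b)); // true: one pooled instance
    }
}
```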


Re: [VOTE] Apache OpenNLP 1.7.2 Release Candidate

2017-02-03 Thread Jeffrey Zemerick
+1 (non-binding). Build and tests pass with no issues.



On Fri, Feb 3, 2017 at 4:15 AM, Joern Kottmann  wrote:

> +1
>
> I did run the eval tests and they all run through except one, which
> needed more memory. That test case has to be adapted to run fast and
> with much less memory; we should do that for the 1.7.3 release.
>
> Jörn
>
> On Wed, Feb 1, 2017 at 5:52 PM, Suneel Marthi  wrote:
>
> > The Apache OpenNLP PMC would like to call for a Vote on the Apache
> > OpenNLP 1.7.2 Release Candidate.
> >
> > The Release artifacts can be downloaded from:
> >
> > https://repository.apache.org/content/repositories/orgapacheopennlp-1010/org/apache/opennlp/opennlp-distr/1.7.2/
> >
> > The release was made from the Apache OpenNLP 1.7.2 tag at
> >
> > https://github.com/apache/opennlp/tree/opennlp-1.7.2
> >
> > To use it in a maven build, set the version for opennlp-tools or
> > opennlp-uima to 1.7.2 and add the following URL to your settings.xml
> > file:
> >
> > https://repository.apache.org/content/repositories/orgapacheopennlp-1010
> >
> > The artifacts have been signed with the Key - D3541808 found at
> >
> > http://people.apache.org/keys/group/opennlp.asc
> >
> > Please vote on releasing these packages as Apache OpenNLP 1.7.2. The
> > vote is open for the next 72 hours or until a minimum of 3 binding +1
> > PMC votes is cast, whichever happens earlier.
> >
> > Only votes from OpenNLP PMC are binding, but folks are welcome to check
> > the release candidate and voice their approval or disapproval. The vote
> > passes if at least three binding +1 votes are cast.
> >
> > [ ] +1 Release the packages as Apache OpenNLP 1.7.2
> >
> > [ ] -1 Do not release the packages because...
> >
> > Thanks again to all the committers and contributors for their work over
> > the past few weeks.
> >
>


Re: [VOTE] Apache OpenNLP 1.7.1 Release Candidate 1

2017-01-21 Thread Jeffrey Zemerick
I went to the opennlp-distr/README for a summary of changes in 1.7.1, but I
think it is the same as it was for 1.7.0. Is that file typically updated
for revision releases? Also, the link at the bottom of the RELEASE_NOTES to
the fixed JIRA issues points to issuesFixed/jira-report.html. Minor stuff,
but I thought I'd ask.

Otherwise, a non-binding +1.

On Fri, Jan 20, 2017 at 6:18 PM, Suneel Marthi  wrote:

> The Apache OpenNLP PMC would like to call for a Vote on Apache OpenNLP
> 1.7.1 Release Candidate.
>
> The Release artifacts can be downloaded from:
> https://repository.apache.org/content/repositories/orgapacheopennlp-1008/org/apache/opennlp/opennlp-distr/1.7.1/
>
> The release was made from the Apache OpenNLP 1.7.1 tag at
> https://github.com/apache/opennlp/tree/opennlp-1.7.1
>
> To use it in a maven build set the version for opennlp-tools or
> opennlp-uima to 1.7.1
> and add the following URL to your settings.xml file:
> https://repository.apache.org/content/repositories/orgapacheopennlp-1008
>
> The artifacts have been signed with the Key - D3541808 found at
> http://people.apache.org/keys/group/opennlp.asc
>
> Please vote on releasing these packages as Apache OpenNLP 1.7.1. The vote
> is open for the next 72 hours or until a minimum of 3 binding +1 PMC
> votes is cast.
>
> Only votes from OpenNLP PMC are binding, but folks are welcome to check the
> release candidate and voice their approval or disapproval. The vote passes
> if at least three binding +1 votes are cast.
>
> [ ] +1 Release the packages as Apache OpenNLP 1.7.1
> [ ] -1 Do not release the packages because...
> [ ]  0 I Care Less/I Don't Care
>
> Thanks again to all the committers and contributors for their work over the
> past few weeks.
>
> Suneel Marthi
>


Re: Commit message style

2017-01-09 Thread Jeffrey Zemerick
I'm personally a fan of the issue number being the first thing on the
subject line, like "OPENNLP-xxx: commit message." For me it gives a
consistent place to look for the issue without having to read the full
message. (That way you can also see the issue number in GitHub's commit
list without having to expand the commit.)

On Mon, Jan 9, 2017 at 1:48 PM, Joern Kottmann  wrote:

> It doesn't matter where the jira# is placed, as long as it is there.
>
> Can be in the first line or occur somewhere later in the message,
> for example see OPENNLP-914. There it was placed in the body.
>
> Jörn
>
> On Mon, 2017-01-09 at 13:20 -0500, Suneel Marthi wrote:
> > I guess the reason to include the jira# at the beginning of the message
> > is because the same would be reflected in the corresponding jira (I
> > could be wrong here).
> >
> > I am not sure if omitting the issue# in the git subject line would
> > still reflect the git convo in jira or not.
> >
> >
> >
> > On Mon, Jan 9, 2017 at 8:26 AM, Joern Kottmann 
> > wrote:
> >
> > > Hello all,
> > >
> > > we are using different styles for commit messages. It would be good
> > > to have a short discussion on how we think they should look, and to
> > > agree on how to write the subject line.
> > >
> > > Here are a few points from me:
> > > - Good commit messages are important for understanding what happened
> > >   in the project, and they motivate well-thought-through commits
> > > - In git the subject line (the first line of the commit message)
> > >   should be around 50 chars; GitHub cuts it after 72 chars and knows
> > >   this convention
> > > - The subject line is usually written in the imperative (git
> > >   convention)
> > > - Capitalize the first word (like in a new sentence)
> > > - The commit message should contain the issue number
> > >
> > > Open questions:
> > > - Should the issue number be in the subject line? Or in the body?
> > > - Is everyone fine with writing the subject line in the imperative?
> > >
> > > Here is an interesting article about it:
> > > http://chris.beams.io/posts/git-commit/
> > >
> > > Jörn
> > >
>


TODO in GeneratorFactory.java

2016-12-13 Thread Jeffrey Zemerick
Hi everyone,

I came across a TODO in GeneratorFactory.java to make
the TokenClassFeatureGenerator construction configurable. My cursory search
of JIRA didn't show any related tasks. I presumed the configuration was
referring to the boolean generateWordAndClassFeature argument in the
constructor of TokenClassFeatureGenerator. I added the parameter to the XML
and pass that value to the constructor. If there is no parameter in the
XML, it defaults to `true` to maintain backward compatibility. A diff of my
changes can be seen at:
https://github.com/apache/opennlp/compare/trunk...jzonthemtn:TokenClassFeatureGenerator
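
A minimal standalone sketch of the defaulting behavior described above (the attribute handling in the actual patch may differ; the helper name is illustrative):

```java
public class AttrDefault {

    // Sketch: parse an optional descriptor attribute value, defaulting to
    // true when the attribute is absent or empty, so existing XML
    // descriptors keep their old behavior.
    static boolean parseOrDefault(String attrValue) {
        return (attrValue == null || attrValue.isEmpty())
                ? true
                : Boolean.parseBoolean(attrValue);
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault(null));    // true (default)
        System.out.println(parseOrDefault("false")); // false (explicit)
    }
}
```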

If this is still a valid "todo" item and the changes I made are what was
desired, I can submit a patch or a pull request.

Jeff


Re: AdaptiveFeatureGenerator and FeatureGeneratorAdapter

2016-07-05 Thread Jeffrey Zemerick
Hi Jörn,

Yes, it would break backward compatibility. I will take your advice and try
that. Thanks!

Jeff


On Mon, Jul 4, 2016 at 5:53 AM, Joern Kottmann  wrote:

> Hello,
>
> as far as I understand the proposed change will break backward
> compatibility for users who implement AdaptiveFeatureGenerator.
> Is that correct?
>
> Anyway, I always like the idea of making things simpler. In Java 8 it is
> possible to declare default methods in an interface.
> http://docs.oracle.com/javase/tutorial/java/IandI/defaultmethods.html
>
> This can probably be done without breaking backward compatibility, and
> then we can mark FeatureGeneratorAdapter as deprecated so it can be
> removed one day.
>
> Would be nice if you could open an issue for it.
>
> Jörn
>
> On Mon, Jun 27, 2016 at 11:57 PM, Jeffrey Zemerick 
> wrote:
>
> > Hi all,
> >
> > Under the package opennlp.tools.util.featuregen there is an
> > interface AdaptiveFeatureGenerator and an abstract
> > class FeatureGeneratorAdapter. The interface defines
> > the createFeatures(), updateAdaptiveData(), and clearAdaptiveData()
> > methods. The abstract class implements this interface to provide
> > default implementations of the updateAdaptiveData() and
> > clearAdaptiveData() methods. Feature generators then either implement
> > the interface or extend the abstract class.
> >
> > I created a patch to refactor these classes to remove the interface and
> > use the abstract class. (My motivation was that I kept getting
> > AdaptiveFeatureGenerator and FeatureGeneratorAdapter confused due to
> > their similar naming and the inconsistency of feature generators either
> > implementing or extending.) The project does build and test with the
> > patch applied.
> >
> > If you think this is a worthwhile change I'll submit it on JIRA. If
> > not, no problem and I'll work on not being confused. (Or if there's a
> > reason for both the interface and the abstract class that I'm not aware
> > of, please let me know!)
> >
> > Thanks,
> > Jeff
> >
>


AdaptiveFeatureGenerator and FeatureGeneratorAdapter

2016-06-27 Thread Jeffrey Zemerick
Hi all,

Under the package opennlp.tools.util.featuregen there is an
interface AdaptiveFeatureGenerator and an abstract
class FeatureGeneratorAdapter. The interface defines
the createFeatures(), updateAdaptiveData(), and clearAdaptiveData()
methods. The abstract class implements this interface to provide default
implementations of the updateAdaptiveData() and clearAdaptiveData()
functions. Feature generators then either implement the interface or extend
the abstract class.

I created a patch to refactor these classes to remove the interface and use
the abstract class. (My motivation was that I kept getting
AdaptiveFeatureGenerator and FeatureGeneratorAdapter confused due to their
similar naming and the inconsistency of feature generators either
implementing or extending.) The project does build and test with the patch
applied.

If you think this is a worthwhile change I'll submit it on JIRA. If not, no
problem and I'll work on not being confused. (Or if there's a reason for
both the interface and the abstract class that I'm not aware of please let
me know!)

Thanks,
Jeff


Re: Performance of OpenNLP tools

2016-06-20 Thread Jeffrey Zemerick
I saw the same question on the users list on June 17. At least I thought it
was the same question -- sorry if it wasn't.

On Mon, Jun 20, 2016 at 11:37 AM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Well, hold on. He sent that mail (as of the time of this mail) 4
> mins previously. Maybe some folks need some time to reply ^_^
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++
>
>
> On 6/20/16, 8:23 AM, "Jeffrey Zemerick"  wrote:
>
> >Hi Mondher,
> >
> >Since you didn't get any replies I'm guessing no one is aware of any
> >resources related to what you need. Google Scholar is a good place to look
> >for papers referencing OpenNLP and its methods (in case you haven't
> >searched it already).
> >
> >Jeff
> >
> >On Mon, Jun 20, 2016 at 11:19 AM, Mondher Bouazizi <
> >mondher.bouaz...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> Apologies if you received multiple copies of this email. I sent it to
> >> the users list a while ago and haven't had an answer yet.
> >>
> >> I have been looking for a while to see whether there is any relevant
> >> work that has tested the OpenNLP tools (in particular the Lemmatizer,
> >> Tokenizer, and PoS-Tagger) when used with short and noisy texts such
> >> as Twitter data, etc., and/or compared them to other libraries.
> >>
> >> By performance, I mean accuracy/precision rather than time of
> >> execution, etc.
> >>
> >> If anyone can refer me to a paper or a work done in this context, that
> >> would be of great help.
> >>
> >> Thank you very much.
> >>
> >> Mondher
> >>
>


Re: Performance of OpenNLP tools

2016-06-20 Thread Jeffrey Zemerick
Hi Mondher,

Since you didn't get any replies I'm guessing no one is aware of any
resources related to what you need. Google Scholar is a good place to look
for papers referencing OpenNLP and its methods (in case you haven't
searched it already).

Jeff

On Mon, Jun 20, 2016 at 11:19 AM, Mondher Bouazizi <
mondher.bouaz...@gmail.com> wrote:

> Hi,
>
> Apologies if you received multiple copies of this email. I sent it to the
> users list a while ago, and haven't had an answer yet.
>
> I have been looking for a while to see whether there is any relevant work
> that has tested the OpenNLP tools (in particular the Lemmatizer,
> Tokenizer, and PoS-Tagger) when used with short and noisy texts such as
> Twitter data, etc., and/or compared them to other libraries.
>
> By performance, I mean accuracy/precision rather than time of execution,
> etc.
>
> If anyone can refer me to a paper or a work done in this context, that
> would be of great help.
>
> Thank you very much.
>
> Mondher
>


Re: Surrounding tokens of the entity on MaxEnt models

2016-05-01 Thread Jeffrey Zemerick
I'm sure the others on this list can give you a more complete answer, so I
will try not to lead you astray.

The WindowFeatureGenerator is only one of the available feature generators.
There are many classes that implement the AdaptiveFeatureGenerator
interface [1], and you can, of course, provide your own implementation of
that interface to support additional features. For example, the
SentenceFeatureGenerator [2] looks at the beginning and end of each
training sentence. So to answer your question, the length of the training
sentence should not matter; what matters is whether the combination of
configured feature generators can produce a model that accurately
describes the training text.

Jeff

[1]
https://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/util/featuregen/AdaptiveFeatureGenerator.html
[2]
https://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/util/featuregen/SentenceFeatureGenerator.html
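
As a standalone illustration of the "provide your own implementation" idea above, here is a toy feature method that mirrors the shape of createFeatures (this is not the real OpenNLP interface; the feature name is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class CustomGeneratorSketch {

    // Sketch of a custom feature: flag tokens that start with a capital
    // letter. A real generator would implement
    // opennlp.tools.util.featuregen.AdaptiveFeatureGenerator instead.
    static List<String> featuresFor(String[] tokens, int index) {
        List<String> features = new ArrayList<>();
        if (Character.isUpperCase(tokens[index].charAt(0))) {
            features.add("initcap");
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(featuresFor(new String[] {"I", "am", "Barack"}, 2));
    }
}
```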


On Sun, May 1, 2016 at 12:02 PM, Damiano Porta 
wrote:

> Hi Jeff!
> Thank you so much for your fast reply.
>
> I have a doubt. Let's suppose we use this feature with a window of:
>
> 2 tokens on the left + *ENTITY* + 2 tokens on the right
>
> The doubt is: how can I train the model correctly?
>
> If only the previous 2 tokens and the next 2 tokens matter, I should not
> use long sentences to train the model. Right?
>
> For example (person-model.train):
>
> 1. I am  Barack  and I am the president of USA
>
> 2. My name is  Barack  and my surname is Obama
>
> ...
>
> Those are two simple training samples, just to illustrate my doubt.
>
> In this case i should have:
>
> *I am Barack and I*
>
> *name is Barack and my*
>
> the other tokens (left and right) do not matter. So the sentences in my
> training set should be very short, right? Basically, I should only define
> all the "combinations" of the previous/next 2 tokens, right?
>
> Thank you!
> Damiano
>
>
>
> 2016-05-01 16:07 GMT+02:00 Jeffrey Zemerick :
>
> > I think you are looking for the WindowFeatureGenerator [1]. You can set
> > the size of the window by specifying the number of previous tokens and
> > the number of next tokens.
> >
> > Jeff
> >
> > [1]
> >
> >
> https://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/util/featuregen/WindowFeatureGenerator.html
> >
> >
> > On Sun, May 1, 2016 at 5:16 AM, Damiano Porta 
> > wrote:
> > >
> > > Hello everybody
> > > How many surrounding tokens are taken into account to find the entity
> > > using a maxent model?
> > > Basically, a maxent model should detect an entity by looking at the
> > > surrounding tokens, right?
> > > I would like to understand if:
> > >
> > > 1. Can I set the number of tokens on the left side?
> > > 2. Can I set the number of tokens on the right side too?
> > >
> > > Thank you in advance for the clarification
> > > Best
> > >
> > > Damiano
> >
>


Re: Surrounding tokens of the entity on MaxEnt models

2016-05-01 Thread Jeffrey Zemerick
I think you are looking for the WindowFeatureGenerator [1]. You can set the
size of the window by specifying the number of previous tokens and number
of next tokens.

Jeff

[1]
https://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/util/featuregen/WindowFeatureGenerator.html
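
A standalone sketch of what window-style features look like with a configurable number of previous and next tokens, analogous to the 2-left / 2-right example discussed in this thread (feature-name prefixes are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class WindowSketch {

    // Sketch: emit neighbor tokens within a configurable window around
    // the token at `index`, clamped at the sentence boundaries.
    static List<String> window(String[] tokens, int index, int prev, int next) {
        List<String> feats = new ArrayList<>();
        for (int i = Math.max(0, index - prev); i < index; i++) {
            feats.add("prev=" + tokens[i]);
        }
        for (int i = index + 1; i <= Math.min(tokens.length - 1, index + next); i++) {
            feats.add("next=" + tokens[i]);
        }
        return feats;
    }

    public static void main(String[] args) {
        String[] s = {"My", "name", "is", "Barack", "and", "my", "surname"};
        // Window of 2 previous and 2 next tokens around "Barack" (index 3).
        System.out.println(window(s, 3, 2, 2));
    }
}
```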


On Sun, May 1, 2016 at 5:16 AM, Damiano Porta 
wrote:
>
> Hello everybody
> How many surrounding tokens are taken into account to find the entity
> using a maxent model?
> Basically, a maxent model should detect an entity by looking at the
> surrounding tokens, right?
> I would like to understand if:
>
> 1. Can I set the number of tokens on the left side?
> 2. Can I set the number of tokens on the right side too?
>
> Thank you in advance for the clarification
> Best
>
> Damiano


OPENNLP-837

2016-03-11 Thread Jeffrey Zemerick
Hi all,

I attached a patch to OPENNLP-837
(https://issues.apache.org/jira/browse/OPENNLP-837). With the patch, if the
number of unique training events is zero, a new exception
(InsufficientTrainingDataException) is thrown and model creation halts. As
before, if you have any comments or want anything done differently, please
let me know!
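
A minimal standalone sketch of the fail-fast idea described above (not the actual patch; the check and helper names are illustrative):

```java
public class TrainingCheck {

    // Sketch of a descriptive exception to throw instead of letting
    // training proceed and fail obscurely later.
    static class InsufficientTrainingDataException extends RuntimeException {
        InsufficientTrainingDataException(String msg) { super(msg); }
    }

    // True when there is at least one unique training event.
    static boolean hasSufficientData(int uniqueEventCount) {
        return uniqueEventCount > 0;
    }

    // Fail fast before model creation starts.
    static void checkEvents(int uniqueEventCount) {
        if (!hasSufficientData(uniqueEventCount)) {
            throw new InsufficientTrainingDataException(
                "Insufficient training data: no unique events");
        }
    }

    public static void main(String[] args) {
        try {
            checkEvents(0);
        } catch (InsufficientTrainingDataException e) {
            System.out.println(e.getMessage());
        }
    }
}
```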

Thanks,
Jeff


OPENNLP-488

2016-03-04 Thread Jeffrey Zemerick
Hi,

I attached a patch to OPENNLP-488 to remove the NPE when training with
no events. I'm still an OpenNLP beginner, so all feedback is welcome.

Thanks,
Jeff