Re: How to sanitize and parse noisy text

2014-07-15 Thread William Colen
A while back I had a similar problem while extracting text from HTML using Tika. What I did was hack the Tika HTML parser to extract the text as I needed. I can't remember exactly how it worked, but as far as I remember Tika raises events when it finds markup (at least HTML markup), that is no

Re: TokenNameFinder and Span probs

2014-05-11 Thread William Colen
probs > > 4. Have the sentencedetectorME return its spans with a prob, add probs() > > method to the SentenceDetector interface, and deprecate the > > getSentenceProbabilities... > > > > Thoughts? > > > -- William Colen

Re: DocumentSample in Doccat

2014-04-28 Thread William Colen
eryone agrees I can commit them (with some > java docs) > > > > > > On Sat, Apr 26, 2014 at 8:39 AM, Jörn Kottmann wrote: > > > On Thu, 2014-04-24 at 19:54 -0300, William Colen wrote: > > > Yes, it looks nice. Maybe we should redo all the DocumentCategorizer >

Re: DocumentSample in Doccat

2014-04-24 Thread William Colen
> } > > perhaps we should consider adding this method to abstract some > details... just a thought > > > > > > On Thu, Apr 24, 2014 at 3:56 PM, William Colen > wrote: > > > What do you think of adding the following field to the DocumentSample? >

Re: DocumentSample in Doccat

2014-04-24 Thread William Colen
I had a Postgres and Accumulo impl for sample storage. > just a thought, I know this can get very specific and complicated, thought > we may be able to find a middle ground by providing a framework and some > generic impls. > MG > > > On Thu, Apr 17, 2014 at 8:28 AM, William C

Re: End of line whitespaces in Eclipse

2014-04-24 Thread William Colen
y changes. > > Any opinions? > > Jörn > > On Thu, 2014-04-10 at 19:58 -0300, William Colen wrote: > > When I save a .java file in Eclipse, it is removing the end of line > > whitespaces. I am using the > > http://opennlp.apache.org/code-formatter/OpenNLP-Eclipse-Fo

Re: DocumentSample in Doccat

2014-04-17 Thread William Colen
umentsample object with a > generic Map it would be helpful in some cases and not constraining > > Sent from my iPhone > > > On Apr 17, 2014, at 6:35 AM, Jörn Kottmann wrote: > > > >> On 04/15/2014 07:45 PM, William Colen wrote: > >> Hello, > >>

Re: svn commit: r1587969 - /opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/doccat/NGramFeatureGenerator.java

2014-04-16 Thread William Colen
What do you think of this change? This can break compatibility with old Doccat models created using the NGramFeatureGenerator. But probably the old models are not working anyway. Thank you William 2014-04-16 13:39 GMT-03:00 : > Author: colen > Date: Wed Apr 16 16:39:40 2014 > New Revision: 158

Re: svn commit: r1587944 [1/2] - in /opennlp/trunk/opennlp-tools/src: main/java/opennlp/tools/cmdline/doccat/ main/java/opennlp/tools/doccat/ main/java/opennlp/tools/sentdetect/ main/java/opennlp/tool

2014-04-16 Thread William Colen
Jörn, Can you please review my change to the ExtensionLoader? I modified it to accept singletons (private constructor and the field INSTANCE). Thank you, William 2014-04-16 12:26 GMT-03:00 : > Author: colen > Date: Wed Apr 16 15:26:24 2014 > New Revision: 1587944 > > URL: http://svn.apache.org
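The singleton handling mentioned above (a private constructor plus an INSTANCE field) could look roughly like the sketch below. It assumes the loader first looks for a public static field named INSTANCE and only falls back to a no-argument constructor when that field is absent. The names SingletonAwareLoader, instantiate, Greeter, SingletonGreeter and PlainGreeter are hypothetical and are not taken from the actual ExtensionLoader code.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical sketch; this is NOT the actual OpenNLP ExtensionLoader code.
final class SingletonAwareLoader {

    private SingletonAwareLoader() {}

    /** Returns impl's public static INSTANCE if it declares one, else a new instance. */
    static <T> T instantiate(Class<T> type, Class<? extends T> impl) {
        try {
            try {
                Field field = impl.getDeclaredField("INSTANCE");
                int mods = field.getModifiers();
                if (Modifier.isPublic(mods) && Modifier.isStatic(mods)) {
                    return type.cast(field.get(null)); // reuse the singleton
                }
            } catch (NoSuchFieldException e) {
                // No INSTANCE field: fall through to the constructor path.
            }
            return type.cast(impl.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not instantiate " + impl.getName(), e);
        }
    }
}

// Illustrative extension types (assumptions, for demonstration only).
interface Greeter {
    String greet();
}

final class SingletonGreeter implements Greeter {
    public static final SingletonGreeter INSTANCE = new SingletonGreeter();

    private SingletonGreeter() {}

    public String greet() {
        return "singleton";
    }
}

final class PlainGreeter implements Greeter {
    public PlainGreeter() {}

    public String greet() {
        return "plain";
    }
}
```

With this shape, a class that exposes INSTANCE is never constructed a second time, while ordinary extensions keep working through their no-argument constructor.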

DocumentSample in Doccat

2014-04-15 Thread William Colen
Hello, I've been working with the Doccat module and I am wondering if we could improve its data structure for the 1.6.0 release. Today the DocumentSample has the following attributes: - String category - List text I would suggest adding an attribute to hold metadata, or additional contexts info
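As a rough illustration of the proposal above, a sample type carrying a category, its text tokens and an extra metadata holder might be sketched as follows. The class name DocSampleSketch, the field name extraInformation and the choice of Map<String, Object> are assumptions made for illustration; the thread only suggests "an attribute to hold metadata" and later mentions a generic Map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the actual OpenNLP DocumentSample class.
final class DocSampleSketch {

    private final String category;
    private final List<String> text;
    private final Map<String, Object> extraInformation; // assumed name and type

    DocSampleSketch(String category, List<String> text, Map<String, Object> extraInformation) {
        this.category = category;
        // Defensive copies keep the sample immutable after construction.
        this.text = Collections.unmodifiableList(new ArrayList<>(text));
        this.extraInformation = extraInformation == null
                ? Collections.<String, Object>emptyMap()
                : Collections.unmodifiableMap(new LinkedHashMap<>(extraInformation));
    }

    String getCategory() {
        return category;
    }

    List<String> getText() {
        return text;
    }

    Map<String, Object> getExtraInformation() {
        return extraInformation;
    }
}
```

A generic map keeps the sample open to arbitrary context (source URL, author, date) without constraining feature generators to a fixed schema.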

Re: Doccat evaluator

2014-04-11 Thread William Colen
stderr to print the misclassified documents. If reportOutputFile is set, the evaluator will print some detailed reports to it, for example the f-measure for the different outcomes and the confusion matrix. 2014-04-10 19:48 GMT-03:00 William Colen : > Yes, I just finished implementing the confus

End of line whitespaces in Eclipse

2014-04-10 Thread William Colen
When I save a .java file in Eclipse, it is removing the end of line whitespaces. I am using the http://opennlp.apache.org/code-formatter/OpenNLP-Eclipse-Formatter.xml This is causing lots of changes in files where I actually needed to change only one line. Does anybody know how I can avoid it? Thank you,

Re: Doccat evaluator

2014-04-10 Thread William Colen
n 04/10/2014 03:00 PM, William Colen wrote: > >> Actually, since we always add a tag to each document, accuracy makes >> sense. >> We could implement F-1 for the individual categories. >> >> 2014-04-09 17:23 GMT-03:00 William Colen : >> >> Hello, >

Re: Doccat evaluator

2014-04-10 Thread William Colen
Actually, since we always add a tag to each document, accuracy makes sense. We could implement F-1 for the individual categories. 2014-04-09 17:23 GMT-03:00 William Colen : > Hello, > > I was checking if there is any open issue related to Doccat, and I found > this one - > >
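The two measures discussed above can be sketched as follows: accuracy over all documents (each document gets exactly one category) and F1 computed separately for a single category. This is a minimal, self-contained illustration; the class and method names are hypothetical and it does not reproduce the DocumentCategorizerEvaluator.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the metrics discussed in the thread.
final class DoccatMetricsSketch {

    private DoccatMetricsSketch() {}

    /** Fraction of documents whose predicted category equals the reference one. */
    static double accuracy(List<String> reference, List<String> predicted) {
        if (reference.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < reference.size(); i++) {
            if (reference.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return (double) correct / reference.size();
    }

    /** F1 for one category: 2*TP / (2*TP + FP + FN). */
    static double f1(String category, List<String> reference, List<String> predicted) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < reference.size(); i++) {
            boolean inRef = category.equals(reference.get(i));
            boolean inPred = category.equals(predicted.get(i));
            if (inRef && inPred) {
                tp++;
            } else if (inPred) {
                fp++;
            } else if (inRef) {
                fn++;
            }
        }
        int denominator = 2 * tp + fp + fn;
        return denominator == 0 ? 0.0 : 2.0 * tp / denominator;
    }

    public static void main(String[] args) {
        List<String> reference = Arrays.asList("a", "a", "b", "b");
        List<String> predicted = Arrays.asList("a", "b", "b", "b");
        System.out.println("accuracy = " + accuracy(reference, predicted));
        System.out.println("F1(a) = " + f1("a", reference, predicted));
        System.out.println("F1(b) = " + f1("b", reference, predicted));
    }
}
```

Accuracy is meaningful here because every document receives a label; per-category F1 additionally shows which categories are being confused.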

Doccat evaluator

2014-04-09 Thread William Colen
Hello, I was checking if there is any open issue related to Doccat, and I found this one - OPENNLP-81: Add a cli tool for the doccat evaluation support I noticed that there is already a class named DocumentCategorizerEvaluator, which is not used anywhere internally. This is evaluating performanc

Re: CoNLL02 format issue

2014-03-12 Thread William Colen
If it helps, there is another Spanish corpus at CONLL02 page which has 3 fields: "Xavier Carreras provides the Spanish data sets with part of speech tags (20030803)" William 2014-03-12 9:43 GMT-03:00 Roque Vera : > I found an issue in TokenNa

Re: Sequence coding

2014-02-19 Thread William Colen
Is the SequenceValidator the only thing we need to change? If a corpus uses BILOU, do the formatters need to convert it to IOB2? 2014-02-19 7:01 GMT-03:00 Jörn Kottmann : > Hi all, > > the chunker and name finder both use IOB2 sequence coding. The logic > to do that is hard coded in both component
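For reference, converting BILOU tags to IOB2 is a simple per-tag mapping: L- (last) becomes I-, U- (unit) becomes B-, and B-, I- and O stay as they are. The sketch below assumes tags written as a prefix, a hyphen and a type (for example "U-PER"); the class and method names are hypothetical, and the thread does not confirm whether the OpenNLP formatters perform this conversion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; tag format "<prefix>-<type>" is an assumption.
final class BilouToIob2Sketch {

    private BilouToIob2Sketch() {}

    static List<String> toIob2(List<String> bilouTags) {
        List<String> iob2 = new ArrayList<>(bilouTags.size());
        for (String tag : bilouTags) {
            if (tag.startsWith("L-")) {
                iob2.add("I-" + tag.substring(2)); // last token is just "inside" in IOB2
            } else if (tag.startsWith("U-")) {
                iob2.add("B-" + tag.substring(2)); // single-token entity begins (and ends) here
            } else {
                iob2.add(tag); // B-, I- and O are shared by both schemes
            }
        }
        return iob2;
    }

    public static void main(String[] args) {
        System.out.println(toIob2(Arrays.asList("U-PER", "O", "B-LOC", "I-LOC", "L-LOC")));
    }
}
```

The mapping loses the explicit end-of-entity marker, so the reverse direction (IOB2 to BILOU) needs a look-ahead at the next tag.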

Re: Addons

2013-11-15 Thread William Colen
Nice! How will we proceed with addons which depend on incompatible licenses, such as GPL? For example, an extension to use Mophologic as a POS Dictionary, or Weka as an ML engine? 2013/11/15 Jörn Kottmann > Hello, > > the addons should be located in a folder next to the sandbox folder, > we wil

Re: svn commit: r1534864 - /opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/GeoHashBinScorer.java

2013-10-23 Thread William Colen
ove directly to Java 7 or even wait for Java 8. > > > > > > There are not that many interesting new features in Java 6, that's why I > > > believe it might > > > be worth making a bigger step to avoid one or two versions. > > > > > > Any opinions? Do we still have a Java 5 user here? > > > > > > Jörn > > > > > > -- William Colen

Re: New OpenNLP committer

2013-10-02 Thread William Colen
Welcome Mark! Congratulations! 2013/10/2 Jörn Kottmann > Hi, > > Please welcome Mark Giaconia as the latest new OpenNLP committer! > > Jörn >

Re: Pluggable Machine Learning support

2013-05-30 Thread William Colen
ackage as part of this issue? Moving the things would avoid a second > interface layer and > probably make using OpenNLP Tools a bit easier, because then we are down > to a single jar. > > Jörn > > > On 05/30/2013 08:57 PM, William Colen wrote: > >> +1 to add pluggab

Re: OPENNLP-579

2013-05-30 Thread William Colen
I like the second approach. Span[] find(String text, Span sentences[], Span tokens[]) looks like it would be easier to use. Maybe we could add a new tokenize method in Tokenizer which takes the sentence offset and outputs spans with this offset included. I could not understand what you mean wi
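The idea of a tokenize method that takes the sentence offset can be sketched with a plain whitespace tokenizer that shifts every token span by the sentence's start position, so the spans index into the full document text. This is only an illustration: spans are represented as int[] {start, end} pairs instead of OpenNLP's Span, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not part of the OpenNLP Tokenizer interface.
final class OffsetTokenizerSketch {

    private OffsetTokenizerSketch() {}

    /**
     * Splits the sentence on whitespace and returns {start, end} pairs
     * (end exclusive) relative to the document, not to the sentence.
     */
    static List<int[]> tokenize(String sentence, int sentenceOffset) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // -1 means "not inside a token"
        for (int i = 0; i <= sentence.length(); i++) {
            boolean boundary = i == sentence.length()
                    || Character.isWhitespace(sentence.charAt(i));
            if (!boundary && start < 0) {
                start = i; // token begins
            } else if (boundary && start >= 0) {
                spans.add(new int[] {sentenceOffset + start, sentenceOffset + i});
                start = -1; // token ends
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Hi there. Bye now.";
        int sentenceStart = 10; // start of the second sentence in text
        for (int[] span : tokenize(text.substring(sentenceStart), sentenceStart)) {
            System.out.println(span[0] + ".." + span[1] + " " + text.substring(span[0], span[1]));
        }
    }
}
```

Because the spans are document-relative, a name finder receiving them can report entity positions directly against the original text.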

Re: Pluggable Machine Learning support

2013-05-30 Thread William Colen
+1 to add pluggable machine learning algorithms +1 to improve the API and remove deprecated methods in 1.6.0 You can assign related Jira issues to me and I will be glad to help. On Thu, May 30, 2013 at 11:53 AM, Jörn Kottmann wrote: > Hi all, > > we spoke about it here and there already, to en

Re: OPENNLP-579

2013-05-23 Thread William Colen
It is a very nice contribution! I am looking forward to helping extend it to other entity types. On Thu, May 23, 2013 at 10:04 AM, Jörn Kottmann wrote: > Hi all, > > please have a look at > https://issues.apache.org/jira/browse/OPENNLP-579 > >

Re: Size of training data

2013-04-26 Thread William Colen
From the command line you can specify the memory using MAVEN_OPTS="-Xmx4048m". You can also set it as a JVM argument if you are using the API: java -Xmx4048m ... On Fri, Apr 26, 2013 at 4:30 AM, Svetoslav Marinov < svetoslav.mari...@findwise.com> wrote: > I use the API. Can one specify the memor
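When training through the API, one way to confirm that the -Xmx setting actually reached the JVM is to print the maximum heap size at startup. The small check below uses only the standard Runtime API; the class name HeapCheck is hypothetical.

```java
// Hypothetical helper for verifying the effective heap limit.
final class HeapCheck {

    private HeapCheck() {}

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run with, for example: java -Xmx4048m HeapCheck
        System.out.println("Max heap (MB): " + maxHeapMegabytes());
    }
}
```

The reported value is typically somewhat below the -Xmx figure, because the JVM reserves part of the heap for its own bookkeeping.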

Re: Proposal: Move the coref component into the sandbox

2013-04-18 Thread William Colen
+1 On Wed, Apr 17, 2013 at 11:07 PM, Jason Baldridge wrote: > +1 to doing this. I already removed that from Chalk for similar reasons. > Also, the best way to do coreference these days is to build on the > rule-based sieve approach given in this paper: > > http://www.mitpressjournals.org/doi/abs

Re: Please review the 1.5.3 release announcement

2013-04-17 Thread William Colen
le. > > I already promoted the maven repo and pushed the release to the dist area, > but it might take a bit until everything > is available, the mirrors might need 24 hours to mirror the distributables. > > Jörn > > > On 04/15/2013 09:46 PM, William Colen wrote: > >

Please review the 1.5.3 release announcement

2013-04-15 Thread William Colen
Hello, Please review the release announcement for the OpenNLP version 1.5.3. https://cwiki.apache.org/confluence/display/OPENNLP/ReleasePlanAndTasks1.5.3 The announce mails will be sent to the users and announce@apache lists. Thank you, William

Re: [VOTE] Release OpenNLP 1.5.3 RC 3

2013-04-15 Thread William Colen
The vote passed. We received 4 +1 binding votes and no -1 votes. The following people voted: +1 Jason Baldridge (binding) +1 James Kosin (binding) +1 Jörn Kottmann (binding) +1 William Colen (binding) Thank you for voting! On Tue, Apr 9, 2013 at 7:44 PM, James Kosin wrote: > +1 > >

Re: [VOTE] Release OpenNLP 1.5.3 RC 3

2013-04-09 Thread William Colen
+1 Approve the release On Tue, Apr 9, 2013 at 9:51 AM, William Colen wrote: > Hello, > > Let's vote to release RC 3 as OpenNLP 1.5.3. > > The testing of it is documented here: > https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.5.3 > > The RC can be

[VOTE] Release OpenNLP 1.5.3 RC 3

2013-04-09 Thread William Colen
Hello, Let's vote to release RC 3 as OpenNLP 1.5.3. The testing of it is documented here: https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.5.3 The RC can be downloaded here: http://people.apache.org/~colen/releases/opennlp-1.5.3/rc3 Please vote to approve this release: [ ] +1 Approv

Re: OpenNLP 1.5.3 RC 3 ready for testing

2013-04-09 Thread William Colen
+1 What about the similarity component? Should we build it only after the 1.5.3 release? William Colen On Tue, Apr 9, 2013 at 5:19 AM, Jörn Kottmann wrote: > Are we ready to start with the release vote? The last important test item > that is missing > is checking the signatures,

OpenNLP 1.5.3 RC 3 ready for testing

2013-04-03 Thread William Colen
Hi all, Our third release candidate is ready for testing. RC 2 failed only a few tests, including the creation of the issues list and the NOTICE file. Also, some new bug fixes and improvements were recently included. The RC 3 can be downloaded from here: http://people.apache.org/~colen

Re: OpenNLP 1.5.3 RC 2 ready for testing

2013-04-03 Thread William Colen
> > > On 04/03/2013 01:23 PM, William Colen wrote: > >> Thank you, I fixed it. I will start the build of RC3 right now. >> >> >> >> On Wed, Apr 3, 2013 at 5:01 AM, Jörn Kottmann wrote: >> >> On 04/03/2013 02:10 AM, William Colen wrote: >

Re: OpenNLP 1.5.3 RC 2 ready for testing

2013-04-03 Thread William Colen
Thank you, I fixed it. I will start the build of RC3 right now. On Wed, Apr 3, 2013 at 5:01 AM, Jörn Kottmann wrote: > On 04/03/2013 02:10 AM, William Colen wrote: > >> Thank you, Jörn. >> >> I also had to update the maven-changes-plugin version. The 2.3 was failing

Re: OpenNLP 1.5.3 RC 2 ready for testing

2013-04-02 Thread William Colen
/12319040> > > I already updated the pom and committed the change. > > > Jörn > > On 03/08/2013 03:11 PM, William Colen wrote: > >> Hi all, >> >> Our second release candidate is ready for testing. RC1 failed to pass the >> initial quality check. >

Re: Doccat : Different tokenizers for training and categorizing?

2013-03-28 Thread William Colen
In my opinion you are right. It would be safer to use the whitespace tokenizer rather than SimpleTokenizer. But I could not check if DoccatTrainerTool is using the whitespace tokenizer. Actually, the only DocumentSample provider we have today is the one that reads the Leipzig corpus, and as far as I know it uses the

Re: OpenNLP 1.5.3 RC 2 ready for testing

2013-03-22 Thread William Colen
AM, Jörn Kottmann wrote: > Hello, > > do we have any public data we can test the sentence detector and tokenizer > on? > It would be nice to remove the private data test for these at some point. > > Jörn > > > On 03/08/2013 03:11 PM, William Colen wrote: >

Re: OpenNLP 1.5.3 RC 2 ready for testing

2013-03-14 Thread William Colen
James > > > > On 3/8/2013 9:11 AM, William Colen wrote: > >> Hi all, >> >> Our second release candidate is ready for testing. RC1 failed to pass the >> initial quality check. >> >> The RC 2 can be downloaded from here: >> http://people.apache.org/~

Re: Re: English 300k sentences Leipzig Corpus for test

2013-03-14 Thread William Colen
inal Message > Subject: Re: English 300k sentences Leipzig Corpus for test > Date: Thu, 14 Mar 2013 09:48:21 -0300 > From: William Colen > To: Jörn Kottmann > > > > Yes, you can forward. > > It is not clear to me how to convert it. I could only find convert

OpenNLP 1.5.3 RC 2 ready for testing

2013-03-08 Thread William Colen
Hi all, Our second release candidate is ready for testing. RC1 failed to pass the initial quality check. The RC 2 can be downloaded from here: http://people.apache.org/~colen/releases/opennlp-1.5.3/rc2/ To use it in a maven build set the version for opennlp-tools or opennlp-uima to 1.5.3, and fo

Uploading JWNL to Maven Central Repo

2013-02-19 Thread William Colen
lease 1.3 rc3 they distribute? Thank you, William On Tue, Feb 19, 2013 at 11:09 AM, Benson Margulies wrote: > yes, ossrh will do that > > On Feb 19, 2013, at 8:38 AM, William Colen > wrote: > > > Should we try to upload it to Central Repo using "jwnl"

Re: Next release

2013-02-19 Thread William Colen
Should we try to upload it to Central Repo using "jwnl" as groupid? What do you think? On Mon, Feb 18, 2013 at 3:03 PM, Benson Margulies wrote: > upload to central via ossrh. > > On Feb 18, 2013, at 12:46 PM, William Colen > wrote: > > > We are using "jwnl

Re: Next release

2013-02-18 Thread William Colen
eleased OpenNLP versions. On Mon, Feb 18, 2013 at 12:07 PM, Jörn Kottmann wrote: > On 02/18/2013 03:17 PM, William Colen wrote: > >> I suppose we can't use opennlp.apache.org to host it, can we? >> >> > We probably could somehow distribute it from Apache servers

Re: Next release

2013-02-18 Thread William Colen
I suppose we can't use opennlp.apache.org to host it, can we? On Mon, Feb 18, 2013 at 10:57 AM, Jörn Kottmann wrote: > On 02/18/2013 02:07 AM, Lance Norskog wrote: > >> OPENNLP-510 Maven dependency on jwnl is broken >> >> The version of JWNL used in coreference does not have an available >> Mav

Re: Next release

2013-02-18 Thread William Colen
With jwnl 1.4_rc3 the code at least compiles. Also, it would be nice if someone familiar with the Coreference module could add some tests to the test plan: https://cwiki.apache.org/OPENNLP/testplan153.html On Sun, Feb 17, 2013 at 10:07 PM, Lance Norskog wrote: > OPENNLP-510 Maven dependency o

Tasks distribution for release 1.5.3

2013-02-15 Thread William Colen
Hi, Following what we did for 1.5.2, I prepared in our wiki a table with tasks related to the release process. If you can contribute to the release process, please select some tasks from the table by adding yourself as responsible for that task. http://cwiki.apache.org/confluence/display/O

Re: Next release

2013-02-14 Thread William Colen
. > Regards Boris > > > > > > > Date: Thu, 14 Feb 2013 13:57:06 +0100 > > From: kottm...@gmail.com > > To: dev@opennlp.apache.org > > Subject: Re: Next release > > > > On 02/14/2013 01:31 PM, William Colen wrote: > > > I can be the R

Re: Next release

2013-02-14 Thread William Colen
Hi! I can be the Release Manager for 1.5.3. It would be nice because Jörn was the Release Manager for all the other releases and we should have other members of the team familiar with the process. I would like to nominate myself as the Release Manager for 1.5.3. I will start building our first

Re: Migrate to Git?

2012-12-19 Thread William Colen
+1 to move after 1.5.3 William Colen On Wed, Dec 19, 2012 at 10:09 PM, James Kosin wrote: > I've used both > > Only thing is I find SVN a little easier on the beginner. > Git has many options that aren't so obvious as to the purpose... until > you get your hands

Re: Next release

2012-09-12 Thread William Colen
Yes, I am also using the trunk for my projects. I still need to implement the customization factory for the chunker. It is important to me to have it in the next release. I will start working on it today and it should be ready by the end of the week. Thanks William On Wed, Sep 12, 2012 a

Re: BaseToolFactory should use the ExtensionLoader

2012-08-03 Thread William Colen
ed to create the factory programmatically. On Tue, Jul 17, 2012 at 9:33 AM, Jörn Kottmann wrote: > On 07/14/2012 05:35 AM, William Colen wrote: > >> I think it is better now, but I am not happy with the fact that we have >> two >> constructors in each tool factory. >

Re: BaseToolFactory should use the ExtensionLoader

2012-07-13 Thread William Colen
because of the reflection. Thanks, William On Fri, Jul 13, 2012 at 10:24 AM, Jörn Kottmann wrote: > Nice, that's the OSGi issue: > https://issues.apache.org/jira/browse/OPENNLP-500 > > Jörn > > > On 07/13/2012 03:13 PM, Wi

Re: BaseToolFactory should use the ExtensionLoader

2012-07-13 Thread William Colen
Yes, thanks. I will do it. On Fri, Jul 13, 2012 at 10:08 AM, Jörn Kottmann wrote: > Hello, > > yes that would work. Why don't we define the init method on the > BaseToolFactory, > then it could be called via theFactory.init? > > Jörn > > > On 07/13/2012 03:04

Re: BaseToolFactory should use the ExtensionLoader

2012-07-13 Thread William Colen
theFactory, artifactProvider); } catch . } } return theFactory; } On Fri, Jul 13, 2012 at 9:01 AM, Jörn Kottmann wrote: > On 07/13/2012 01:47 PM, William Colen wrote: > >> I've been postponing working on the OSGi support for the BaseToolFactory, >> I

Re: BaseToolFactory should use the ExtensionLoader

2012-07-13 Thread William Colen
Hi Jörn, I've been postponing working on the OSGi support for the BaseToolFactory, I am sorry. I managed to have some free time this week, so I can do it right now if you didn't start it already. Also, before releasing 1.5.3, I would like to have the factory mechanism available in the Chunker. I

POS Tagger dictionary and FSA dictionaries

2012-07-08 Thread William Colen
Hi, I have a large tagger dictionary (~600k words), and it is taking a long time (~45 seconds) to load it with the current implementation. Part of the time is spent loading the XML into memory, and the other part is spent validating the tags, checking whether they are known by the model. Maybe we should change

Re: Interested in contributing

2012-06-03 Thread William Colen
Hi, On Sun, Jun 3, 2012 at 7:19 AM, Jim - FooBar(); wrote: > On 03/06/12 07:31, Eranga Mapa wrote: > >> When I try to run command: mvn3 install inside openNLP-trunk directory, it >> fires an error >> saying "The goal you specified requires a project to execute but there is >> no POM in this dire

Re: Unicode danda in sentence detector

2012-05-30 Thread William Colen
As far as I know you don't need a CLA for a patch. Simply open a Jira and attach your patch to it. Besides what James pointed out, you may also want to change the EOS characters. There are two related new features that are already implemented in the trunk: https://issues.apache.org/jira/browse/OPENNLP-4
