Re: DOAP (Description of a project) file

2013-04-03 Thread Jörn Kottmann
You are still missing a link to your DOAP file here: https://svn.apache.org/repos/asf/infrastructure/site-tools/trunk/projects/files.xml Let me know if I should quickly add you there; otherwise Pei should have access too. Jörn On 03/21/2013 08:15 PM, Mattmann, Chris A (388J) wrote: Great job Ja

Re: ClearNLP POSTagger

2013-04-09 Thread Jörn Kottmann
Would it be possible to run some benchmarks so we know the performance difference between the two? The OpenNLP POS Tagger can be customized: currently it is possible to replace the feature generation. It can probably be optimized for the medical domain; the default feature generation is tuned for

Re: ClearNLP POSTagger

2013-04-09 Thread Jörn Kottmann
On 04/09/2013 10:42 PM, Chen, Pei wrote: Let me know if you get a chance to try it out/run some benchmarks to see how it performs against the current. The OpenNLP POS Tagger has built-in evaluation; if you have test data you could run the evaluator on it, or the cross evaluator if you only have

Re: roadmap for Apache cTakes "big data" processing

2013-04-29 Thread Jörn Kottmann
On 04/29/2013 01:43 AM, Andy McMurry wrote: I encourage committers to check out Apache Mahout https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms Why Apache Mahout? 1. provides ML classifiers and functions not available through UIMA 2. parallel by design, transparently invokes Hadoop 3.

Re: sentence detector newline behavior

2013-05-22 Thread Jörn Kottmann
On 05/21/2013 08:00 PM, Steven Bethard wrote: So perhaps we could re-train it to disambiguate newline characters as well? Yes, the OpenNLP Sentence Detector now supports that in the new 1.5.3 version out of the box, you can specify the set of EOS chars to use, but the default is still: !?. If

Re: sentence detector newline behavior

2013-05-22 Thread Jörn Kottmann
On 05/22/2013 01:17 PM, Miller, Timothy wrote: That's awesome! It might be worth trying at least. How does the training process change? Previously the training data would be one sentence per line, but with newlines as possible mid-sentence characters that could be trouble, is there a new represen

Re: supported java releases in current ctakes development and future releases (3.1 and later)

2013-06-04 Thread Jörn Kottmann
On 06/04/2013 04:34 PM, Masanz, James J. wrote: For reference: the UIMA 2.4.0 documentation [1] says "As of release 2.2.1, the UIMA SDK requires a Java 1.5 level (or later). ... The release has been tested with Java 5 and 6. It has been tested on mainly on Windows XP and Linux Intel 32bit platf

Re: resources for training modules

2013-09-02 Thread Jörn Kottmann
On 08/30/2013 12:07 AM, William Karl Thompson wrote: In case it's relevant, I'm intending to use the brat rapid annotation tool (http://brat.nlplab.org/) to generate a gold standard corpus. The OpenNLP trunk contains parsing code for brat. It should be fairly easy to use it to produce XMI fi

Re: Apache cTAKES > cTAKES 3.1 User Install Guide

2013-09-18 Thread Jörn Kottmann
On 09/18/2013 08:59 PM, Karthik Sarma wrote: I would imagine that ctakes would run fine on a non-english OS configuration... java is supposed to be good at taking care of that (of course, unless we introduce internationalized strings, it'd remain in english). I agree with James that the best gues

Re: Common Type System across systems?

2013-10-01 Thread Jörn Kottmann
On 10/01/2013 02:38 AM, Pei Chen wrote: Richard, I, and a few others had an interesting bar conversation... In the spirit of interoperability, what if we had a baseline common type system that could be reused across UIMA compatible NLP systems? Imagine for a moment that OpenNLP, ClearTK, ClearNLP,

Re: Training new models

2013-11-21 Thread Jörn Kottmann
On 11/20/2013 09:53 PM, Chen, Pei wrote: Re: https://issues.apache.org/jira/browse/CTAKES-268 Joern- could you confirm- I think in the latest OpenNLP versions, you can simply call something like SentenceModel.serialize(outputstream) now to save the models? Yes, exactly, this is how a model in Open

Re: sentence detector newline behavior

2014-01-20 Thread Jörn Kottmann
the training data a user could just use a special tag to identify the chars. Instead of it might be better to use and to encode these two chars in the training data. Any thoughts? I am planning to release this as part of OpenNLP 1.6.0. Thanks, Jörn On 05/22/2013 02:03 PM, Jörn Kottmann

Re: sentence detector newline behavior

2014-01-21 Thread Jörn Kottmann
to the feature representation is necessary. Vj On Monday, January 20, 2014, Jörn Kottmann wrote: Hi all, currently I have quite a bit of time to work on OpenNLP, and would like to help you out with this issue. Here is the follow up issue for this change: https://issues.apache.org/jira/browse/OPENNLP-

Re: sentence detector newline behavior

2014-01-24 Thread Jörn Kottmann
On 01/23/2014 10:06 PM, Tim Miller wrote: Just an FYI, a while back I did some of these annotations myself on MIMIC to get around this issue. I replaced the newline character with a special (non-English) character, then pre-processed ctakes input to replace newlines with that character, then di
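
A minimal sketch of the placeholder approach described in this message, assuming a placeholder character that never occurs in the corpus. The class name, method names, and the choice of U+00B6 are hypothetical illustrations; they are not taken from the cTAKES or OpenNLP code, and the snippet does not say which character was actually used.

```java
public class NewlinePlaceholder {
    // Hypothetical placeholder; any character absent from the corpus would do.
    static final char PLACEHOLDER = '\u00B6';

    // Replace every newline with the placeholder so a one-sentence-per-line
    // training format can carry newlines that occur inside a sentence.
    static String encode(String text) {
        return text.replace('\n', PLACEHOLDER);
    }

    // Restore the original newlines after processing.
    static String decode(String text) {
        return text.replace(PLACEHOLDER, '\n');
    }

    public static void main(String[] args) {
        String note = "BILAT HEMATOMAS.\nNo acute distress";
        String encoded = encode(note);
        // The encoded form contains no newline, and decoding round-trips.
        System.out.println(encoded.indexOf('\n') < 0);
        System.out.println(decode(encoded).equals(note));
    }
}
```

The round trip only holds if the placeholder really is absent from the input, which is why a character outside the normal English text range is the natural choice here.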

Re: sentence detector newline behavior

2014-01-24 Thread Jörn Kottmann
k at newline would be useful for at least some cTAKES users (and we could remove the cTAKES code that does that) -- James -Original Message- From: dev-return-2390-Masanz.James=mayo@ctakes.apache.org [mailto: dev-return-2390-Masanz.James=mayo@ctakes.apache.org] On Behalf Of Jörn Kottmann Sen

Re: sentence detector newline behavior

2014-01-25 Thread Jörn Kottmann
On 01/25/2014 01:33 PM, Miller, Timothy wrote: Thanks Joern, I'll try it. My understanding is I just need to give it my training data, with the special character I used replaced with the literal string "" and each line in the file is an example sentence. Yes, exactly. Just thinking about the

Re: sentence detector newline behavior

2014-01-25 Thread Jörn Kottmann
On 01/25/2014 03:03 PM, Miller, Timothy wrote: I'm running into one issue, it gets tripped up on sentences with line-ending spaces. I could easily remove them with a script but by default they are in there. It happens when a sentence example ends: ...BILAT HEMATOMAS. (There is a period, then

Re: sentence detector newline behavior

2014-01-26 Thread Jörn Kottmann
On 01/25/2014 10:03 PM, Miller, Timothy wrote: On 01/25/2014 12:24 PM, Jörn Kottmann wrote: The code which computes the spans tries to remove white space from it. Removing the white space from a whitespace-only sentence is causing the exception you are seeing. Which response would you expect

Re: sentence detector newline behavior

2014-01-27 Thread Jörn Kottmann
On 01/26/2014 11:29 PM, Miller, Timothy wrote: Yes, this fixes the whitespace sentence issue but the evaluation issue remains. I believe the problem is in SentenceSampleStream, where in the following block the whitespace trim happens before the character is replaced with the \n character. So tes
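
To illustrate why the order of the two steps matters, here is a small sketch in plain Java. It is not the OpenNLP SentenceSampleStream code; the class, the method names, and the placeholder character are assumptions made for the example. It only shows that trimming before the placeholder is replaced leaves a trailing newline in the sample, while trimming afterwards removes it.

```java
public class TrimOrder {
    // Hypothetical stand-in for the newline marker used in the training data.
    static final char PLACEHOLDER = '\u00B6';

    // Trim first: the placeholder is not whitespace, so it survives the trim
    // and then becomes a trailing newline.
    static String trimThenReplace(String line) {
        return line.trim().replace(PLACEHOLDER, '\n');
    }

    // Replace first: the newline now exists when trim() runs and is removed.
    static String replaceThenTrim(String line) {
        return line.replace(PLACEHOLDER, '\n').trim();
    }

    public static void main(String[] args) {
        String line = "BILAT HEMATOMAS." + PLACEHOLDER;
        System.out.println(trimThenReplace(line).endsWith("\n")); // true
        System.out.println(replaceThenTrim(line).endsWith("\n")); // false
    }
}
```

Either order may be the intended one depending on whether sentence spans are meant to include the trailing newline; the point is only that the two orders produce different strings, so training and evaluation have to use the same one.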

Re: sentence detector newline behavior

2014-01-29 Thread Jörn Kottmann
On 01/27/2014 08:44 PM, Tim Miller wrote: That is a good point, and something I was wondering about. Having now looked at both the ctakes and opennlp code for the sentence splitter it seems like there is a lot of overlap. I would've thought it was just a matter of converting annotations into

Re: sentence detector newline behavior

2014-01-29 Thread Jörn Kottmann
On 01/27/2014 03:52 PM, Tim Miller wrote: OK, with the most recent version I am able to replicate the performance I was getting before. Thanks a lot Jörn! Assuming this is in the next incremental release of opennlp, how quickly can we get a re-trained model into cTAKES? I am currently worki

Re: Brat

2014-02-11 Thread Jörn Kottmann
On 02/07/2014 07:49 PM, William Karl Thompson wrote: OpenNLP has some code that takes brat annotation files and creates "BratAnnotation" object instances. I've taken the code and modified it (simplified in some ways) to generate cTAKES annotations, using a "BratAnnotator" analysis engine that r

Re: ClearTK 2.0 upgrade

2014-06-03 Thread Jörn Kottmann
On 06/02/2014 11:31 PM, Steven Bethard wrote: Very likely, as the changes in package names mean that old classifiers will probably not deserialize. Yet another reason we probably shouldn't be using Java's object serialization, but I don't know of any alternative that actually solves the versionin