I completely agree with making cTAKES easier to use. It is exciting to hear the different use cases here and to understand which areas need improvement (including some we hadn't thought about earlier). I think Tim's suggestions and the 3 concrete actionable items make a lot of sense. Hopefully this will attract new users, adopters, and perhaps more committers.

> i) Make the typesystem forefront in documentation -- generate javadocs and have as a link on the ctakes frontpage/sidebar
> ii) Similar to the way that we are aiming to have tests in every module, also have clearly labeled examples in every module that set up a pipeline, run on sample notes (could be the same sample notes from the tests), and do something with the results.
> iii) Follow Giri's recommendation to have example training data for people who want to take the next step and train their own models

I think Java developers are accustomed to including a library as a dependency/jar, passing input through an API, and getting the results back as POJOs, so the examples could initially shield users from the complexity of wiring a pipeline together. If we can improve the APIs and how they integrate with other apps, we can add GUI/CLI tools on top afterwards.

--Pei

> -----Original Message-----
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> Sent: Friday, June 28, 2013 8:00 AM
> To: dev@ctakes.apache.org
> Subject: Re: Next cTAKES release (3.1)?
>
> Very interesting discussion. I think Giri is right about giving example training data in the format that our training code can read. While our ultimate goal would be to build and release models that are completely domain-independent, in the real world it is almost always better to use some domain-specific data and we should think more about how to facilitate that.
>
> As for making it easier to get started, it is not totally clear to me what this means/how to do it so it might be useful to get specific about what this means. I think our biggest hurdle is
>
> 1) Prerequisite of understanding UIMA/UIMAFit
>
> Since UIMAFit is officially becoming part of UIMA that will be easier, and hopefully people will just learn the easier (in my opinion) UIMAFit way than the standard UIMA way of doing things. Is there something we can be doing to make understanding UIMA easier?
> Or do we just need to say upfront that this is a prerequisite and hope that people don't give up due to this thing that is out of our control?
>
> Another hurdle is:
>
> 2) cTAKES is a multi-purpose developer-aimed tool
>
> So it's not just a matter of hiding complexity -- at some point people have to understand their problem, understand cTAKES' capabilities, and start coding. Pei's GUI will help for some common use cases but will not remove the requirement that someone at the organization knows cTAKES.
> I think one part of this problem is the fact that the typesystem is not well documented. A developer needs to know what the output is (objects from the typesystem), how to get them (which modules/pipelines), and what information is in them. So maybe on this end my recommendation would be:
> i) Make the typesystem forefront in documentation -- generate javadocs and have as a link on the ctakes frontpage/sidebar
> ii) Similar to the way that we are aiming to have tests in every module, also have clearly labeled examples in every module that set up a pipeline, run on sample notes (could be the same sample notes from the tests), and do something with the results.
> iii) Follow Giri's recommendation to have example training data for people who want to take the next step and train their own models
>
> This is quite a bit of developer overhead, so it's worth asking whether you agree with my "diagnosis" and "treatment" or whether you think there are different problems/solutions that should be higher priority.
>
> Tim
>
> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
> > Hi Vijay and Andy,
> >
> > Thanks for sharing those examples.
> >
> > "Trouble is, privacy requires that these examples be made up by hand"
> >
> > Agree with this statement and this is very valid concern.
> >
> > In "getting started examples", I think we should just have couple of entries (5-10 small entries), not more than that (with explicit statement like "ONLY EXAMPLE", NOT GOOD FOR REAL USAGE). I understand handcrafting these may not be easy because we are not medical domain experts, but I feel worth time, because it brings in more user community.
> >
> > Thank you,
> > Giri
> >
> > On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry <mcmurry.a...@gmail.com>wrote:
> >
> >> GREAT !
> >>
> >> The i2b2 data though isn't publicly distributable, you still need to request access to it since it is "semi private"
> >>
> >>
> >> On Jun 27, 2013, at 9:52 PM, vijay garla <vnga...@gmail.com> wrote:
> >>
> >>> We released code on using cTAKES to annotate clinical text and SVMs that use the annotations to classify clinical text from the CMC 2007 and I2B2 2008 challenges:
> >>>
> >>> We did the cmd 2007 with cTAKES 2.5:
> >>>
> >>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
> >>> <https://code.google.com/p/ytex/downloads/list>
> >>>
> >>> And the i2b2 2008 with the version of cTAKES distributed with the first version of ARC:
> >>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
> >>>
> >>> These are both publicly available datasets, and represent real-world problems (in general I believe when publishing a paper the code should be reproducible and made publicly available, but that's a different issue).
> >>>
> >>> When we get around to upgrading YTEX to cTAKES 3.1, we would like to upgrade these samples as well.
> >>>
> >>> Best,
> >>>
> >>> VJ
> >>>
> >>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry <mcmurry.a...@gmail.com> wrote:
> >>>
> >>>> +1 suggestion for documenting many examples of "getting started" +NLP datasets.
> >>>>
> >>>> I have at least one we can use that was created by our lead Pathologist
> >>>>
> >>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
> >>>> We should provide at least one sample for each domain.
> >>>> Trouble is, privacy requires that these examples be made up by hand and not copy-pasted from EMR systems.
> >>>>
> >>>> --Andy
> >>>>
> >>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <girinamb...@gmail.com> wrote:
> >>>>
> >>>>> +1 for this observation Andy!
> >>>>>
> >>>>> Lowering time will motive users in writing blogs about features, how to, etc., which reduces core team work load on documentation.
> >>>>>
> >>>>> I have been trying to write a small "how to write standalone client for ctakes" with my experience (I saw at least 4 users posted similar question in last 2 months), but not getting enough time because ctakes depends on lot of other frameworks (UimaFit, cleartk, UIMA Framework etc.,), most of my spare time is being spent on juggling between these frameworks, posting and browsing those forums, relating observations to ctakes code. I think we need to have some high level documentation about these (with links to corresponding forums).
> >>>>>
> >>>>> Above case is for developers (I think this will be more user base as ctakes progress), for users I think documentation is lot better though some improvements need to be done.
> >>>>>
> >>>>> As a developer I felt tough with lack of sample training data (I am still struggling in this area even though I browsed all relevant code), though training class are there.
> >>>>> I understood that there are licensing issues with REAL data, but at least some hand made example sentences, which may not be real but helps developers in understanding the type/structure of input TRAINING classes expecting. This way people who browse the code can reverse engineer and develop their own models. Sorry if you guys feel this as novice issue, but I feel most of the developers will be novice when they adopt a system and Machine Learning/NLP is ocean. Some documentation in this area will same lot of time for us.
> >>>>>
> >>>>> I wish there will be some activity in this area from ctakes core team.
> >>>>>
> >>>>> Thank you,
> >>>>> Giri
> >>>>>
> >>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry <mcmurry.a...@gmail.com> wrote:
> >>>>>
> >>>>>> ctakes is at a point where we have a LOT of features but it is still hard to get started.
> >>>>>>
> >>>>>> Judging from the mailing lists a lot of how cTakes works is not obvious and requires hand holding.
> >>>>>> This is very typical in early FOSS projects.
> >>>>>>
> >>>>>> Lowering the time to get invested in ctakes gets more users AND better bug reports, FAQ, etc.
> >>>>>>
> >>>>>> thoughts?
> >>>>>> --Andy
> >>>>>>
> >>>>>>
> >>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <pei.c...@childrens.harvard.edu> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>> I just wanted to gauge the interest of creating the next release of cTAKES (3.1) which is currently marked for May in Jira-
> >>>>>>> There have already been 22/53 issues [1] marked as fixed or closed.
> >>>>>>> Plenty of bug fixes and new components including:
> >>>>>>> - New CEM Instance Template population
> >>>>>>> - New Dependency Parser/Semantic Role Labeler
> >>>>>>> - New optional Clear POSTagger
> >>>>>>> - New regression testing component
> >>>>>>>
> >>>>>>> Should we wait for the Temporal component?
> >>>>>>>
> >>>>>>> [1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES