Hi all,

Sorry for the late reply; there was quite a bit to digest. I will add some thoughts now and more later. (I may not be super responsive, as I have to deal with a lot of email.)

Looks like the topic of the thread is growing the community. It is indeed hard; I have spent a ton of time building strong communities for a few projects in the past. Something that is sometimes not well understood is that every open source project (the ASF is no exception) actually has two communities: users and developers. The ASF recognizes that by providing separate mailing lists for the two. It is hard to grow dev@ without growing users@, and I would focus on that. Most of the ideas proposed only address dev@. Growing users@ means a few hard and unrewarding tasks:
* clearly articulate the value proposition of the project, target audience and benefits (not features)
* easy to 'get started' (ideally under 5 min)
* easy to post feedback (comments/bug reports)
* responsiveness in fixing issues
* and yes, documentation

Things like migration to git, in my experience, address the middle bullets. (However, I would think very hard about removing the binaries from the code base before moving to git).

Personally, I do have some good connections in healthcare and I may be able to help. I gave a presentation at the OSEHRA Summit in June and was surprised that almost nobody was aware of cTAKES, even though NLP was mentioned quite a bit. Last week I was at the HSPC meeting in Indianapolis where, again, very few knew about cTAKES.

If the project is interested I could find more resources for the project and help with growing adoption.

Cheers,
Hadrian



On 11/20/2017 10:41 AM, Finan, Sean wrote:
Hi Alex,
Some great ideas, all of which are deserving of comment.

   - There is code commented out, but much of this code seems to still be
    valuable, like it was commented from some migrations and was left over for
    somebody to follow-up (e.g. unit tests).
True.  Some intelligence is required.  When in doubt, leave it - but there are a lot of 
things that are obviously moved or old rewritten code.  This is all volunteer and just 
getting people involved with "baby steps" would be great.  I would also hope 
that some inactive authors come back and clean up comments in their own code.  Or write 
those unit tests if that was the intention.  There are  TODO comments in the code that 
could be tackled.

   - There are issues reported by SonarQube [1] like:
This should be handled with kid gloves.  A lot of those reports cover items 
that are not yet complete, ordered for easier following / understanding of 
code, etc.  However a lot can be handled easily and quickly, like adding 
@Override ...  People can use local plugins that check code, like FindBugs.  I 
used to be religious about it but have become lax.  This is a good reminder for 
me to start again.

   - Removal of hardcoded paths like: "/tmp",
I am in complete agreement.  Things like /tmp should probably even be 
refactored to use temp files.  Things like default paths used in static 
createAnnotatorDescription() should instead probably be used in 
@ConfigurationParameter default= ...
--- Building upon that statement, it would be nice to migrate older annotation 
engines, readers, and cas consumers to the uimafit paradigm.  This would help a 
newbie understand the difference and how to use AEs, etc.
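
To make the temp-file suggestion above concrete, here is a minimal sketch in plain Java. The class and method names (ScratchFiles, createScratchFile) are hypothetical and not part of cTAKES; the only real API relied on is java.nio.file.Files.createTempFile, which creates the file in the platform's default temporary-file directory (normally the java.io.tmpdir system property) instead of a hardcoded "/tmp".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not actual cTAKES code.
final class ScratchFiles {
    private ScratchFiles() {}

    // Creates a scratch file in the default temp directory rather than
    // assuming that a directory such as "/tmp" exists on every platform.
    static Path createScratchFile(String prefix) {
        try {
            Path file = Files.createTempFile(prefix, ".txt");
            // Ask the JVM to remove the file on normal exit.
            file.toFile().deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path scratch = createScratchFile("ctakes-sketch-");
        Files.write(scratch, "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println("Wrote " + Files.size(scratch) + " bytes to " + scratch);
    }
}
```

The same call works on Windows, where "/tmp" usually does not exist, which is the main reason to prefer it over a literal path.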

   - Migrate scripts from Ant (files like build-*.xml) to maven.
Does ctakes have these?  I guess that I've missed them.  Yeah, full maven would 
be nice.

   - Deprecated code
We certainly have a lot of it.  It is a good excuse to make unit tests before 
updating.

   - I think it is time to define some conventions for:
       - formatting (indentation),
       - crlf conventions (see .gitattributes)
       - etc
You are correct; indentation and crlf should also be settable by a decent IDE 
for any VCS.  I think that most ctakes code is space indented, 3 per 
indentation, and \n only for newlines.  I could be wrong.
Things that are more stylistic (naming, ordering, etc.) are much more a matter 
of coder preference.  I would rather have contributions than turn people off 
with strictures.  I'll even take things like missing { } ...  though there is 
another great target for refactoring ...

   - For git vs Subversion, I am able to use the same folder with a .git
Thanks for the documentation!  As an Apache project we would need to vote on 
fully moving to git (as Tim and Dave suggested).  I am definitely not opposed 
to that - I use github for everything else these days ...

   - There are commits without any reference to Jira issues or other type
Guilty as charged.  A lot of my commits are new development and I only write 
commit comments.  I could open a jira for each, but I am admittedly lazy about 
such things.  Ditto for placing links in an email appendix.

Also, based on the decision to use semantic versioning, we
    will need to choose between 4.0.1 and 4.1.0.
Personally I think that our next release should be 4.1.0 as there are enough 
new features to distinguish that it isn't just a patch release.
http://semver.org/
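
For anyone unfamiliar with the rule behind that choice, here is a small sketch of how a MAJOR.MINOR.PATCH number moves under semantic versioning. The class SemVerBump is hypothetical, and the starting point "4.0.0" is an assumption inferred from the two candidates mentioned above.

```java
// Hypothetical helper for illustration only; not actual cTAKES code.
final class SemVerBump {
    enum Change { PATCH, MINOR, MAJOR }

    private SemVerBump() {}

    // Returns the next version for the given kind of change:
    // PATCH = backwards-compatible bug fixes only,
    // MINOR = backwards-compatible new features (patch resets to 0),
    // MAJOR = incompatible API changes (minor and patch reset to 0).
    static String next(String current, Change change) {
        String[] parts = current.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected MAJOR.MINOR.PATCH: " + current);
        }
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        int patch = Integer.parseInt(parts[2]);
        switch (change) {
            case MAJOR:
                return (major + 1) + ".0.0";
            case MINOR:
                return major + "." + (minor + 1) + ".0";
            default:
                return major + "." + minor + "." + (patch + 1);
        }
    }

    public static void main(String[] args) {
        // Assumed current release: 4.0.0
        System.out.println(next("4.0.0", Change.PATCH)); // bug fixes only
        System.out.println(next("4.0.0", Change.MINOR)); // new features
    }
}
```

Under that rule, a release that adds features in a backwards-compatible way bumps the minor number, which matches the 4.1.0 preference above.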

Thanks,
Sean

-----Original Message-----
From: Alexandru Zbarcea [mailto:al...@apache.org]
Sent: Monday, November 20, 2017 8:34 AM
To: Apache cTAKES Dev; Hadrian Zbarcea
Subject: Re: Contribute to ctakes: it is in your best interests! RE: unknown 
dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS]

Hi,

To grow the community and bring even more adoption is my desire, too. I cannot 
agree more with what you said, Sean, Tim.

I have discussed cTAKES adoption with Hadrian (an Apache member) and I think 
he has great ideas about the priorities for this community to grow. I would like 
to introduce him to the community and let him express some ideas.

In regards to the technical issues that were already identified on this 
thread, I would like to understand your perspective and prioritization.

    - There is code commented out, but much of this code seems to still be
    valuable, like it was commented from some migrations and was left over for
    somebody to follow-up (e.g. unit tests).
    - There are issues reported by SonarQube [1] like:
       - 3.3K bugs [2]
       - 16.5% code duplication (24K LoC) [3]
       - 174 bugs in the last month [4]


    - I would like to see more unit tests for the code. There are new
    commits unrelated to any feature description, so there is no clear
    understanding of what the review should focus on. I think this relates to
    the same request from Sean to have "sanity-test type unit tests - Little
    two or three-line "does this method crack" tests.". I see this task as one
    of the most important ones.
    - Removal of hardcoded paths like: "/tmp",
    "C:/Users/<some-user>/<some-path>".
    - Migrate scripts from Ant (files like build-*.xml) to Maven. These
    scripts make the code so unpredictable. I find it difficult to navigate
    through them when tests are dependent upon these executions.
    - Classpaths manually specified.
    - Deprecated code
    - Old libraries which involve security risks in production (e.g. Spring
    that was just upgraded)

Other tasks that are related more to productivity.

    - I think it is time to define some conventions for:
       - formatting (indentation),
       - crlf conventions (see .gitattributes)
       - etc
    - For git vs Subversion, I am able to use the same folder with a .git
    and .svn VCS and documented on the wiki [5].
    - There are commits without any reference to Jira issues or other types
    of documentation. As a consequence, when a release comes, it will be very
    hard to hunt down those changes and understand why those commits were made: bugs
    vs features. Also, based on the decision to use semantic versioning, we
    will need to choose between 4.0.1 and 4.1.0.

My $0.02,
Alex

[1] - https://builds.apache.org/analysis/overview?id=org.apache.ctakes%3Actakes
[2] - https://builds.apache.org/analysis/component_issues?id=org.apache.ctakes%3Actakes#resolved=false%7Ctypes=BUG
[3] - https://builds.apache.org/analysis/component_measures/metric/duplicated_blocks/list?id=org.apache.ctakes%3Actakes
[4] - https://builds.apache.org/analysis/component_issues?id=org.apache.ctakes%3Actakes#resolved=false%7Ctypes=BUG%7CsinceLeakPeriod=true
[5] - https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0+Developer+Install+Guide#cTAKES4.0DeveloperInstallGuide-Subversion+Git



On Mon, Nov 20, 2017 at 6:32 AM, Miller, Timothy < 
timothy.mil...@childrens.harvard.edu> wrote:

Git is available to Apache projects, and many projects have moved over
(see here: https://git-wip-us.apache.org/repos/asf).
Here is the general info on what that looks like:
https://www.apache.org/dev/writable-git

A few points from that link:
Projects can request moving to Git as their main code repository, by
creating an INFRA issue. See also the infra-contact page.
Projects can request new, blank repositories by using reporeq.apache.org.
The current system has basic git support only. We are working on
extending this service in the near future.
Custom commit or other hooks will not be supported, all projects get the
same hooks. Setting up gitpubsub should provide sufficient flexibility
without impacting the core Git setup, volunteers are welcome to make
that happen.

(Not sure what basic support only means.)

There are also read-only git repos available by default for every
project and updated in near-real-time:
https://www.apache.org/dev/git.html

With those, I guess the suggested workflow is to work off of that repo
and then just submit patches to someone who commits with svn rather
than committing directly.

I've been using the git-svn connector myself recently since I just
vastly prefer the git lightweight branching for focused development,
as it helps me keep a cleaner working directory. But that adds some
additional annoying steps.

Tim

________________________________________
From: Finan, Sean <sean.fi...@childrens.harvard.edu>
Sent: Saturday, November 18, 2017 1:23 PM
To: dev@ctakes.apache.org
Subject: RE: Contribute to ctakes: it is in your best interests! RE:
unknown dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS]

Hi Dave,

Those are some great thoughts.  Being an apache project I am not sure
how far we can move from svn, but there may be a way.  You are not the
first to voice this desire for an active github repo and I'm sure that
you won't be the last.

I completely agree with your discussion board preference.  Do you have
any recommendations?

You make a great point regarding documentation.  In reference to
things that anybody can quickly contribute ... that would be a big one.
Volunteers?!?

I am really happy to hear that you want to contribute - more than you
already have, which is actually quite a bit!

Cheers,
Sean

-----Original Message-----
From: David Kincaid [mailto:kincaid.d...@gmail.com]
Sent: Saturday, November 18, 2017 1:10 PM
To: dev@ctakes.apache.org
Subject: Re: Contribute to ctakes: it is in your best interests! RE:
unknown dependencies [EXTERNAL] [SUSPICIOUS]

Sean, I can share a couple things that have been an obstacle for me.
It may seem a minor point to some, but I left Subversion behind years
ago and really have no desire to go back. If the project were moved
over to Git/Github it would really smooth the way for me at least. I
would be happy to help out with this. One of the other things I would
really like to see is the mailing list moved onto a discussion board
platform. It seems to me that a discussion board style of tool tends
to create a more active community than a mailing list does.

The other thing that might help get new people involved is making it
easier to find information about the development environment. Things
like branching strategies, coding conventions, etc are really hard to
find from the main cTAKES web site. I saw some references to Jenkins
builds recently on the list. I had no idea there was a Jenkins CI
server for the project somewhere. It also takes some digging to find a
link to Jira. Maybe we could create a Wiki page that describes where
all these tools are and how they are used.

You guys have really done some great work over the last couple of
years cleaning up the code base and improving the documentation by a
ton. Things like the fast dictionary annotator and the dictionary creator GUI
are great additions and make it a lot easier for other people to get
up and running more quickly. As I'm ramping up my research as well as
some proof of concept stuff at work I'll be working more and more with
cTAKES and would love to contribute more to the project.

Just my thoughts.

- Dave


On Sat, Nov 18, 2017 at 11:10 AM, Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

Hi Tim, Alex,

Great ideas.  I like your (Tim) idea to
1. start with commented code removal.
Then maybe move on to
2. sanity-test type unit tests - Little two or three-line "does this
method crack" tests.
And another that is simply
3. "populate a test cas with type(s) X" and a factory with
"getSectionTestCas" "getSentenceTestCas" "getPosTestCas" "getChunkTestCas"
...  just really simple reusables for tests.
Then
4. refactor to extract and consolidate duplicate code - it is all
over the place ...
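
As a rough illustration of item 2, a "does this method crack" check can be as small as the sketch below. It is written in plain Java without a test framework so that it stands alone; the tokenize method is a hypothetical stand-in for whatever small utility is being checked and is not actual ctakes code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical example for illustration only; not actual ctakes code.
final class SanitySketch {
    private SanitySketch() {}

    // Stand-in method under test: splits text on whitespace and
    // returns an empty list for null or blank input.
    static List<String> tokenize(String text) {
        if (text == null || text.trim().isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // The sanity check itself: feed edge cases and ordinary input,
    // and report whether the method got through without throwing.
    static boolean survivesEdgeCases() {
        try {
            tokenize(null);
            tokenize("");
            tokenize("   ");
            tokenize("chest pain");
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("tokenize survives edge cases: " + survivesEdgeCases());
    }
}
```

In a real module the same few lines would normally live in a unit test class, with the edge cases chosen to match the method being exercised.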

These are just my initial thoughts and suggestions, but I think that
those 4 tasks can be performed by anybody of any experience level.  They
build upon each other and should help the implementers better understand
ctakes.  After that the sky is the limit.

A couple of years ago I sat on a panel at a workshop for open source
scientific software.  For the half dozen or so highlighted projects
(ctakes was one!) the common thread was that getting people to
contribute is extremely difficult.
I have a tendency to assume that people always act in their best
interests.  Any student thinking of going towards industry should be
jumping at the opportunity to contribute to a large,
production-quality project.  They should also realize that
contribution means potential recommendation (and possibly hiring
interest) by established developers, physicians and researchers that
use ctakes.  Even just answering questions on a user or dev list
creates credibility and can build a network.
Active researchers could discover common thoughts and directions
that could lead to collaboration outside ctakes.  Researchers and
companies trying to build upon open source should realize that
direct contribution is easier than custom substitution.  Plus, it is
in their best interests that code does what they need it to do in
the fastest, lightest, most stable way possible.
With a project like ctakes there are a lot of things that can be
done, and there are great opportunities to really shine.  "I wrote this
tool for my thesis that performs some nlp task" sounds good.
Appending "in an Apache product and it has been taken up by thousands
across the globe" makes it sound a lot better.
At my previous job in industry the company actively contributed to
several open source projects.  We had a few people for whom that was
50% of their job.  Why?  Because we made a commitment to use that
open source software.
It was a better use of our resources to contribute to it, improve it
and keep its momentum going and prevent it from becoming stale (or
abandoned) while our software continued to move forward.

Hmm, that was a touch more than I had planned to write.  A whole cup
of coffee in that one.

Sean




-----Original Message-----
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
Sent: Saturday, November 18, 2017 8:13 AM
To: dev@ctakes.apache.org
Subject: Re: unknown dependencies [EXTERNAL] [SUSPICIOUS]

Thanks Alex, looks like that was probably a fat-fingered auto-import
on my part.

I like your idea, and I don't know the best way to start either,
but maybe one suggestion is to start with one or two focused things
to clean up, and then ask for volunteers to take on specific modules?
Then people can contribute an hour here and there to do cleanup on
their task/module and try to fix that thing in a 1-2 month
sprint. I am happy to contribute to cleanup, as I am responsible for my
fair share of unclean code, but since I don't have strong software
engineering chops it would be good to have people with that
background propose the tasks and describe exactly what needs to be
done. My idea of cleaning is just to delete commented-out sections of
evaluation code.

Tim

________________________________________
From: Alexandru Zbarcea <al...@apache.org>
Sent: Friday, November 17, 2017 4:46 PM
To: Apache cTAKES Dev
Subject: unknown dependencies [EXTERNAL]

Hi,

I noticed that a wrong dependency has slipped into the code:
jdk.internal.org.objectweb.asm.commons.AnalyzerAdapter;

Now that the Jenkins build is successful, I think it is easier to
clean up the code. I would like this to be a common effort. I don't know
the best way to approach this.

Looking forward to your advice,
Alex

