Re: dictionary-look-fast fails to handle alternative CUIs

britt fitch Thu, 09 Jul 2015 05:55:52 -0700

I don’t think it is too much of a constraint, at least initially, to require 
all CUI values to have a consistent length for a given prefix.


Thanks Sean, let me know if there is any part of this you’d like a hand with.

Cheers,

Britt


Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
britt.fi...@wiredinformatics.com

> On Jul 8, 2015, at 7:16 PM, Finan, Sean <sean.fi...@childrens.harvard.edu> 
> wrote:
> 
> Hi Britt,
> 
> You’ve got it exactly.
> 
> I actually started working on this right before a meeting right before I left 
> work right before I went to the store … but I’m now back to it and I’m going 
> to move forward with the tiny bit that I’ve got.  I don’t think that it will 
> take too long …
> 
> One reason that I like the “pair” idea is that something like “CN123456” 
> won’t get converted to “CN0123456” by assuming that it is a seven digit 
> numerical base. Likewise somebody could make a tiny dictionary with “SEAN01, 
> SEAN02, SEAN03…” through 99.  Then their output would still be formatted as 
> “SEAN01 .. SEAN99”.  They couldn’t mix in “SEAN1, SEAN2 …” though.  Is that 
> too much of a constraint?  Hmmm.  Well, I’m going to push forward with this 
> idea.
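
The fixed-width-per-prefix rule described above can be illustrated with a minimal Java sketch. This is not code from cTAKES: the class `PrefixWidthRegistry` and its methods `register` and `format` are hypothetical names, and the sketch assumes a CUI is a non-digit prefix followed by a run of digits.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not cTAKES code): remembers, per prefix, how many digits its CUIs use.
final class PrefixWidthRegistry {

    private final Map<String, Integer> widthByPrefix = new HashMap<>();

    // Splits a CUI into prefix and trailing digits, records the digit count for the
    // prefix, and rejects a later CUI whose digit count differs for the same prefix.
    long register(String cui) {
        int split = cui.length();
        while (split > 0 && Character.isDigit(cui.charAt(split - 1))) {
            split--;
        }
        String prefix = cui.substring(0, split);
        String digits = cui.substring(split);
        if (digits.isEmpty()) {
            throw new IllegalArgumentException("No numeric part: " + cui);
        }
        Integer known = widthByPrefix.putIfAbsent(prefix, digits.length());
        if (known != null && known != digits.length()) {
            throw new IllegalArgumentException(
                    "Prefix " + prefix + " already uses " + known + " digits: " + cui);
        }
        return Long.parseLong(digits);
    }

    // Rebuilds the text form, zero-padded to the width recorded for the prefix.
    String format(String prefix, long number) {
        Integer width = widthByPrefix.get(prefix);
        if (width == null) {
            throw new IllegalArgumentException("Unknown prefix: " + prefix);
        }
        return prefix + String.format("%0" + width + "d", number);
    }
}
```

Under these assumptions, registering “CN123456” and “SEAN01” fixes the widths at 6 and 2 digits, so “CN123456” is never padded to seven digits, and a later “SEAN1” is rejected rather than silently reformatted.
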
> 
> I’ll check in whatever I get done tonight.
> 
> Cheers,
> Sean
> 
> 
> From: britt fitch [mailto:britt.fi...@wiredinformatics.com]
> Sent: Wednesday, July 08, 2015 4:21 PM
> To: dev@ctakes.apache.org
> Subject: Re: dictionary-look-fast fails to handle alternative CUIs
> 
> Thanks for the details Sean. I had assumed the conversion to Long was related 
> to sort/search efficiency but that makes sense.
> 
> I had been thinking of something similar with parsing out the non-numerals 
> and converting them to 2-digit numeral values. i.e. “a” = 01, “z” = 26. 
> Ultimately CN123456 would become 0314123456 but I don’t think it’s 
> sophisticated enough to avoid issues with leading zeros. We could prepend a 9 
> to it to avoid losing digits and use something like:
> 
> if (length > 8 && begins with 9)
>     discard the 9
>     while (length > 8)
>         convert the first 2 digits to a letter
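
The letter-to-two-digit idea, including the leading 9, can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name `LetterDigitCodec` is hypothetical, every CUI is assumed to be exactly 8 characters long, and letters are assumed to be uppercase A-Z and to appear only at the start.

```java
// Hypothetical sketch of the letter-to-two-digit idea; not cTAKES code.
// Assumes 8-character CUIs whose letters (A-Z) all come before the digits.
final class LetterDigitCodec {

    static long encode(String cui) {
        // The leading 9 is a sentinel so that "03..." does not lose its zero as a long.
        StringBuilder digits = new StringBuilder("9");
        for (char c : cui.toCharArray()) {
            if (c >= '0' && c <= '9') {
                digits.append(c);
            } else if (c >= 'A' && c <= 'Z') {
                digits.append(String.format("%02d", c - 'A' + 1)); // A -> 01 ... Z -> 26
            } else {
                throw new IllegalArgumentException("Unsupported character in " + cui);
            }
        }
        return Long.parseLong(digits.toString());
    }

    static String decode(long code) {
        String text = Long.toString(code);
        if (text.length() <= 8 || text.charAt(0) != '9') {
            throw new IllegalArgumentException("Not an encoded CUI: " + code);
        }
        String rest = text.substring(1); // discard the sentinel 9
        StringBuilder cui = new StringBuilder();
        // Each converted pair shortens the total by one character; stop at 8.
        while (cui.length() + rest.length() > 8) {
            int letterIndex = Integer.parseInt(rest.substring(0, 2));
            cui.append((char) ('A' + letterIndex - 1));
            rest = rest.substring(2);
        }
        return cui.append(rest).toString();
    }
}
```

With these assumptions, “CN123456” encodes to 90314123456 and decodes back to the same string. Without the sentinel, 0314123456 parsed as a long would become 314123456, which is the leading-zero problem mentioned above.
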
> 
> I think your suggestion sounds good to me. To run the example through:
> 
> “NLM300” gets parsed to “NLM” + “300”
> Store Pair<Integer,String>(3, NLM) at Pair[0]
> Produce a Long of 0 * 10000000 + 300 = 300L
> Backtrack to the actual “CUI”: floor(300 / 10000000) = 0L (the index into Pair[])
> 300L - 0L * 10000000 = 300L
> Pair[0] = NLM
> CUI = NLM + 300
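
The walk-through above can be written out as a small round trip in Java. This is a sketch under assumptions, not the implementation that was checked in: `PrefixIndexCodec` is a hypothetical name, the block size of 10,000,000 is taken from the arithmetic in the example, and the stored integer is assumed to be the digit count of the numeric part.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the prefix-index scheme from the example; not cTAKES code.
final class PrefixIndexCodec {

    private static final long BLOCK = 10_000_000L; // room for 7 digits per prefix

    private final List<String> prefixes = new ArrayList<>();
    private final List<Integer> widths = new ArrayList<>();

    // Encodes as (index of prefix) * BLOCK + numeric part, adding the prefix on first use.
    long encode(String prefix, String digits) {
        if (digits.isEmpty() || digits.length() > 7) {
            throw new IllegalArgumentException("Expected 1 to 7 digits: " + digits);
        }
        int index = prefixes.indexOf(prefix);
        if (index < 0) {
            prefixes.add(prefix);
            widths.add(digits.length());
            index = prefixes.size() - 1;
        }
        return index * BLOCK + Long.parseLong(digits);
    }

    // Recovers the prefix index by division and the numeric part by remainder,
    // then pads the number back to the width stored for that prefix.
    String decode(long code) {
        int index = (int) (code / BLOCK);
        long number = code % BLOCK;
        return prefixes.get(index) + String.format("%0" + widths.get(index) + "d", number);
    }
}
```

In this sketch the first prefix seen gets index 0, so `encode("NLM", "300")` yields 300 and `decode(300)` rebuilds “NLM300”, matching the example. Keeping the digit width next to the prefix is what would let a value such as “SEAN01” come back with its leading zero.
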
> 
> In that case, do we need to store it as a Pair at all or is just storing the 
> prefix in a String[] sufficient?
> 
> I’m happy to start working on this unless you have a preference for splitting 
> it out into multiple tasks.
> 
> 
> On Jul 8, 2015, at 2:54 PM, Finan, Sean <sean.fi...@childrens.harvard.edu> 
> wrote:
> 
> By the way, in case you are wondering why it does this … the UMLS database 
> that we use has roughly half a million CUIs.  Storing CUIs in the various 
> tables as longs takes up a lot less space than storing them as 8-character 
> strings.
> 
> From: britt fitch [mailto:britt.fi...@wiredinformatics.com]
> Sent: Wednesday, July 08, 2015 2:23 PM
> To: dev@ctakes.apache.org
> Subject: dictionary-look-fast fails to handle alternative CUIs
> 
> This is largely directed to Sean but open to other feedback as well.
> 
> The current fast lookup using a BSV parses the first field as “C” and up to 7 
> numerals, padding with “0” as needed to reach that length when applicable 
> [see CuiCodeUtil.getCuiCode(String)]
> 
> The CUI string is then substring’d from 1 to len and parsed as a Long.
> 
> This is producing issues with other related, but separate, ontologies 
> (MedGen) where the bulk of concepts use UMLS CUIs but some additional 
> concepts were created by the NCBI where no CUI previously existed.
> These MedGen-specific concepts are created with a prefix “CN” + 6 numerals, 
> resulting in “N123456” failing to produce a Long.
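
To make the failure concrete, here is a minimal Java sketch of the parsing step as described in this message. It is not the actual `CuiCodeUtil` source: the class `LegacyCuiSketch` and method `toCode` are hypothetical and only mirror the description (drop the first character, parse the rest as a long); the zero-padding step is left out.

```java
// Hypothetical sketch of the behavior described in this thread; not the cTAKES source.
final class LegacyCuiSketch {

    // Drops the first character (assumed to be "C") and parses the rest as a long.
    static long toCode(String cui) {
        return Long.parseLong(cui.substring(1));
    }

    public static void main(String[] args) {
        System.out.println(toCode("C0000123"));      // standard "C" + 7 numerals
        try {
            toCode("CN123456");                      // MedGen-style "CN" + 6 numerals
        } catch (NumberFormatException e) {
            System.out.println("Cannot parse: " + e.getMessage());
        }
    }
}
```

The second call fails because the substring “N123456” still contains a letter, so `Long.parseLong` throws a `NumberFormatException`.
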
> 
> I wanted Sean’s thoughts on this, and some feedback on whether others are 
> running into this issue and whether the community wants a solution that 
> supports CUI formats beyond the standard C + 7 numerals.
> 
> I’m happy to make these edits and check them in, whether that means updating 
> the CuiCodeUtil class or creating an entirely new BSVConceptFactory, if that’s 
> what makes the most sense.
> 
> Thoughts?
> 
> 
