Thanks for the details, Sean. I had assumed the conversion to Long was related 
to sort/search efficiency, but that makes sense.

I had been thinking of something similar: parsing out the non-numerals and 
converting them to 2-digit numeral values, e.g. “a” = 01, “z” = 26. Ultimately 
CN123456 would become 0314123456, but I don’t think it’s sophisticated enough to 
avoid issues with leading zeros. We could prepend a 9 to it to avoid losing 
digits and use something like:

if(length>8 && begins with 9)
        discard 9
        while (length > 8)
                convert first 2 numbers to a letter
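
For illustration, here is a minimal Java sketch of that idea, not a tested 
patch. The class and method names are hypothetical (nothing here exists in 
CuiCodeUtil), and it assumes every CUI is exactly 8 characters long, with one 
or more leading letters A-Z followed by numerals:

```java
// Hypothetical sketch: encode letters as 2-digit values behind a leading 9.
final class LetterDigitCuiCodec {
    // Assumption: a decoded CUI is always 8 characters (letters + numerals).
    private static final int CUI_LENGTH = 8;

    // "CN123456" -> 9 + "03" + "14" + "123456" -> 90314123456L
    public static long encode(String cui) {
        StringBuilder digits = new StringBuilder("9");
        int i = 0;
        while (i < cui.length()) {
            char c = Character.toUpperCase(cui.charAt(i));
            if (c < 'A' || c > 'Z') {
                break;
            }
            // A = 01 ... Z = 26, always two digits wide
            digits.append(String.format("%02d", c - 'A' + 1));
            i++;
        }
        digits.append(cui, i, cui.length());
        return Long.parseLong(digits.toString());
    }

    public static String decode(long code) {
        String text = Long.toString(code);
        if (text.length() <= CUI_LENGTH || text.charAt(0) != '9') {
            // Plain UMLS-style code: "C" + 7 zero-padded numerals
            return String.format("C%07d", code);
        }
        String rest = text.substring(1); // discard the 9
        StringBuilder letters = new StringBuilder();
        // Convert 2 numerals to a letter until the whole CUI is 8 characters
        while (letters.length() + rest.length() > CUI_LENGTH) {
            int value = Integer.parseInt(rest.substring(0, 2));
            letters.append((char) ('A' + value - 1));
            rest = rest.substring(2);
        }
        return letters + rest;
    }
}
```

The leading 9 is what keeps the zero in “03” from being dropped when the value 
is stored as a Long.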

Your suggestion sounds good to me. To run the example through:

“NLM300” gets parsed to “NLM” + “300”
Store Pair<Integer,String>(3, NLM) at Pair[0]
Produce a Long of 0 * 10000000 + 300 = 300L
Backtrack to the actual “CUI”: floor(300/10000000) = 0L
300L - 0L = 300L
Pair[0] = NLM
CUI = NLM + 300
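
A rough Java sketch of that walk-through, keeping only the prefix in a list 
rather than a Pair. All names are hypothetical. It assumes the numeric part 
fits in 7 digits, and it does not restore leading zeros in the numeric part:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: prefix index * 10000000 + numeric part.
final class PrefixTableCuiCodec {
    // Assumption: the numeric part never exceeds 7 digits.
    private static final long PREFIX_MULTIPLIER = 10_000_000L;

    // Position in this list is the prefix index used in the code.
    private final List<String> prefixes = new ArrayList<>();

    public long encode(String cui) {
        // Split at the first numeral: "NLM300" -> "NLM" + "300"
        int split = 0;
        while (split < cui.length() && !Character.isDigit(cui.charAt(split))) {
            split++;
        }
        String prefix = cui.substring(0, split);
        long number = Long.parseLong(cui.substring(split));
        if (number >= PREFIX_MULTIPLIER) {
            throw new IllegalArgumentException("Numeric part too long: " + cui);
        }
        int index = prefixes.indexOf(prefix);
        if (index < 0) {
            prefixes.add(prefix);
            index = prefixes.size() - 1;
        }
        return index * PREFIX_MULTIPLIER + number;
    }

    public String decode(long code) {
        // floor(code / 10000000) recovers the prefix index
        int index = (int) (code / PREFIX_MULTIPLIER);
        long number = code - index * PREFIX_MULTIPLIER;
        return prefixes.get(index) + number;
    }
}
```

With “NLM” registered first it gets index 0, so “NLM300” encodes to 300L and 
decodes back to “NLM300”, matching the steps above.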

In that case, do we need to store it as a Pair at all or is just storing the 
prefix in a String[] sufficient?

I’m happy to start working on this unless you have a preference for splitting 
it out into multiple tasks.



Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
britt.fi...@wiredinformatics.com

> On Jul 8, 2015, at 2:54 PM, Finan, Sean <sean.fi...@childrens.harvard.edu> 
> wrote:
> 
> By the way, in case you are wondering why it does this … the UMLS database 
> that we use has roughly half a million CUIs.  Storing CUIs in the various 
> tables as longs takes up a lot less space than storing them as 8-character 
> strings.
> 
> From: britt fitch [mailto:britt.fi...@wiredinformatics.com]
> Sent: Wednesday, July 08, 2015 2:23 PM
> To: dev@ctakes.apache.org
> Subject: dictionary-lookup-fast fails to handle alternative CUIs
> 
> This is largely directed to Sean but open to other feedback as well.
> 
> The current fast lookup using a BSV parses the first field as “C” and up to 7 
> numerals, padding with “0” as needed to reach that length when applicable 
> [see CuiCodeUtil.getCuiCode(String)]
> 
> The CUI string is then substring’d from 1 to len and parsed as a Long.
> 
> This is producing issues with other related, but separate, ontologies 
> (MedGen) where the bulk of concepts use UMLS CUIs but some additional 
> concepts were created by the NCBI where no CUI previously existed.
> These MedGen-specific concepts are created with a prefix “CN” + 6 numerals, 
> resulting in “N123456” failing to produce a Long.
> 
> I wanted Sean’s thoughts on this and to get some feedback on whether others 
> are running into this issue and whether the community wants a solution that 
> supports CUI formats beyond the standard C + 7 numerals.
> 
> I’m happy to make these edits and check them in, whether that means updating 
> the CuiCodeUtil class or creating an entirely new BSVConceptFactory if that’s 
> what makes the most sense.
> 
> Thoughts?
