Thanks for the details, Sean. I had assumed the conversion to Long was for sort/search efficiency, but the storage savings make sense.
I had been thinking of something similar: parse out the non-numeral characters
and convert each to a 2-digit numeral value, i.e. “a” = 01, “z” = 26. CN123456
would then become 0314123456, but I don’t think it’s sophisticated enough to
avoid issues with leading zeros. We could prepend a 9 to it to avoid losing
digits and use something like:
    if (length > 8 && begins with 9)
        discard 9
    while (length > 8)
        convert first 2 numbers to a letter
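
A minimal Java sketch of the letter-to-digits idea above, under some assumptions that are not stated in the thread: every CUI is exactly 8 characters, all letters are uppercase and come before the numerals, and "length" in the loop means the length of the CUI string being rebuilt (letters already recovered plus digits remaining). The class and method names are hypothetical and not part of cTAKES.

```java
// Hypothetical helper, not an existing cTAKES class.
class LetterDigitCuiCodec {
    // Assumption: all CUIs are 8 characters, e.g. C1234567 or CN123456.
    static final int CUI_LENGTH = 8;

    // Encode each leading letter as 2 digits (A = 01 ... Z = 26), keep the
    // numerals, and prepend a 9 so the leading zero of "03" is not lost.
    static long encode(String cui) {
        if (cui.length() != CUI_LENGTH || !cui.matches("[A-Z]+[0-9]+")) {
            throw new IllegalArgumentException("Unsupported CUI: " + cui);
        }
        StringBuilder digits = new StringBuilder("9");
        for (char c : cui.toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                int value = c - 'A' + 1;
                if (value < 10) {
                    digits.append('0');
                }
                digits.append(value);
            } else {
                digits.append(c);
            }
        }
        return Long.parseLong(digits.toString());
    }

    // Reverse of encode: discard the 9, then turn digit pairs back into
    // letters until the rebuilt string is CUI_LENGTH characters long.
    static String decode(long code) {
        String digits = Long.toString(code);
        if (digits.length() > CUI_LENGTH && digits.charAt(0) == '9') {
            digits = digits.substring(1);
        }
        StringBuilder letters = new StringBuilder();
        while (letters.length() + digits.length() > CUI_LENGTH) {
            int value = Integer.parseInt(digits.substring(0, 2));
            letters.append((char) ('A' + value - 1));
            digits = digits.substring(2);
        }
        return letters + digits;
    }

    public static void main(String[] args) {
        long code = encode("CN123456");
        System.out.println(code + " <-> " + decode(code));
    }
}
```

With these assumptions CN123456 encodes to 90314123456 and decodes back to CN123456. The fixed 8-character length is what tells the decoder how many digit pairs are letters, so this sketch would not handle an identifier such as NLM300.
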
I think your suggestion sounds good to me. To run the example through:
    “NLM300” gets parsed to “NLM” + “300”
    Store Pair<Integer,String>(3, NLM) at Pair[0]
    Produce a Long of (index 0 * 10000000) + 300 = 300L
    Backtrack to the actual “CUI”: index = floor(300 / 10000000) = 0
    300L - (0 * 10000000) = 300L
    Pair[0] = NLM
    CUI = NLM + 300
In that case, do we need to store it as a Pair at all or is just storing the
prefix in a String[] sufficient?
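
A minimal Java sketch of the prefix-index scheme walked through above, offered as an illustration and not as the cTAKES implementation. It assumes the multiplier of 10000000 from the example, and it assumes the Integer in the Pair is the width of the numeral part, which the thread does not state. Under that assumption the width is what allows zero padding to be restored on decode, which a bare String[] of prefixes could not do for a CUI like C0000300. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an existing cTAKES class.
class PrefixIndexCuiCodec {
    // Multiplier taken from the worked example; numerals must stay below it.
    static final long MULTIPLIER = 10_000_000L;

    // Parallel lists: entry i holds a prefix and the width of its numeral part.
    private final List<String> prefixes = new ArrayList<>();
    private final List<Integer> widths = new ArrayList<>();

    long encode(String cui) {
        // Split at the first digit: "NLM300" -> "NLM" + "300".
        int split = 0;
        while (split < cui.length() && !Character.isDigit(cui.charAt(split))) {
            split++;
        }
        String prefix = cui.substring(0, split);
        String numerals = cui.substring(split);
        long number = Long.parseLong(numerals); // throws if empty or non-numeric
        if (number >= MULTIPLIER) {
            throw new IllegalArgumentException("Numeral part too large: " + cui);
        }
        int index = indexOf(prefix, numerals.length());
        if (index < 0) {
            index = prefixes.size();
            prefixes.add(prefix);
            widths.add(numerals.length());
        }
        return index * MULTIPLIER + number;
    }

    String decode(long code) {
        int index = (int) (code / MULTIPLIER); // floor(code / MULTIPLIER)
        String numerals = Long.toString(code % MULTIPLIER);
        StringBuilder cui = new StringBuilder(prefixes.get(index));
        // Restore the leading zeros that were dropped when parsing to a long.
        for (int i = numerals.length(); i < widths.get(index); i++) {
            cui.append('0');
        }
        return cui.append(numerals).toString();
    }

    private int indexOf(String prefix, int width) {
        for (int i = 0; i < prefixes.size(); i++) {
            if (prefixes.get(i).equals(prefix) && widths.get(i) == width) {
                return i;
            }
        }
        return -1;
    }
}
```

Encoding NLM300 first gives index 0 and the code 300L, matching the walk-through. Encoding C0000300 next gives index 1 and 10000300L, which decodes back to C0000300 only because the width 7 was stored with the prefix.
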
I’m happy to start working on this unless you have a preference for splitting
it out into multiple tasks.
Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
[email protected]
> On Jul 8, 2015, at 2:54 PM, Finan, Sean <[email protected]>
> wrote:
>
> By the way, in case you are wondering why it does this … the UMLS database
> that we use has roughly half a million CUIs. Storing CUIs in the various
> tables as longs takes up a lot less space than storing them as 8-character
> strings.
>
> From: britt fitch [mailto:[email protected]]
> Sent: Wednesday, July 08, 2015 2:23 PM
> To: [email protected]
> Subject: dictionary-look-fast fails to handle alternative CUIs
>
> This is largely directed to Sean but open to other feedback as well.
>
> The current fast lookup using a BSV parses the first field as “C” and up to 7
> numerals, padding with “0” as needed to reach that length when applicable
> [see CuiCodeUtil.getCuiCode(String)].
>
> The CUI string is then substring’d from 1 to len and parsed as a Long.
>
> This is producing issues with other related, but separate, ontologies
> (MedGen) where the bulk of concepts use UMLS CUIs but some additional
> concepts were created by the NCBI where no CUI previously existed.
> These MedGen-specific concepts are created with a prefix “CN” + 6 numerals,
> resulting in “N123456” failing to produce a Long.
>
> I wanted Sean’s thoughts on this, and some feedback on whether others are
> running into this issue and whether the community wants a solution that
> supports a CUI format beyond the standard C + 7 numerals.
>
> I’m happy to make these edits and check them in, whether that means updating
> the CuiCodeUtil class or creating an entirely new BSVConceptFactory, if that’s
> what makes the most sense.
>
> Thoughts?
