Re: HSQLDB out of memory with custom dictionary

Kathy Ferro Fri, 13 Oct 2017 12:27:29 -0700

Gandhi,

Thanks again for your response.


I am pretty new to cTAKES myself and my Java knowledge is not up to
date.

I am looking at the sample source code from
https://github.com/healthnlp/examples/tree/master/ctakes-temporal-demo
and in Pipeline.java it looks like it changes the dictionary name only.

       builder.add( AnalysisEngineFactory.createEngineDescription(
                DefaultJCasTermAnnotator.class,
                AbstractJCasTermAnnotator.PARAM_WINDOW_ANNOT_KEY,
                "org.apache.ctakes.typesystem.type.textspan.Sentence",
                JCasTermAnnotator.DICTIONARY_DESCRIPTOR_KEY,
                "org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml")
       );


1. Do I change to the MySQL driver in (dictionary).xml? The relevant
snippet is below.
2. What do I do with the blue highlight?
3. If I leave hsqldb, would that just use the hsqldb script file?
4. If I change it, do you have a sample?

Right now, I run the pipeline using the new dictionary with this option "-l
org/apache/ctakes/dictionary/lookup/fast/(dictionary name).xml" which loads
the dictionary into hsqldb memory.


         <property key="jdbcDriver" value="org.hsqldb.jdbcDriver"/>
         <property key="jdbcUrl" value="jdbc:hsqldb:file:src/main/resources/org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab/sno_rx_16ab"/>
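
For reference, here is a minimal sketch of what MySQL equivalents of these
properties might look like, written as a small Java helper that prints the
<property> lines. The keys (jdbcDriver, jdbcUrl, jdbcUser, jdbcPass) are the
ones that appear in the dictionary XML quoted later in this thread, and
CTAKES_DATA is the schema name from Matthew's MySQL schema. The host, port,
user name and password are placeholders, and the driver class assumes MySQL
Connector/J 5.x (Connector/J 8.x uses com.mysql.cj.jdbc.Driver). The class
and method names are made up for illustration, and the output has not been
verified against cTAKES here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch only: renders the JDBC <property> lines of a dictionary descriptor
 * for a MySQL database. This class is hypothetical, not part of cTAKES.
 */
final class MySqlDictionaryProperties {

    // Assumption: Connector/J 5.x. For Connector/J 8.x use com.mysql.cj.jdbc.Driver.
    static final String MYSQL_DRIVER = "com.mysql.jdbc.Driver";

    /** Builds a MySQL JDBC URL from caller-supplied host, port and schema. */
    static String jdbcUrl(String host, int port, String schema) {
        return "jdbc:mysql://" + host + ":" + port + "/" + schema;
    }

    /** Escapes characters that may not appear raw inside an XML attribute. */
    static String escapeAttribute(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("\"", "&quot;");
    }

    /** Formats one <property key="..." value="..."/> line. */
    static String propertyLine(String key, String value) {
        return "<property key=\"" + escapeAttribute(key)
                + "\" value=\"" + escapeAttribute(value) + "\"/>";
    }

    /** Renders the four JDBC properties, one per line, in a fixed order. */
    static String render(String host, int port, String schema,
                         String user, String pass) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("jdbcDriver", MYSQL_DRIVER);
        props.put("jdbcUrl", jdbcUrl(host, port, schema));
        props.put("jdbcUser", user);
        props.put("jdbcPass", pass);

        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append(propertyLine(e.getKey(), e.getValue())).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder connection details; replace with your own.
        System.out.print(render("localhost", 3306, "CTAKES_DATA",
                                "ctakes_user", "CHANGE_ME"));
    }
}
```

The MySQL driver jar would also need to be on the classpath when the
pipeline runs, since the driver class is loaded by name.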


I very much appreciate your help.
Kathy



On Wed, Oct 11, 2017 at 5:14 PM, Kathy Ferro <[email protected]>
wrote:

> Gandhi and Matthew,
>
> Thank you for the information.
>
> Kathy
>
> On Wed, Oct 11, 2017 at 1:35 AM, Gandhi Rajan Natarajan <
> [email protected]> wrote:
>
>> Hi Matthew,
>>
>> Please check out my response to Kathy. I feel that has the required info
>> to start off. Please let me know if you are looking for any specific
>> additional info.
>>
>> Regards,
>> Gandhi
>>
>>
>> -----Original Message-----
>> From: Matthew Vita [mailto:[email protected]]
>> Sent: Wednesday, October 11, 2017 11:00 AM
>> To: [email protected]
>> Subject: Re: HSQLDB out of memory with custom dictionary
>>
>> Hi Kathy and Gandhi,
>>
>> I started to put together a more formal solution for this here:
>> https://github.com/GoTeamEpsilon/cTAKES-HSQLDB-to-MySQL-Dictionary - It
>> is not perfect but it makes things a bit easier. I was able to load in
>> millions of records into MySQL, which is awesome!
>>
>> *If you have a non-trivial dictionary, chances are you will exhaust
>> HSQLDB's capabilities. By using this solution, you will have a MySQL schema
>> filled up with what would have been the HSQLDB data.*
>>
>> *This solution uses lazy lists and streams to keep memory usage low when
>> the script files are huge.*
>>
>> I have not got it working with the XML jdbc configuration yet so if you
>> (or anyone else) could share an example that would be amazing.
>>
>> Thanks,
>>
>> Matthew Vita
>> www.matthewvita.com
>>
>> On Tue, Oct 10, 2017 at 9:57 PM, Gandhi Rajan Natarajan <
>> [email protected]> wrote:
>>
>> > Hi Kathy,
>> >
>> > Good to hear from you. Please find the response below.
>> >
>> > NOTE: This is based on my experience with cTAKES so far. Please
>> > correct me if anyone finds the answers to be wrong.
>> >
>> > 1. Does it matter what the name of the database is?
>> >
>> > The name of the database really doesn’t matter. But the name you have
>> > created should be mapped in the 'jdbcUrl' property of the XML file
>> > generated by the Dictionary GUI.
>> >
>> > 2. What configuration file do I change to switch to use the new
>> > database?
>> >
>> > If you are using the example downloaded from
>> > https://github.com/healthnlp/examples/tree/master/ctakes-temporal-demo
>> > then in Pipeline.java you have to reference the XML file generated by
>> > the Dictionary GUI instead of 'sno_rx_16ab.xml'.
>> >
>> > If you want to use the new database for CVD, then you have to change
>> > 'DEFAULT_DICT_DESC_PATH' in JCasTermAnnotator.java to point to the new
>> > XML file, rebuild the ctakes-dictionary-lookup-fast module and use the
>> > resulting jar file.
>> >
>> > 3. Do you think I can use SQL server instead of MySQL?  My SQL seems
>> > to run faster.
>> >
>> > This choice is up to the user, and I can't comment on the performance
>> > comparison as I have no data on this.
>> >
>> >
>> >
>> > Regards,
>> > Gandhi
>> >
>> >
>> > -----Original Message-----
>> > From: Kathy Ferro [mailto:[email protected]]
>> > Sent: Tuesday, October 10, 2017 9:26 PM
>> > To: [email protected]
>> > Subject: Re: HSQLDB out of memory with custom dictionary
>> >
>> > Gandhi,
>> >
>> > My name is Kathy Ferro.
>> >
>> > Matthew and I are trying to accomplish the same thing.  I got the
>> > scripts loaded into both SQL server and MySQL.  I did it in two ways.
>> > 1. Manually modified the scripts to be DB specific and ran them in a
>> > query analyzer window as you described.  Works fine if the data is
>> > small enough. For a bigger file, it locks up.
>> > 2. I wrote a C# program to read the scripts and insert the records one
>> > by one to re-load them.
>> >
>> > My questions for you are:
>> >
>> > 2. What configuration file do I change to switch to use the new
>> > database?
>> > 3. Do you think I can use SQL server instead of MySQL?  My SQL seems
>> > to run faster.
>> >
>> > Thanks
>> > Kathy
>> >
>> >
>> >
>> >
>> > On Tue, Oct 10, 2017 at 2:34 AM, Gandhi Rajan Natarajan <
>> > [email protected]> wrote:
>> >
>> > > Hi Matthew,
>> > >
>> > > The SQL looks fine. The only additional table I'm using apart from
>> > > the tables mentioned below is the MDR table (MEDDRA related), and I
>> > > don’t use the AIR table.
>> > >
>> > > Do you really think you need a Java program to convert those insert
>> > > statements to work with MySQL? I just opened the script file in a
>> > > text editor like EditPlus, searched for `[\)]\n` and replaced it with
>> > > `);\n` using the regex find-and-replace-all option, and that was all
>> > > the scripts needed.
>> > >
>> > > The only other thing is that you can load the data in parallel by
>> > > splitting the script files as mentioned earlier, which saves time,
>> > > and maybe you can write a Java program to split the file. This is
>> > > the easiest approach, I feel.
>> > >
>> > > Regards,
>> > > Gandhi
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Matthew Vita [mailto:[email protected]]
>> > > Sent: Tuesday, October 10, 2017 10:47 AM
>> > > To: [email protected]
>> > > Subject: Re: HSQLDB out of memory with custom dictionary
>> > >
>> > > Gandhi,
>> > >
>> > > I really appreciate this information. I have started working out the
>> > > schema and plan on writing a program that will automatically prepare
>> > > a script to work with MySQL. Work in progress. Can you do a quick
>> > > review of my MySQL schema so far?
>> > >
>> > > CREATE SCHEMA CTAKES_DATA;
>> > >
>> > > use CTAKES_DATA;
>> > >
>> > > CREATE TABLE CUI_TERMS (
>> > >   CUI BIGINT NOT NULL,
>> > >   RINDEX INT(128) NOT NULL,
>> > >   TCOUNT INT(128) NOT NULL,
>> > >   TEXT VARCHAR(255) NOT NULL,
>> > >   RWORD VARCHAR(48) NOT NULL
>> > > );
>> > > CREATE INDEX IDX_CUI_TERMS ON CUI_TERMS (RWORD);
>> > >
>> > > CREATE TABLE TUI (
>> > >   CUI BIGINT NOT NULL,
>> > >   TUI INT(128) NOT NULL
>> > > );
>> > > CREATE INDEX IDX_TUI ON TUI (CUI);
>> > >
>> > > CREATE TABLE PREFTERM (
>> > >   CUI BIGINT NOT NULL,
>> > >   PREFTERM VARCHAR(511) NOT NULL
>> > > );
>> > > CREATE INDEX IDX_PREFTERM ON PREFTERM (CUI);
>> > >
>> > > CREATE TABLE RXNORM (
>> > >   CUI BIGINT NOT NULL,
>> > >   RXNORM BIGINT NOT NULL
>> > > );
>> > > CREATE INDEX IDX_RXNORM ON RXNORM (CUI);
>> > >
>> > > CREATE TABLE SNOMEDCT_US (
>> > >   CUI BIGINT NOT NULL,
>> > >   SNOMEDCT_US BIGINT NOT NULL
>> > > );
>> > > CREATE INDEX IDX_SNOMEDCT_US ON SNOMEDCT_US (CUI);
>> > >
>> > > Quick question: do you use the AIR table?
>> > >
>> > > Thanks,
>> > >
>> > > Matthew Vita
>> > > www.matthewvita.com
>> > >
>> > > On Mon, Oct 9, 2017 at 1:14 AM, Gandhi Rajan Natarajan <
>> > > [email protected]> wrote:
>> > >
>> > > > Hi Matthew,
>> > > >
>> > > > First I would like to tell you that I am a newbie in cTAKES too.
>> > > > Unfortunately I don’t find any documentation on this. I have
>> > > > followed a crude way to accomplish it, as this is a one-time
>> > > > activity. This is what I did:
>> > > >
>> > > > 1) Used the dictionary generator GUI to generate Snomed, RxNorm and
>> > > > MEDDRA dictionary data, which resulted in a '.script' file under my
>> > > > <ctakes_home>\resources\org\apache\ctakes\dictionary\lookup\fast\<project_name>
>> > > > folder.
>> > > > 2) The '.script' file has HSQLDB specific queries. I removed the
>> > > > HSQLDB statements I did not need from the file and converted the
>> > > > rest to MySQL specific queries manually.
>> > > > 3) I added semicolons at the end of each line in the script using a
>> > > > text editor and split the file into five parts. Then I ran those
>> > > > five script files in five different mysql command lines. It took me
>> > > > approximately 4 hours to pump all the data into the MySQL DB.
>> > > >
>> > > > As I mentioned earlier, I'm not sure whether it is the right way to
>> > > > proceed. But with no documentation available for MySQL DB with
>> > > > cTAKES, this is the approach that worked for me. Hope it will be
>> > > > helpful.
>> > > >
>> > > > Regards,
>> > > > Gandhi
>> > > >
>> > > >
>> > > > -----Original Message-----
>> > > > From: Matthew Vita [mailto:[email protected]]
>> > > > Sent: Monday, October 09, 2017 10:41 AM
>> > > > To: [email protected]
>> > > > Subject: Re: HSQLDB out of memory with custom dictionary
>> > > >
>> > > > Gandhi,
>> > > >
>> > > > Thank you for the reply. Do you have any documentation on how to
>> > > > accomplish this?
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Matthew Vita
>> > > > www.matthewvita.com
>> > > >
>> > > > On Sun, Oct 8, 2017 at 3:14 AM, Gandhi Rajan Natarajan <
>> > > > [email protected]> wrote:
>> > > >
>> > > > > Hi Matthew,
>> > > > >
>> > > > > I feel using a MySQL DB would be a better idea than using
>> > > > > in-memory HSQLDB. In fact, this also comes in handy when you are
>> > > > > planning to deploy cTAKES as a web application, as in our case.
>> > > > >
>> > > > > Regards,
>> > > > > Gandhi
>> > > > >
>> > > > > -----Original Message-----
>> > > > > From: Matthew Vita [mailto:[email protected]]
>> > > > > Sent: Sunday, October 08, 2017 6:02 AM
>> > > > > To: [email protected]
>> > > > > Subject: HSQLDB out of memory with custom dictionary
>> > > > >
>> > > > > Hi Sean, Tim, cTAKES Community,
>> > > > >
>> > > > > I have put together what I am considering a pretty standard
>> > > > > dictionary with sources from the following:
>> > > > >
>> > > > >
>> > > > >    - MEDLINEPLUS
>> > > > >    - MSH
>> > > > >    - NCI
>> > > > >    - NDFRT
>> > > > >    - CHV
>> > > > >    - CSP
>> > > > >    - ICPC2P
>> > > > >    - MEDCIN
>> > > > >    - SNOMED
>> > > > >    - RXNORM
>> > > > >    - ICD10
>> > > > >
>> > > > >
>> > > > > However, when copied over to cTAKES (handled by the handy
>> > > > > Dictionary Creator GUI) HSQLDB runs out of memory.
>> > > > >
>> > > > > This is my first experience with HSQLDB so you’ll have to excuse
>> > > > > my limited knowledge here. I do understand that it can run
>> > > > > either in-memory or on disk, but I’m not sure how to configure
>> > > > > this.
>> > > > >
>> > > > > Here is how I am connecting to it:
>> > > > >
>> > > > >
>> > > > >   <dictionary>
>> > > > >     <name>sno_rx_16abTerms</name>
>> > > > >     <implementationName>org.apache.ctakes.dictionary.lookup2.dictionary.UmlsJdbcRareWordDictionary</implementationName>
>> > > > >     <properties>
>> > > > >       <property key="jdbcDriver" value="org.hsqldb.jdbcDriver" />
>> > > > >       <property key="jdbcUrl" value="jdbc:hsqldb:file:resources/org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab/sno_rx_16ab" />
>> > > > >       <property key="jdbcUser" value="sa" />
>> > > > >       <property key="jdbcPass" value="" />
>> > > > >       <property key="rareWordTable" value="cui_terms" />
>> > > > >       <property key="umlsUrl" value="https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser" />
>> > > > >       <property key="umlsVendor" value="NLM-6515182895" />
>> > > > >       <property key="umlsUser" value="CHANGE_ME" />
>> > > > >       <property key="umlsPass" value="CHANGE_ME" />
>> > > > >     </properties>
>> > > > >   </dictionary>
>> > > > >
>> > > > >   <dictionary>
>> > > > >
>> > > > >
>> > > > >
>> > > > > Can I configure HSQLDB to be used on disk? If this is not a good
>> > > > > approach, can I spin up MySQL in its place?
>> > > > >
>> > > > >
>> > > > > Sorry if this has been asked before.
>> > > > >
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Matthew Vita
>> > > > > www.matthewvita.com
>> > > > > This email and any files transmitted with it are confidential
>> > > > > and intended solely for the use of the individual or entity to
>> > > > > whom they are addressed.
>> > > > > If you are not the named addressee you should not disseminate,
>> > > > > distribute or copy this e-mail. Please notify the sender or
>> > > > > system manager by email immediately if you have received this
>> > > > > e-mail by mistake and delete this e-mail from your system. If
>> > > > > you are not the intended recipient you are notified that
>> > > > > disclosing, copying, distributing or taking any action in
>> > > > > reliance on the contents of this information is strictly
>> > > > > prohibited and against the law.
>> > > > >
>>
>
>
