Re: HSQLDB out of memory with custom dictionary

Kathy Ferro Wed, 11 Oct 2017 14:15:00 -0700

Gandhi and Matthew,

Thank you for the information.


Kathy

On Wed, Oct 11, 2017 at 1:35 AM, Gandhi Rajan Natarajan <
[email protected]> wrote:

> Hi Matthew,
>
> Please check out my response to Kathy. I feel that it has the required info
> to start off. Please let me know if you are looking for any specific
> additional info.
>
> Regards,
> Gandhi
>
>
> -----Original Message-----
> From: Matthew Vita [mailto:[email protected]]
> Sent: Wednesday, October 11, 2017 11:00 AM
> To: [email protected]
> Subject: Re: HSQLDB out of memory with custom dictionary
>
> Hi Kathy and Gandhi,
>
> I started to put together a more formal solution for this here:
> https://github.com/GoTeamEpsilon/cTAKES-HSQLDB-to-MySQL-Dictionary - It
> is not perfect but it makes things a bit easier. I was able to load in
> millions of records into MySQL, which is awesome!
>
> *If you have a non-trivial dictionary, chances are you will exhaust
> HSQLDB's capabilities. By using this solution, you will have a MySQL schema
> filled up with what would have been the HSQLDB data.*
>
> *This solution uses lazy lists and streams to keep memory usage low when
> the script files are huge.*
>
> I have not got it working with the XML jdbc configuration yet so if you
> (or anyone else) could share an example that would be amazing.
>
> Thanks,
>
> Matthew Vita
> www.matthewvita.com
>
> On Tue, Oct 10, 2017 at 9:57 PM, Gandhi Rajan Natarajan <
> [email protected]> wrote:
>
> > Hi Kathy,
> >
> > Good to hear from you. Please find the response below.
> >
> > NOTE: This is based on my experience with cTAKES so far. Please
> > correct me if anyone finds the answers to be wrong.
> >
> > 1. Does it matter what the name of the database is?
> >
> > The name of the database really doesn't matter. But the name you have
> > created should be mapped in the 'jdbcUrl' property of the XML file
> > generated by the Dictionary GUI.
> >
> > 2. What configuration file do I change to switch to use the new database?
> >
> > If you are using the example downloaded from
> > https://github.com/healthnlp/examples/tree/master/ctakes-temporal-demo ,
> > then in Pipeline.java you have to reference the XML file generated by the
> > Dictionary GUI instead of 'sno_rx_16ab.xml'.
> >
> > If you want to use the new database for CVD, then you have to change
> > 'DEFAULT_DICT_DESC_PATH' in JCasTermAnnotator.java to point to the new
> > XML file, rebuild the ctakes-dictionary-lookup-fast module, and use the
> > resulting jar file.
> >
> > 3. Do you think I can use SQL server instead of MySQL?  My SQL seems
> > to run faster.
> >
> > This choice is up to the user. I can't comment on the performance
> > comparison, as I have no information on this.
> >
> >
> >
> > Regards,
> > Gandhi
> >
> >
> > -----Original Message-----
> > From: Kathy Ferro [mailto:[email protected]]
> > Sent: Tuesday, October 10, 2017 9:26 PM
> > To: [email protected]
> > Subject: Re: HSQLDB out of memory with custom dictionary
> >
> > Gandhi,
> >
> > My name is Kathy Ferro.
> >
> > Matthew and I are trying to accomplish the same thing.  I got the scripts
> > loaded into both SQL Server and MySQL.  I did it in two ways.
> > 1. Manually modified the scripts for each specific DB and ran them in a
> > query analyzer window as you described.  This works fine if the data is
> > small enough. For bigger files, it locks up.
> > 2. Wrote a C# program to read the scripts and insert the records one by
> > one to re-load them.
> >
> > My questions for you are:
> >
> > 1. Does it matter what the name of the database is?
> > 2. What configuration file do I change to switch to use the new database?
> > 3. Do you think I can use SQL server instead of MySQL?  My SQL seems
> > to run faster.
> >
> > Thanks
> > Kathy
> >
> >
> >
> >
> > On Tue, Oct 10, 2017 at 2:34 AM, Gandhi Rajan Natarajan <
> > [email protected]> wrote:
> >
> > > Hi Matthew,
> > >
> > > The SQL looks fine. The only additional table I'm using apart from
> > > the tables mentioned below is the MDR table (MEDDRA related), and I
> > > don’t use the AIR table.
> > >
> > > Do you really think you need a Java program to convert those insert
> > > statements to work with MySQL? I just opened the script file in a text
> > > editor like EditPlus, did a find for `[\)]\n`, and replaced it
> > > with `);\n` using the find-and-replace-all option with regex, and we
> > > are done with the scripts.
> > >
> > > The only thing is that you can load the data in parallel by splitting
> > > the script files as mentioned earlier, which saves time, and maybe you
> > > can write a Java program to split the file. This is the easiest
> > > approach, I feel.
> > >
> > > Regards,
> > > Gandhi
> > >
> > >
> > > -----Original Message-----
> > > From: Matthew Vita [mailto:[email protected]]
> > > Sent: Tuesday, October 10, 2017 10:47 AM
> > > To: [email protected]
> > > Subject: Re: HSQLDB out of memory with custom dictionary
> > >
> > > Gandhi,
> > >
> > > I really appreciate this information. I have started working out the
> > > schema and plan on writing a program that will automatically prepare
> > > a script to work with MySQL. Work in progress. Can you do a quick
> > > review of my MySQL schema so far?
> > >
> > > CREATE SCHEMA CTAKES_DATA;
> > >
> > > use CTAKES_DATA;
> > >
> > > CREATE TABLE CUI_TERMS (
> > >   CUI BIGINT NOT NULL,
> > >   RINDEX INT(128) NOT NULL,
> > >   TCOUNT INT(128) NOT NULL,
> > >   TEXT VARCHAR(255) NOT NULL,
> > >   RWORD VARCHAR(48) NOT NULL
> > > );
> > > CREATE INDEX IDX_CUI_TERMS ON CUI_TERMS (RWORD);
> > >
> > > CREATE TABLE TUI (
> > >   CUI BIGINT NOT NULL,
> > >   TUI INT(128) NOT NULL
> > > );
> > > CREATE INDEX IDX_TUI ON TUI (CUI);
> > >
> > > CREATE TABLE PREFTERM (
> > >   CUI BIGINT NOT NULL,
> > >   PREFTERM VARCHAR(511) NOT NULL
> > > );
> > > CREATE INDEX IDX_PREFTERM ON PREFTERM (CUI);
> > >
> > > CREATE TABLE RXNORM (
> > >   CUI BIGINT NOT NULL,
> > >   RXNORM BIGINT NOT NULL
> > > );
> > > CREATE INDEX IDX_RXNORM ON RXNORM (CUI);
> > >
> > > CREATE TABLE SNOMEDCT_US (
> > >   CUI BIGINT NOT NULL,
> > >   SNOMEDCT_US BIGINT NOT NULL
> > > );
> > > CREATE INDEX IDX_SNOMEDCT_US ON SNOMEDCT_US (CUI);
> > >
> > > Quick question: do you use the AIR table?
> > >
> > > Thanks,
> > >
> > > Matthew Vita
> > > www.matthewvita.com
> > >
> > > On Mon, Oct 9, 2017 at 1:14 AM, Gandhi Rajan Natarajan <
> > > [email protected]> wrote:
> > >
> > > > Hi Matthew,
> > > >
> > > > First, I would like to tell you that even I am a newbie in cTAKES.
> > > > Unfortunately, I couldn't find any documentation on this. I have
> > > > followed a crude way to accomplish it, as this is a one-time activity.
> > > > This is what I did:
> > > >
> > > > 1) Used the dictionary generator GUI to generate Snomed, RxNorm and
> > > > MEDDRA dictionary data, which resulted in a '.script' file under my
> > > > <ctakes_home>\resources\org\apache\ctakes\dictionary\lookup\fast\<project_name>
> > > > folder.
> > > > 2) The '.script' file has HSQLDB-specific queries. I removed the
> > > > HSQLDB-specific statements I did not need from the file and manually
> > > > converted the rest to MySQL-specific queries.
> > > > 3) I added semicolons at the end of each line in the script
> > > > using a text editor and split the file into five parts. Then I
> > > > ran those five script files in five different mysql command
> > > > lines. It took me approximately 4 hours to pump all the data into the
> > > > MySQL DB.
> > > >
> > > > I'm not sure whether this is the right way to proceed, as I mentioned
> > > > earlier. But with no documentation available for MySQL DB with
> > > > cTAKES, this is the approach that worked for me. Hope it will be
> > > > helpful.
> > > >
> > > > Regards,
> > > > Gandhi
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Matthew Vita [mailto:[email protected]]
> > > > Sent: Monday, October 09, 2017 10:41 AM
> > > > To: [email protected]
> > > > Subject: Re: HSQLDB out of memory with custom dictionary
> > > >
> > > > Gandhi,
> > > >
> > > > Thank you for the reply. Do you have any documentation on how to
> > > > accomplish this?
> > > >
> > > > Thanks,
> > > >
> > > > Matthew Vita
> > > > www.matthewvita.com
> > > >
> > > > On Sun, Oct 8, 2017 at 3:14 AM, Gandhi Rajan Natarajan <
> > > > [email protected]> wrote:
> > > >
> > > > > > Hi Matthew,
> > > > > >
> > > > > > I feel using a MySQL DB would be a better idea than using in-memory
> > > > > > HSQLDB. In fact, this also comes in handy when you are planning to
> > > > > > deploy cTAKES as a web application, as in our case.
> > > > >
> > > > > Regards,
> > > > > Gandhi
> > > > >
> > > > > -----Original Message-----
> > > > > From: Matthew Vita [mailto:[email protected]]
> > > > > Sent: Sunday, October 08, 2017 6:02 AM
> > > > > To: [email protected]
> > > > > Subject: HSQLDB out of memory with custom dictionary
> > > > >
> > > > > Hi Sean, Tim, cTAKES Community,
> > > > >
> > > > > I have put together what I am considering a pretty standard
> > > > > dictionary with sources from the following:
> > > > >
> > > > >
> > > > >    -
> > > > >
> > > > >    MEDLINEPLUS
> > > > >    -
> > > > >
> > > > >    MSH
> > > > >    -
> > > > >
> > > > >    NCI
> > > > >    -
> > > > >
> > > > >    NDFRT
> > > > >    -
> > > > >
> > > > >    CHV
> > > > >    -
> > > > >
> > > > >    CSP
> > > > >    -
> > > > >
> > > > >    ICPC2P
> > > > >    -
> > > > >
> > > > >    MEDCIN
> > > > >    -
> > > > >
> > > > >    SNOMED
> > > > >    -
> > > > >
> > > > >    RXNORM
> > > > >    -
> > > > >
> > > > >    ICD10
> > > > >
> > > > >
> > > > > However, when copied over to cTAKES (handled by the handy
> > > > > Dictionary Creator GUI) HSQLDB runs out of memory.
> > > > >
> > > > > > This is my first experience with HSQLDB so you’ll have to excuse
> > > > > > my limited knowledge here. I do understand that it can run
> > > > > > either in-memory or on disk, but I’m not sure how to configure
> > > > > > this.
> > > > >
> > > > > Here is how I am connecting to it:
> > > > >
> > > > >
> > > > > >   <dictionary>
> > > > > >     <name>sno_rx_16abTerms</name>
> > > > > >     <implementationName>org.apache.ctakes.dictionary.lookup2.dictionary.UmlsJdbcRareWordDictionary</implementationName>
> > > > > >     <properties>
> > > > > >       <property key="jdbcDriver" value="org.hsqldb.jdbcDriver" />
> > > > > >       <property key="jdbcUrl" value="jdbc:hsqldb:file:resources/org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab/sno_rx_16ab" />
> > > > > >       <property key="jdbcUser" value="sa" />
> > > > > >       <property key="jdbcPass" value="" />
> > > > > >       <property key="rareWordTable" value="cui_terms" />
> > > > > >       <property key="umlsUrl" value="https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser" />
> > > > > >       <property key="umlsVendor" value="NLM-6515182895" />
> > > > > >       <property key="umlsUser" value="CHANGE_ME" />
> > > > > >       <property key="umlsPass" value="CHANGE_ME" />
> > > > > >     </properties>
> > > > > >   </dictionary>
> > > > >
> > > > >   <dictionary>
> > > > >
> > > > >
> > > > >
> > > > > Can I configure HSQLDB to be used on disk? If this is not a good
> > > > > approach, can I spin up MySQL in its place?
> > > > >
> > > > >
> > > > > > Sorry if this has been asked before.
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Matthew Vita
> > > > > www.matthewvita.com
> > > > > > This email and any files transmitted with it are confidential
> > > > > > and intended solely for the use of the individual or entity to
> > > > > > whom they are addressed.
> > > > > > If you are not the named addressee you should not disseminate,
> > > > > > distribute or copy this e-mail. Please notify the sender or
> > > > > > system manager by email immediately if you have received this
> > > > > > e-mail by mistake and delete this e-mail from your system. If
> > > > > > you are not the intended recipient you are notified that
> > > > > > disclosing, copying, distributing or taking any action in
> > > > > > reliance on the contents of this information is strictly
> > > > > > prohibited and against the law.
> > > > >
>
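
A minimal sketch of the semicolon fix Gandhi describes in the thread (find `)` at end of line, replace with `);`), done line by line so that a large '.script' file never has to fit in memory. This is not taken from Matthew's repository; the function names and the sample INSERT rows are hypothetical, and it assumes each statement in the script occupies exactly one line. Unlike the editor regex, it also normalizes `\r\n` line endings to `\n`.

```python
import io
from typing import Iterable, Iterator, TextIO


def terminate_statements(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line, appending ';' to any line that ends with ')'.

    Lines are handled one at a time (a generator), so memory use does
    not grow with the size of the script.
    """
    for line in lines:
        body = line.rstrip("\r\n")  # drop the line ending, keep other spaces
        if body.endswith(")"):
            yield body + ";\n"
        else:
            yield body + "\n"


def convert(src: TextIO, dst: TextIO) -> int:
    """Copy src to dst with terminated statements; return lines written."""
    count = 0
    for out_line in terminate_statements(src):
        dst.write(out_line)
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical sample rows; a real run would pass open(...) file objects.
    sample = io.StringIO(
        "INSERT INTO CUI_TERMS VALUES(1,0,1,'aspirin','aspirin')\n"
        "INSERT INTO TUI VALUES(1,121)\n"
    )
    result = io.StringIO()
    convert(sample, result)
    print(result.getvalue(), end="")
```

The HSQLDB-specific statements that Gandhi removed by hand are not filtered here; lines that do not end in `)` pass through unchanged and would still need that manual review.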

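The file-splitting step from the thread (five parts, loaded from five mysql command lines) could be sketched as follows. This is only an illustration under stated assumptions: the name `split_round_robin` is hypothetical, the rows are made up, and dealing lines out in turn is safe only if every line is an independent INSERT and the tables were already created beforehand.

```python
import io
from typing import Iterable, List, TextIO


def split_round_robin(lines: Iterable[str], parts: List[TextIO]) -> List[int]:
    """Deal lines out to the given writers in turn; return per-part counts.

    The input is consumed lazily, so only one line is held at a time.
    """
    if not parts:
        raise ValueError("at least one output part is required")
    counts = [0] * len(parts)
    for i, line in enumerate(lines):
        target = i % len(parts)  # 0, 1, ..., n-1, 0, 1, ...
        parts[target].write(line)
        counts[target] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical rows; a real run would read the converted script file
    # and write to five files opened with open(..., "w").
    statements = (f"INSERT INTO TUI VALUES({n},121);\n" for n in range(7))
    outputs = [io.StringIO() for _ in range(5)]
    print(split_round_robin(statements, outputs))  # prints [2, 2, 1, 1, 1]
```

Round-robin distribution keeps the parts close to equal in size without counting the lines first, which matters when the script is too large to scan twice comfortably.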