Shishir,

I believe our main table has about half a million rows, which isn't a lot,
but it has multiple dependent tables, several levels deep. The resulting XML
files were about 1 GB in total, split into around 15 files. We could feed
these files one at a time into Solr in as little as a few seconds per file
(tens of seconds on a slow machine), which is actually much less than the
database export took.
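
To make the file-at-a-time feed concrete, here is a minimal Python sketch
(not the stored procedure and curl setup described in this thread). The
function names, the dict-per-row input and the update URL are assumptions
for illustration; only the <add>/<doc>/<field> layout is Solr's XML update
format.

```python
import urllib.request
import xml.etree.ElementTree as ET


def key_ranges(min_id, max_id, parts):
    """Split [min_id, max_id] into at most `parts` contiguous inclusive ranges."""
    total = max_id - min_id + 1
    size = -(-total // parts)  # ceiling division
    return [(lo, min(lo + size - 1, max_id))
            for lo in range(min_id, max_id + 1, size)]


def rows_to_solr_xml(rows):
    """Render dict rows as a Solr <add> message.

    List or tuple values become repeated (multi-valued) fields.
    """
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in row.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                ET.SubElement(doc, "field", name=name).text = str(v)
    return ET.tostring(add, encoding="unicode")


def post_file(path, update_url="http://localhost:8983/solr/update"):
    """POST one exported XML file to Solr (the URL is an assumed default)."""
    with open(path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        update_url, data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"})
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Assuming keys run from 1 to 500000, key_ranges(1, 500000, 15) yields 15
partitions, in line with the roughly 15 files mentioned above. Unless
autocommit is configured, a <commit/> still has to be sent after the last
file.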

In your case, it may be the join that is slowing things down in the DIH.
Depending on your schema, you *may* be able to write the DIH query
differently, or you could create a [materialized] view and use it in the DIH
query.
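
To illustrate what "writing the query differently" can mean, here is a
small Python sketch that contrasts one child query per parent row with a
single joined query. It uses an in-memory SQLite database as a stand-in for
Oracle, and the item / item_tag tables are hypothetical; it is not DIH
configuration, just the access pattern.

```python
import sqlite3
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE item_tag (item_id INTEGER, tag TEXT);
    INSERT INTO item VALUES (1, 'first'), (2, 'second'), (3, 'third');
    INSERT INTO item_tag VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")


def load_chatty(conn):
    """One child query per parent row: N + 1 round trips to the database."""
    docs = []
    parents = conn.execute("SELECT id, title FROM item ORDER BY id").fetchall()
    for item_id, title in parents:
        tags = [t for (t,) in conn.execute(
            "SELECT tag FROM item_tag WHERE item_id = ? ORDER BY tag",
            (item_id,))]
        docs.append({"id": item_id, "title": title, "tag": tags})
    return docs


def load_joined(conn):
    """One joined query; child rows are regrouped per parent in Python."""
    rows = conn.execute("""
        SELECT i.id, i.title, t.tag
        FROM item i LEFT JOIN item_tag t ON t.item_id = i.id
        ORDER BY i.id, t.tag
    """)
    docs = []
    for (item_id, title), group in groupby(rows, key=lambda r: (r[0], r[1])):
        # LEFT JOIN yields a NULL tag for parents without children.
        tags = [tag for _, _, tag in group if tag is not None]
        docs.append({"id": item_id, "title": title, "tag": tags})
    return docs
```

Both functions build the same documents; the second does it in a single
round trip, which is the kind of saving a view or a rewritten query is
meant to give.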

Alain

On Tue, Oct 25, 2011 at 10:50 PM, Awasthi, Shishir <shishir.awas...@baml.com> wrote:

> Alain,
> How many rows did you export in this fashion and what was the
> performance?
>
> We do have Oracle as the underlying database, with data obtained from
> multiple tables. The data is only 1 level deep except for one table
> where we need to traverse a hierarchy to get information.
>
> How many XML files did you feed into SOLR one at a time?
>
> Shishir
>
> -----Original Message-----
> From: Alain Rogister [mailto:alain.rogis...@gmail.com]
> Sent: Tuesday, October 25, 2011 4:28 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Loading data to SOLR first time ( taking too long)
>
> Are you loading data from multiple tables? How many levels deep? After
> some experimenting, I gave up on the DIH: it generated very chatty (one
> row at a time) SQL against my schema, I experienced concurrency bugs
> unless multithreading was set to false, and I wasn't too confident in
> the incremental mode against a complex schema.
>
> Here is what worked for us (with Oracle):
>
> - create materialized views; make sure that you include a
> 'lastUpdateTime' field in the main table. This step may be unnecessary
> if your source data does not need any pre-processing / cleaning /
> reorganizing.
> - write a stored procedure that exports the data in Solr's XML format;
> parameterize it with a range of primary keys of your main table so that
> you can partition the export into manageable subsets. The XML format is
> very simple, no need for complex in-the-database XML functions to
> generate it.
> - use the database scheduler to run that procedure as a set of jobs; run
> a few of them in parallel.
> - use curl or wget or similar to feed the XML files into the index as
> soon as they are available.
> - compress and archive the XML files; they will come in handy when you
> need to provision another index instance and will save you a lot of
> exporting time.
> - make sure your stored procedure can work in incremental mode: e.g.
> export all records updated after a certain timestamp; then just push the
> resulting XML into Solr.
>
> Alain
>
> On Tue, Oct 25, 2011 at 9:56 PM, Awasthi, Shishir
> <shishir.awas...@baml.com> wrote:
>
> > Hi,
> >
> > I recently started working with Solr and loaded approximately 4 million
> > records into it using the DataImportHandler. It took 5 days to
> > complete this process.
> >
> >
> >
> > Can you please suggest how this can be improved? I would like this to
> > be done in less than 6 hrs.
> >
> >
> >
> > Thanks,
> >
> > Shishir
> >
> > ----------------------------------------------------------------------
> > This message w/attachments (message) is intended solely for the use of
> > the intended recipient(s) and may contain information that is
> > privileged, confidential or proprietary. If you are not an intended
> > recipient, please notify the sender, and then please delete and
> > destroy all copies and attachments, and be advised that any review or
> > dissemination of, or the taking of any action in reliance on, the
> > information contained in or attached to this message is prohibited.
> > Unless specifically indicated, this message is not an offer to sell or
> > a solicitation of any investment products or other financial product
> > or service, an official confirmation of any transaction, or an
> > official statement of Sender. Subject to applicable law, Sender may
> > intercept, monitor, review and retain e-communications (EC) traveling
> > through its networks/systems and may produce any such EC to
> > regulators, law enforcement, in litigation and as required by law.
> > The laws of the country of each sender/recipient may impact the
> > handling of EC, and EC may be archived, supervised and produced in
> > countries other than the country in which you are located. This
> > message cannot be guaranteed to be secure or free of errors or
> > viruses.
> >
> > References to "Sender" are references to any subsidiary of Bank of
> > America Corporation. Securities and Insurance Products: * Are Not FDIC
> > Insured * Are Not Bank Guaranteed * May Lose Value * Are Not a Bank
> > Deposit * Are Not a Condition to Any Banking Service or Activity * Are
> > Not Insured by Any Federal Government Agency. Attachments that are
> > part of this EC may have additional important disclosures and
> > disclaimers, which you should read.
> > This message is subject to terms available at the following link:
> > http://www.bankofamerica.com/emaildisclaimer. By messaging with Sender
> > you consent to the foregoing.
> >
