It may or may not be helpful, but there's a similar class of problem that is frequently solved either by stored procedures or by running the query on a schedule and storing the results... It doesn't matter whether the endpoint for the data is Solr or somewhere else.
The problem is long-running queries that are extremely complex and stress the database too heavily. The solution is to de-normalize the data you need: store it in that form, and then the query gets really fast... sort of like a data warehouse type of thing. (Don't shoot, I know this isn't data warehousing...) Postgres even has something called an "automatically updatable view" that might serve, if that's your back end.

Anyway, the underlying strategy is to find a way to flatten your data before turning it into Solr documents, either by getting it out all the time on shorter-running queries into some kind of store (Kafka, a text file, whatever), or by using some feature of the database (stored procs writing to a summary table, an automatically updatable view, or similar). That way, when you run your query, you run it against the "flattened" data, which is ideally all in one table, so all the complexity of joins etc. is washed away and things ought to run pretty fast.

The cost, of course, is a huge table with tons of duplicated data... only you can say if that's worth it. I did this at my last gig, and we truncated the table every 2 weeks to keep it from growing forever. In case it's helpful...

PS - if you have the resources, a duplicate database can really help here too. Again, my experience is mostly with Postgres, which allows a "warm" standby to be live. We frequently used this for executive queries that were using the database like a data warehouse because they were so time-consuming. It kept the load off production.

On Thu, May 26, 2016 at 12:18 PM, Erick Erickson <[email protected]> wrote:

> Forgot to add... sometimes really hammering at the SQL query in DIH
> can be fruitful. Can you make a huge, monster query that's faster than
> the sub-queries?
>
> I've also seen people run processes on the DB that move all the
> data into a temporary place, making use of all the nifty stuff you
> can do there, and then use DIH on _that_. Or the view.
>
> All that said, I generally prefer using SolrJ if DIH doesn't do the job
> after a day or two of fiddling; it gives more control.
>
> Good Luck!
> Erick
>
> On Thu, May 26, 2016 at 11:02 AM, John Blythe <[email protected]> wrote:
> > oo gotcha. cool, will make sure to check it out and bounce any related
> > questions through here.
> >
> > thanks!
> >
> > best,
> >
> > --
> > *John Blythe*
> > Product Manager & Lead Developer
> >
> > 251.605.3071 | [email protected]
> > www.curvolabs.com
> >
> > 58 Adams Ave
> > Evansville, IN 47713
> >
> > On Thu, May 26, 2016 at 1:45 PM, Erick Erickson <[email protected]> wrote:
> >
> >> Solr commits aren't the issue, I'd guess. All the time is
> >> probably being spent getting the data from MySQL.
> >>
> >> I've had some luck writing to Solr from a DB through a
> >> SolrJ program; here's a place to get started:
> >> searchhub.org/2012/02/14/indexing-with-solrj/
> >> You can peel out the Tika bits pretty easily, I should think.
> >>
> >> One technique I've used is to cache
> >> some of the DB tables in Java's memory to keep
> >> from having to do the secondary lookup(s). This only
> >> really works if the "secondary table" is small enough to fit in
> >> Java's memory, of course. You can do some creative
> >> things with caching partial tables if you can sort appropriately.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, May 26, 2016 at 9:01 AM, John Blythe <[email protected]> wrote:
> >> > hi all,
> >> >
> >> > i've got layered entities in my solr import. it's calling on some
> >> > transactional data from a MySQL instance.
> >> > there are two fields that are
> >> > used to then look up other information from other tables via their related
> >> > UIDs, one of which has its own child entity w/ yet another select statement
> >> > to grab up more data.
> >> >
> >> > it fetches at about 120/s but processes at ~50-60/s. we currently only have
> >> > close to 500k records, but it's growing quickly and thus is becoming
> >> > increasingly painful to make modifications due to the reimport that needs
> >> > to then occur.
> >> >
> >> > i feel like i'd seen some threads regarding commits of new data,
> >> > master/slave, or solrcloud/sharding that could help in some ways related to
> >> > this but as of yet can't scrounge them up w/ my searches (ironic :p).
> >> >
> >> > can someone help by pointing me to some good material related to this sort
> >> > of thing?
> >> >
> >> > thanks-
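For readers not familiar with DIH, the "layered entities" John describes look roughly like the sketch below: each nested `<entity>` fires one extra SQL query per parent row, which is exactly why the processing rate lags the fetch rate. This is a hypothetical `data-config.xml` fragment; the table and column names (`transactions`, `vendor_uid`, etc.) are made up for illustration.

```xml
<!-- Hypothetical DIH config: nested entities mean per-row sub-queries. -->
<document>
  <entity name="transaction"
          query="SELECT id, vendor_uid, product_uid FROM transactions">
    <entity name="vendor"
            query="SELECT name FROM vendors WHERE id = '${transaction.vendor_uid}'"/>
    <entity name="product"
            query="SELECT descr FROM products WHERE id = '${transaction.product_uid}'">
      <entity name="category"
              query="SELECT label FROM categories WHERE product_id = '${transaction.product_uid}'"/>
    </entity>
  </entity>
</document>
```

With ~500k parent rows, the two inner entities alone add roughly a million extra round trips per full import, which is the cost the suggestions in this thread are trying to eliminate.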
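The flattening strategy described at the top of the thread can be sketched in SQL. This is a minimal illustration only, assuming hypothetical tables `orders`, `vendors`, and `products`; a real setup would schedule the refresh via cron or a stored proc, as suggested above.

```sql
-- Build the "flattened" table once, on a schedule, so the indexing
-- query reads a single wide table with no joins or sub-queries.
CREATE TABLE flat_transactions AS
SELECT o.id,
       o.purchase_date,
       v.name  AS vendor_name,
       p.descr AS product_descr
FROM   orders o
JOIN   vendors  v ON v.id = o.vendor_id
JOIN   products p ON p.id = o.product_id;

-- Periodic refresh, truncating to keep the table from growing forever:
-- TRUNCATE flat_transactions;
-- INSERT INTO flat_transactions SELECT ... (same join as above);
```

The duplicated data is the price; the payoff is that the DIH (or SolrJ) query becomes a plain `SELECT * FROM flat_transactions`.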
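Erick's in-memory caching idea can be sketched in plain Java. The class and method names (`VendorCache`, `lookupName`) are invented for illustration; in a real SolrJ program the map would be filled once from a JDBC `ResultSet` over the small secondary table, and each enriched row would become a `SolrInputDocument`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the caching technique: load the small "secondary" lookup
// table into memory once, so the indexing loop never issues a SQL
// sub-query per row. Only viable if the table fits in Java's heap.
public class VendorCache {
    private final Map<Integer, String> byId = new HashMap<>();

    // In real code this would be filled by iterating
    // "SELECT id, name FROM vendors" once at startup.
    public void put(int id, String name) { byId.put(id, name); }

    // One in-memory lookup replaces one SQL sub-query per indexed row.
    public String lookupName(int id) { return byId.getOrDefault(id, "unknown"); }

    public static void main(String[] args) {
        VendorCache cache = new VendorCache();
        cache.put(1, "Acme");
        cache.put(2, "Globex");
        // Main-table rows carry only the vendor UID; enrich them here
        // instead of running a SELECT for every row.
        List<int[]> rows = List.of(new int[]{100, 1}, new int[]{101, 2});
        for (int[] row : rows) {
            System.out.println(row[0] + " -> " + cache.lookupName(row[1]));
        }
    }
}
```

For tables too big to cache whole, the "creative things with sorting" Erick mentions amount to streaming both result sets in the same key order and caching only the current window.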
