It may or may not be helpful, but there's a similar class of problem that is frequently solved either by stored procedures or by running the query on a schedule and storing the results... It doesn't matter whether the endpoint for the data is Solr or somewhere else.
The problem is long-running queries that are extremely complex and stress the database too heavily. The solution is to de-normalize the data you need: store it in that form, and then the query gets really fast... sort of like a data warehouse type of thing. (Don't shoot, I know this isn't data warehousing...) Postgres even has something called an "automatically updatable view" that might serve, if that's your back end.

Anyway, the underlying strategy is to find a way to flatten your data before turning it into Solr documents, either by getting it out all the time on shorter-running queries into some kind of store (Kafka, a text file, whatever), or by using some feature of the database (stored procs writing to a summary table, an automatically updatable view, or similar). That way, when you run your query, you run it against the "flattened" data, which is ideally all in one table, so all the complexity of joins etc. is washed away and things ought to run pretty fast.

The cost, of course, is a huge table with tons of duplicated data... only you can say if that's worth it. I did this at my last gig, and we truncated the table every 2 weeks to keep it from growing forever. In case it's helpful...

PS - if you have the resources, a duplicate database can really help here too. Again, my experience is mostly with Postgres, which allows a "warm" standby to be live. We frequently used this for executive queries that were using the database like a data warehouse because they were so time-consuming. It kept the load off production.

On Thu, May 26, 2016 at 12:18 PM, Erick Erickson <[email protected]> wrote:

> Forgot to add... sometimes really hammering at the SQL query in DIH
> can be fruitful. Can you make a huge, monster query that's faster than
> the sub-queries?
>
> I've also seen people run processes on the DB that move all the
> data into a temporary place, making use of all the nifty stuff you
> can do there, and then use DIH on _that_. Or the view.
>
> All that said, I generally prefer using SolrJ if DIH doesn't do the job
> after a day or two of fiddling; it gives more control.
>
> Good Luck!
> Erick
>
> On Thu, May 26, 2016 at 11:02 AM, John Blythe <[email protected]> wrote:
> > oo gotcha. cool, will make sure to check it out and bounce any related
> > questions through here.
> >
> > thanks!
> >
> > best,
> >
> > --
> > *John Blythe*
> > Product Manager & Lead Developer
> >
> > 251.605.3071 | [email protected]
> > www.curvolabs.com
> >
> > 58 Adams Ave
> > Evansville, IN 47713
> >
> > On Thu, May 26, 2016 at 1:45 PM, Erick Erickson <[email protected]> wrote:
> >
> >> Solr commits aren't the issue, I'd guess. All the time is
> >> probably being spent getting the data from MySQL.
> >>
> >> I've had some luck writing to Solr from a DB through a
> >> SolrJ program; here's a place to get started:
> >> searchhub.org/2012/02/14/indexing-with-solrj/
> >> You can peel out the Tika bits pretty easily, I should think.
> >>
> >> One technique I've used is to cache
> >> some of the DB tables in Java's memory to keep
> >> from having to do the secondary lookup(s). This only
> >> really works if the "secondary table" is small enough to fit in
> >> Java's memory, of course. You can do some creative
> >> things with caching partial tables if you can sort appropriately.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, May 26, 2016 at 9:01 AM, John Blythe <[email protected]> wrote:
> >> > hi all,
> >> >
> >> > i've got layered entities in my solr import. it's calling on some
> >> > transactional data from a MySQL instance.
> >> > there are two fields that are
> >> > used to then look up other information from other tables via their related
> >> > UIDs, one of which has its own child entity w/ yet another select statement
> >> > to grab up more data.
> >> >
> >> > it fetches at about 120/s but processes at ~50-60/s. we currently only have
> >> > close to 500k records, but it's growing quickly and thus is becoming
> >> > increasingly painful to make modifications due to the reimport that needs
> >> > to then occur.
> >> >
> >> > i feel like i'd seen some threads regarding commits of new data,
> >> > master/slave, or solrcloud/sharding that could help in some ways related to
> >> > this but as of yet can't scrounge them up w/ my searches (ironic :p).
> >> >
> >> > can someone help by pointing me to some good material related to this sort
> >> > of thing?
> >> >
> >> > thanks-
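For readers not familiar with DIH, the "layered entities" John describes look roughly like the sketch below: each nested `<entity>` fires one extra SQL query per parent row, which is exactly why the processing rate lags the fetch rate. This is a hypothetical `data-config.xml` fragment; the table and column names (`transactions`, `vendor_uid`, etc.) are made up for illustration.

```xml
<!-- Hypothetical DIH config: nested entities mean per-row sub-queries. -->
<document>
  <entity name="transaction"
          query="SELECT id, vendor_uid, product_uid FROM transactions">
    <entity name="vendor"
            query="SELECT name FROM vendors WHERE id = '${transaction.vendor_uid}'"/>
    <entity name="product"
            query="SELECT descr FROM products WHERE id = '${transaction.product_uid}'">
      <entity name="category"
              query="SELECT label FROM categories WHERE product_id = '${transaction.product_uid}'"/>
    </entity>
  </entity>
</document>
```

With ~500k parent rows, the two inner entities alone add roughly a million extra round trips per full import, which is the cost the suggestions in this thread are trying to eliminate.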
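The flattening strategy described at the top of the thread can be sketched in SQL. This is a minimal illustration only, assuming hypothetical tables `orders`, `vendors`, and `products`; a real setup would schedule the refresh via cron or a stored proc, as suggested above.

```sql
-- Build the "flattened" table once, on a schedule, so the indexing
-- query reads a single wide table with no joins or sub-queries.
CREATE TABLE flat_transactions AS
SELECT o.id,
       o.purchase_date,
       v.name  AS vendor_name,
       p.descr AS product_descr
FROM   orders o
JOIN   vendors  v ON v.id = o.vendor_id
JOIN   products p ON p.id = o.product_id;

-- Periodic refresh, truncating to keep the table from growing forever:
-- TRUNCATE flat_transactions;
-- INSERT INTO flat_transactions SELECT ... (same join as above);
```

The duplicated data is the price; the payoff is that the DIH (or SolrJ) query becomes a plain `SELECT * FROM flat_transactions`.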
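Erick's in-memory caching idea can be sketched in plain Java. The class and method names (`VendorCache`, `lookupName`) are invented for illustration; in a real SolrJ program the map would be filled once from a JDBC `ResultSet` over the small secondary table, and each enriched row would become a `SolrInputDocument`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the caching technique: load the small "secondary" lookup
// table into memory once, so the indexing loop never issues a SQL
// sub-query per row. Only viable if the table fits in Java's heap.
public class VendorCache {
    private final Map<Integer, String> byId = new HashMap<>();

    // In real code this would be filled by iterating
    // "SELECT id, name FROM vendors" once at startup.
    public void put(int id, String name) { byId.put(id, name); }

    // One in-memory lookup replaces one SQL sub-query per indexed row.
    public String lookupName(int id) { return byId.getOrDefault(id, "unknown"); }

    public static void main(String[] args) {
        VendorCache cache = new VendorCache();
        cache.put(1, "Acme");
        cache.put(2, "Globex");
        // Main-table rows carry only the vendor UID; enrich them here
        // instead of running a SELECT for every row.
        List<int[]> rows = List.of(new int[]{100, 1}, new int[]{101, 2});
        for (int[] row : rows) {
            System.out.println(row[0] + " -> " + cache.lookupName(row[1]));
        }
    }
}
```

For tables too big to cache whole, the "creative things with sorting" Erick mentions amount to streaming both result sets in the same key order and caching only the current window.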
