Clayton, you could also try running an optimize on the Solr index as a weekly/bi-weekly maintenance task to keep the segment count in check and the maxDoc and numDocs counts as close together as possible (in DB terms, de-fragmenting the Solr indexes).

Best Regards,
Abhishek
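
A minimal sketch of what the scheduled optimize Abhishek suggests could look like with SolrJ; the Solr URL and core name below are placeholders, and the schedule itself would live in cron or a similar scheduler:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class WeeklyOptimize {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and core name -- adjust to your environment.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycore").build()) {
            // Merges segments and expunges deleted docs, bringing numDocs back toward maxDoc.
            solr.optimize();
        }
    }
}

Optimize rewrites segments and can be I/O-heavy, so an off-peak schedule is the usual choice.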
On Sun, May 15, 2016 at 7:18 PM, Pryor, Clayton J <cjpr...@sandia.gov> wrote:

Thank you for your feedback. I really appreciate you taking the time to write it up for me (and hopefully for others who might be considering the same). My first thought for dealing with deleted docs was to delete the contents and rebuild the index from scratch, but my primary customer for the deleted-docs functionality wants to see it immediately. I wrote a connector for transferring the contents of one Solr index to another (I call it a Solr connector), and that takes a half hour. As a side note, the reason I have multiple indexes is that we currently have physical servers for development and production but, as part of my effort, I am transitioning us to new VMs for development, quality, and production. For quality-control purposes I wanted to be able to reset each with the same set of data - thus the Solr connector.

Yes, by connector I am talking about a Java program (using SolrJ) that reads from the database and populates the Solr index. For now I have had our enterprise DBAs create a single table to hold the current index schema fields plus some that I can think of that we might use outside of the index. So far it is a completely flat structure, so it will be easy to index to Solr, but I can see that, as requirements change, we may need a more sophisticated database (with multiple tables and greater normalization), in which case the connector will have to flatten the data for the Solr index.

Thanks again, your response has been very reassuring!

:)

Clay

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, May 13, 2016 5:57 PM
To: solr-user
Subject: [EXTERNAL] Re: Does anybody crawl to a database and then index from the database to Solr?

Clayton:

I think you've done a pretty thorough investigation, and I think you're spot-on. The only thing I would add is that you _will_ reindex your entire corpus... multiple times. Count on it. Sometime, somewhere, somebody will say "gee, wouldn't it be nice if we could <insert new use-case here>". And to support it you'll have to change your Solr schema... which will almost certainly require you to re-index.

The other thing people have done for deleting documents is to create triggers in your DB that insert the deleted doc IDs into, say, a "deleted" table along with a timestamp. Whenever necessary/desirable, run a cleanup task that finds all the IDs flagged since the last run of your deletion program and removes those docs from Solr. Obviously you also have to keep a record of the timestamp of the last successful run of this program.

Or, frankly, since it takes so little time to rebuild from scratch, people have foregone any of that complexity and simply rebuild the entire index periodically. You can use "collection aliasing" to do this in the background and then switch searches atomically; it depends somewhat on how long you can wait until you need to see (well, _not_ see) the deleted docs.

But this is all refinements; I think you're going down the right path.

And when you say "connector", are you talking DIH or an external (say SolrJ) program?

Best,
Erick
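
A minimal sketch of the cleanup task Erick describes, assuming a hypothetical "deleted" table with doc_id and deleted_at columns populated by the DB triggers; the JDBC URL, table and column names, and Solr core are placeholders:

import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class DeletedDocsCleanup {
    public static void main(String[] args) throws Exception {
        // Timestamp of the last successful run, e.g. "2016-05-15 00:00:00".
        Timestamp lastRun = Timestamp.valueOf(args[0]);
        List<String> ids = new ArrayList<>();
        try (Connection db = DriverManager.getConnection("jdbc:...");   // placeholder JDBC URL
             PreparedStatement ps = db.prepareStatement(
                 "SELECT doc_id FROM deleted WHERE deleted_at > ?")) {
            ps.setTimestamp(1, lastRun);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getString("doc_id"));
                }
            }
        }
        if (!ids.isEmpty()) {
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycore").build()) {
                solr.deleteById(ids);   // remove the flagged docs from the index
                solr.commit();
            }
        }
        // Persist the current timestamp somewhere durable for the next run.
    }
}

The trigger that fills the "deleted" table lives in the source database; this only shows the Solr-side sweep.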
On Fri, May 13, 2016 at 2:04 PM, John Bickerstaff <j...@johnbickerstaff.com> wrote:

I've been working on a less complex thing along the same lines - taking all the data from our corporate database and pumping it into Kafka for long-term storage - and the ability to "play back" all the Kafka messages any time we need to re-index.

That simpler scenario has worked like a charm. I don't need to massage the data much once it's at rest in Kafka, so that was a straightforward solution, although I could have gone with a DB and just stored the Solr documents with their IDs, one per row, in an RDBMS.

The rest sounds like good ideas for your situation, as Solr isn't the best candidate for the kind of manipulation of data you're proposing and a database excels at that. It's more work, but you get a lot more flexibility, and you de-couple Solr from the data crawling, as you say.

It all sounds pretty good to me, but I've only been on the list here a short time - so I'll leave it to others to add their comments.
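
A rough sketch of the "play back" step John describes, assuming one crawled document per Kafka message with the Solr ID as the message key; the broker address, topic name, and field mapping are assumptions, not details from the thread:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class KafkaReplayIndexer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                    // placeholder broker
        props.put("group.id", "reindex-" + System.currentTimeMillis());      // fresh group => start from offset 0
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            consumer.subscribe(Collections.singletonList("crawl-docs"));     // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                if (records.isEmpty()) break;                                // crude "caught up" check for a one-shot replay
                for (ConsumerRecord<String, String> r : records) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", r.key());
                    doc.addField("content", r.value());                      // real code would map fields properly
                    solr.add(doc);
                }
            }
            solr.commit();
        }
    }
}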
On Fri, May 13, 2016 at 2:46 PM, Pryor, Clayton J <cjpr...@sandia.gov> wrote:

Question:
Do any of you have your crawlers write to a database rather than directly to Solr, and then use a connector to index to Solr from the database? If so, have you encountered any issues with this approach? If not, why not?

I have searched forums and the Solr/Lucene email archives (including browsing of http://www.apache.org/foundation/public-archives.html) but have not found any discussions of this idea. I am certain that I am not the first person to think of it; I suspect that I have just not figured out the proper queries to find what I am looking for. Please forgive me if this idea has been discussed before and I just couldn't find the discussions.

Background:
I am new to Solr and have been asked to make improvements to our Solr configurations and crawlers. I have read that the Solr index should not be considered a source of record data. It is in essence a highly optimized index to be used for generating search results rather than a retainer for record copies of data. The better approach is to rely on corporate data sources for record data and retain the ability to completely blow away a Solr index and repopulate it as needed for changing search requirements.

This made me think that perhaps it would be a good idea for us to create a database of crawled data for our Solr index. The idea is that the crawlers would write their findings to a corporate-supported database of our own design, for our own purposes, and then we would populate our Solr index from this database using a connector that writes from the database to the Solr index.

The only disadvantage that I can think of for this approach is that we will need to write a simple interface to the database that allows our admin personnel to "delete" a record from the Solr index. Of course, it won't be deleted from the database but simply flagged as not to be indexed to Solr. It will then send a delete command to Solr for any successfully "deleted" records from the database. I suspect this admin interface will grow over time, but we really only need to be able to delete records from the database for now. All of the rest of our admin work is query related, which can still be done through the Solr Console.

I can think of the following advantages:

* We have a corporate-sponsored and backed-up repository for our crawled data, which would buffer us from any inadvertent losses of our Solr index.
* We would divorce the time it takes to crawl web pages from the time it takes to populate our Solr index with data from the crawlers. I have found that my Solr Connector takes minutes to populate the entire Solr index from the current Solr prod to the new Solr instances. Compare that to hours and even days to actually crawl the web pages.
* We use URLs for our unique IDs in our Solr index. We can resolve the problem of retaining the shortest URL when duplicate content is detected in Solr simply by sorting the query used to populate Solr from the database by ID length descending - this will ensure the last URL encountered for any duplicate is always the shortest.
* We can easily ensure that certain classes of crawled content are always added last (or first if you prefer) whenever the data is indexed to Solr - rather than having to rely on the timing of crawlers.
* We could quickly and easily rebuild our Solr index from scratch at any time. This would be very valuable when changes to our Solr configurations require re-indexing our data.
* We can assign unique boost values to individual "documents" at index time by assigning a boost value for that document in the database and then applying that boost at index time.
* We can continuously run a batch program that removes broken links against this database with no impact to Solr, and then refresh Solr on a more frequent basis than we do now, because the connector will take minutes rather than hours/days to refresh the content.
* We can store additional information for the crawler to populate to Solr when available - such as:
  * actual document last-updated dates
  * a boost value for that document in the database
* This database could be used for other purposes, such as:
  * Identifying a subset of representative data to use for evaluation of configuration changes.
  * Easy access to "indexed" data for analysis work done by those not familiar with Solr.

Thanks in advance for your feedback.

Sincerely,
Clay Pryor
R&D SE Computer Science
9537 - Knowledge Systems
Sandia National Laboratories
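
A minimal sketch of the kind of database-to-Solr connector Clay describes, including the sort-by-URL-length idea from the advantages list; the table and column names, the "deleted" flag, and the boost column are placeholders rather than the actual schema, and the shortest-URL-wins behavior assumes duplicate detection in Solr (e.g. a signature-based update processor) lets the last-indexed copy replace earlier ones:

import java.sql.*;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DbToSolrConnector {
    public static void main(String[] args) throws Exception {
        // Ordering by URL length descending means that, among duplicates, the shortest
        // URL is indexed last and is the one that survives duplicate detection.
        String sql = "SELECT url, title, body, boost FROM crawl_docs "
                   + "WHERE deleted = 0 ORDER BY LENGTH(url) DESC";
        try (Connection db = DriverManager.getConnection("jdbc:...");   // placeholder JDBC URL
             Statement st = db.createStatement();
             ResultSet rs = st.executeQuery(sql);
             SolrClient solr = new HttpSolrClient.Builder(
                     "http://localhost:8983/solr/mycore").build()) {
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("url"));        // URL is the unique key
                doc.addField("title", rs.getString("title"));
                doc.addField("body", rs.getString("body"));
                // Per-document index-time boost stored in the database (supported in Solr 6.x).
                doc.setDocumentBoost(rs.getFloat("boost"));
                solr.add(doc);
            }
            solr.commit();
        }
    }
}

Because the whole run takes minutes rather than the hours/days a crawl takes, this is also the piece that makes the "rebuild from scratch at any time" advantage practical.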