Clayton, you could also try running an optimize on the Solr index as a weekly/bi-weekly maintenance task to keep the segment count in check and the maxDoc and numDocs counts as close together as possible (in DB terms, de-fragmenting the Solr indexes).

Best Regards,
Abhishek
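
A minimal sketch of what the scheduled optimize Abhishek suggests could look like with SolrJ; the Solr URL and core name below are placeholders, and the schedule itself would live in cron or a similar scheduler:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class WeeklyOptimize {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and core name -- adjust to your environment.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycore").build()) {
            // Merges segments and expunges deleted docs, bringing numDocs back toward maxDoc.
            solr.optimize();
        }
    }
}

Optimize rewrites segments and can be I/O-heavy, so an off-peak schedule is the usual choice.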
On Sun, May 15, 2016 at 7:18 PM, Pryor, Clayton J <cjpr...@sandia.gov> wrote:

Thank you for your feedback. I really appreciate you taking the time to write it up for me (and hopefully for others who might be considering the same). My first thought for dealing with deleted docs was to delete the contents and rebuild the index from scratch, but my primary customer for the deleted-docs functionality wants to see it immediately. I wrote a connector for transferring the contents of one Solr index to another (I call it a Solr connector), and that takes a half hour. As a side note, the reason I have multiple indexes is that we currently have physical servers for development and production but, as part of my effort, I am transitioning us to new VMs for development, quality, and production. For quality-control purposes I wanted to be able to reset each with the same set of data - thus the Solr connector.

Yes, by connector I am talking about a Java program (using SolrJ) that reads from the database and populates the Solr index. For now I have had our enterprise DBAs create a single table to hold the current index schema fields plus some that I can think of that we might use outside of the index. So far it is a completely flat structure, so it will be easy to index to Solr, but I can see that, as requirements change, we may need a more sophisticated database (with multiple tables and greater normalization), in which case the connector will have to flatten the data for the Solr index.

Thanks again, your response has been very reassuring!

:)

Clay

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, May 13, 2016 5:57 PM
To: solr-user
Subject: [EXTERNAL] Re: Does anybody crawl to a database and then index from the database to Solr?

Clayton:

I think you've done a pretty thorough investigation, and I think you're spot-on. The only thing I would add is that you _will_ reindex your entire corpus... multiple times. Count on it. Sometime, somewhere, somebody will say "gee, wouldn't it be nice if we could <insert new use-case here>". And to support it you'll have to change your Solr schema... which will almost certainly require you to re-index.

The other thing people have done for deleting documents is to create triggers in your DB that insert the deleted doc IDs into, say, a "deleted" table along with a timestamp. Whenever necessary/desirable, run a cleanup task that finds all the IDs flagged since the last run of your deletion program and removes those docs from Solr. Obviously you also have to keep a record of the timestamp of the last successful run of this program.

Or, frankly, since it takes so little time to rebuild from scratch, people have foregone any of that complexity and simply rebuild the entire index periodically. You can use "collection aliasing" to do this in the background and then switch searches atomically; it depends somewhat on how long you can wait until you need to see (well, _not_ see) the deleted docs.

But this is all refinements; I think you're going down the right path.

And when you say "connector", are you talking DIH or an external (say SolrJ) program?

Best,
Erick
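
A minimal sketch of the cleanup task Erick describes, assuming a hypothetical "deleted" table with doc_id and deleted_at columns populated by the DB triggers; the JDBC URL, table and column names, and Solr core are placeholders:

import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class DeletedDocsCleanup {
    public static void main(String[] args) throws Exception {
        // Timestamp of the last successful run, e.g. "2016-05-15 00:00:00".
        Timestamp lastRun = Timestamp.valueOf(args[0]);
        List<String> ids = new ArrayList<>();
        try (Connection db = DriverManager.getConnection("jdbc:...");   // placeholder JDBC URL
             PreparedStatement ps = db.prepareStatement(
                 "SELECT doc_id FROM deleted WHERE deleted_at > ?")) {
            ps.setTimestamp(1, lastRun);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getString("doc_id"));
                }
            }
        }
        if (!ids.isEmpty()) {
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycore").build()) {
                solr.deleteById(ids);   // remove the flagged docs from the index
                solr.commit();
            }
        }
        // Persist the current timestamp somewhere durable for the next run.
    }
}

The trigger that fills the "deleted" table lives in the source database; this only shows the Solr-side sweep.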
On Fri, May 13, 2016 at 2:04 PM, John Bickerstaff <j...@johnbickerstaff.com> wrote:

I've been working on a less complex thing along the same lines - taking all the data from our corporate database and pumping it into Kafka for long-term storage - and the ability to "play back" all the Kafka messages any time we need to re-index.

That simpler scenario has worked like a charm. I don't need to massage the data much once it's at rest in Kafka, so that was a straightforward solution, although I could have gone with a DB and just stored the Solr documents with their IDs, one per row, in an RDBMS.

The rest sounds like good ideas for your situation, as Solr isn't the best candidate for the kind of manipulation of data you're proposing and a database excels at that. It's more work, but you get a lot more flexibility, and you de-couple Solr from the data crawling, as you say.

It all sounds pretty good to me, but I've only been on the list here a short time - so I'll leave it to others to add their comments.
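
A rough sketch of the "play back" step John describes, assuming one crawled document per Kafka message with the Solr ID as the message key; the broker address, topic name, and field mapping are assumptions, not details from the thread:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class KafkaReplayIndexer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                    // placeholder broker
        props.put("group.id", "reindex-" + System.currentTimeMillis());      // fresh group => start from offset 0
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            consumer.subscribe(Collections.singletonList("crawl-docs"));     // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                if (records.isEmpty()) break;                                // crude "caught up" check for a one-shot replay
                for (ConsumerRecord<String, String> r : records) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", r.key());
                    doc.addField("content", r.value());                      // real code would map fields properly
                    solr.add(doc);
                }
            }
            solr.commit();
        }
    }
}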
On Fri, May 13, 2016 at 2:46 PM, Pryor, Clayton J <cjpr...@sandia.gov> wrote:

Question:
Do any of you have your crawlers write to a database rather than directly to Solr, and then use a connector to index to Solr from the database? If so, have you encountered any issues with this approach? If not, why not?

I have searched forums and the Solr/Lucene email archives (including browsing of http://www.apache.org/foundation/public-archives.html) but have not found any discussions of this idea. I am certain that I am not the first person to think of it; I suspect that I have just not figured out the proper queries to find what I am looking for. Please forgive me if this idea has been discussed before and I just couldn't find the discussions.

Background:
I am new to Solr and have been asked to make improvements to our Solr configurations and crawlers. I have read that the Solr index should not be considered a source of record data. It is in essence a highly optimized index to be used for generating search results rather than a retainer for record copies of data. The better approach is to rely on corporate data sources for record data and retain the ability to completely blow away a Solr index and repopulate it as needed for changing search requirements.

This made me think that perhaps it would be a good idea for us to create a database of crawled data for our Solr index. The idea is that the crawlers would write their findings to a corporate-supported database of our own design, for our own purposes, and then we would populate our Solr index from this database using a connector that writes from the database to the Solr index.

The only disadvantage that I can think of for this approach is that we will need to write a simple interface to the database that allows our admin personnel to "delete" a record from the Solr index. Of course, it won't be deleted from the database but simply flagged as not to be indexed to Solr. It will then send a delete command to Solr for any successfully "deleted" records from the database. I suspect this admin interface will grow over time, but we really only need to be able to delete records from the database for now. All of the rest of our admin work is query related, which can still be done through the Solr Console.

I can think of the following advantages:

* We have a corporate-sponsored and backed-up repository for our crawled data, which would buffer us from any inadvertent losses of our Solr index.
* We would divorce the time it takes to crawl web pages from the time it takes to populate our Solr index with data from the crawlers. I have found that my Solr Connector takes minutes to populate the entire Solr index from the current Solr prod to the new Solr instances. Compare that to hours and even days to actually crawl the web pages.
* We use URLs for our unique IDs in our Solr index. We can resolve the problem of retaining the shortest URL when duplicate content is detected in Solr simply by sorting the query used to populate Solr from the database by ID length descending - this will ensure the last URL encountered for any duplicate is always the shortest.
* We can easily ensure that certain classes of crawled content are always added last (or first if you prefer) whenever the data is indexed to Solr - rather than having to rely on the timing of crawlers.
* We could quickly and easily rebuild our Solr index from scratch at any time. This would be very valuable when changes to our Solr configurations require re-indexing our data.
* We can assign unique boost values to individual "documents" at index time by assigning a boost value for that document in the database and then applying that boost at index time.
* We can continuously run a batch program that removes broken links against this database with no impact to Solr, and then refresh Solr on a more frequent basis than we do now, because the connector will take minutes rather than hours/days to refresh the content.
* We can store additional information for the crawler to populate to Solr when available - such as:
  * actual document last-updated dates
  * a boost value for that document in the database
* This database could be used for other purposes, such as:
  * Identifying a subset of representative data to use for evaluation of configuration changes.
  * Easy access to "indexed" data for analysis work done by those not familiar with Solr.

Thanks in advance for your feedback.

Sincerely,
Clay Pryor
R&D SE Computer Science
9537 - Knowledge Systems
Sandia National Laboratories
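
A minimal sketch of the kind of database-to-Solr connector Clay describes, including the sort-by-URL-length idea from the advantages list; the table and column names, the "deleted" flag, and the boost column are placeholders rather than the actual schema, and the shortest-URL-wins behavior assumes duplicate detection in Solr (e.g. a signature-based update processor) lets the last-indexed copy replace earlier ones:

import java.sql.*;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DbToSolrConnector {
    public static void main(String[] args) throws Exception {
        // Ordering by URL length descending means that, among duplicates, the shortest
        // URL is indexed last and is the one that survives duplicate detection.
        String sql = "SELECT url, title, body, boost FROM crawl_docs "
                   + "WHERE deleted = 0 ORDER BY LENGTH(url) DESC";
        try (Connection db = DriverManager.getConnection("jdbc:...");   // placeholder JDBC URL
             Statement st = db.createStatement();
             ResultSet rs = st.executeQuery(sql);
             SolrClient solr = new HttpSolrClient.Builder(
                     "http://localhost:8983/solr/mycore").build()) {
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("url"));        // URL is the unique key
                doc.addField("title", rs.getString("title"));
                doc.addField("body", rs.getString("body"));
                // Per-document index-time boost stored in the database (supported in Solr 6.x).
                doc.setDocumentBoost(rs.getFloat("boost"));
                solr.add(doc);
            }
            solr.commit();
        }
    }
}

Because the whole run takes minutes rather than the hours/days a crawl takes, this is also the piece that makes the "rebuild from scratch at any time" advantage practical.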