First answer:

My employer is a library and does not have a license to harvest everything
indexed by a "web-scale discovery service" such as Primo or Summon.  If
our design automatically relays searches entered by users, and then
periodically purges the results, I think it is defensible from a licensing
perspective.

Second answer:

What if you wanted your Apache Solr-powered search to include all results
from Google Scholar for any query?  Do you think you could easily or
cheaply configure a ZooKeeper-coordinated SolrCloud cluster large enough
to harvest and index all of Google Scholar?  Would that violate robots.txt
rules?  Is it even possible to do this through an API?  Wouldn't Google
notice?

Third answer:

On Gartner's 2013 Enterprise Search Magic Quadrant, LucidWorks and the
other enterprise-search firm built on Apache Solr were dinged for lacking
federated search.  I do not have the hubris to think I can fix that, and
it is not really my role to try, but something that works without
harvesting and local indexing is obviously desirable to enterprise-search
users.



On Mon, Aug 26, 2013 at 4:46 PM, Paul Libbrecht <p...@hoplahup.net> wrote:

>
> Why not simply create a meta-search engine that indexes everything from
> each of the nodes?
> (I think this is called harvesting.)
>
> I believe this is the way to avoid all sorts of performance bottlenecks.
> As far as I could analyze, the performance of a federated search is the
> performance of the slowest node, which can turn out to be quite bad if
> you cannot enforce guarantees on the remote sources.
>
> Or are the "remote cores" below actually things that you manage on your
> side?  If so, guarantees are easy to maintain.
>
> Paul
>
>
> Le 26 août 2013 à 22:38, Dan Davis a écrit :
>
> > I have now come to the task of estimating the man-days needed to add
> > "Blended Search Results" to Apache Solr.  The argument has been made that
> > this is not desirable (see Jonathan Rochkind's blog entries on Bento
> > search with Blacklight).  But the estimate remains.  No estimate is worth
> > much without a design, so I have come to the difficulty of estimating
> > this without in-depth knowledge of the Apache Solr core.  Here is my
> > design, likely imperfect, as it stands.
> >
> >   - Configure a core specific to each search source (local or remote).
> >   - On cores that index remote content, implement a periodic delete query
> >   that removes documents whose timestamp is too old.
> >   - Implement a custom requestHandler for the "remote" cores that goes out
> >   and queries the remote source.  For each result in the top N
> >   (configurable), it computes a stable id (e.g. based on the remote
> >   resource URL, DOI, or a hash of the data returned).  It uses that id to
> >   look up the document in the Lucene index.  If the document is not there,
> >   it updates the core and sets a flag that a commit is required.  Once it
> >   is done, it commits if needed.
> >   - Configure a core that uses a custom SearchComponent to call the
> >   requestHandler that fetches new documents and commits them.  Since the
> >   cores for remote content are separate cores, they can reopen their
> >   searchers at this point if any commit is needed.  The custom
> >   SearchComponent will wait for the commit and reload to complete.  Then
> >   the search continues using the other cores as "shards".
> >   - Auto-warming will ensure that the most recently requested data is
> >   present.
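A sketch of the stable-id step above (the class and method names are mine, not part of Solr): hashing the remote resource URL or DOI yields the same id every time the same remote result comes back, so repeated queries update the cached document instead of duplicating it.  The periodic purge in the second bullet can then be an ordinary delete-by-query such as timestamp:[* TO NOW-7DAYS] run on a timer.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class StableId {
    /** Stable document id: lowercase hex SHA-256 of the remote URL or DOI. */
    static String stableId(String key) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(key.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("SHA-256 is guaranteed by the JDK", e);
        }
    }

    public static void main(String[] args) {
        // Same input, same id -- usable as the uniqueKey field value.
        System.out.println(stableId("https://example.org/article/123"));
    }
}
```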
> >
> > It will, of course, be very slow a good part of the time.
> >
> > Erick and others, I need to know whether this design has legs and what
> > other alternatives I might consider.
> >
> >
> >
> > On Sun, Aug 18, 2013 at 3:14 PM, Erick Erickson <erickerick...@gmail.com
> >wrote:
> >
> >> The lack of global TF/IDF has been answered in the past,
> >> in the sharded case, by "usually you have similar enough
> >> stats that it doesn't matter". This pre-supposes a fairly
> >> evenly distributed set of documents.
> >>
> >> But if you're talking about federated search across different
> >> types of documents, then what would you "rescore" with?
> >> How would you even consider scoring docs that are somewhat/
> >> totally different? Think magazine articles and metadata associated
> >> with pictures.
> >>
> >> What I've usually found is that one can use grouping to show
> >> the top N of a variety of results. Or show tabs with different
> >> types. Or have the app intelligently combine the different types
> >> of documents in a way that "makes sense". But I don't know
> >> how you'd just get "the right thing" to happen with some kind
> >> of scoring magic.
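For comparison, the grouping Erick describes is available out of the box via Solr's result-grouping parameters; assuming a field (here called source_type, my naming) that tags each document with its collection, a request along these lines returns the top few hits per type in one response:

```
q=heart disease
&group=true
&group.field=source_type
&group.limit=3
```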
> >>
> >> Best
> >> Erick
> >>
> >>
> >> On Fri, Aug 16, 2013 at 4:07 PM, Dan Davis <dansm...@gmail.com> wrote:
> >>
> >>> I've thought about it, and I have no time to really do a meta-search
> >>> during evaluation.  What I need to do is create a single core that
> >>> contains both of my data sets, and then describe the architecture that
> >>> would be required to do blended results, with liberal estimates.
> >>>
> >>> From the perspective of evaluation, I need to understand whether any of
> >>> the solutions to better ranking in the absence of global IDF have been
> >>> explored.  I suspect that one could retrieve a set of results much
> >>> larger than N from a set of shards, then re-score in some way that does
> >>> not require IDF, e.g. by storing all the results in the same priority
> >>> queue and *re-scoring* before *re-ranking*.
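The priority-queue idea can be sketched outside Solr; everything here (the Hit record, the tf-over-length score) is a stand-in for whatever IDF-free statistic the shards can return alongside each hit:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class Rescore {
    /** A hit as returned by one shard: id plus IDF-free statistics. */
    record Hit(String id, float termFreq, int docLen) {}

    /** Merge hits from several shards, re-ranked by tf normalized by length. */
    static List<String> topN(List<List<Hit>> shardResults, int n) {
        PriorityQueue<Hit> pq = new PriorityQueue<>(
                Comparator.comparingDouble((Hit h) -> h.termFreq() / (double) h.docLen())
                          .reversed());
        for (List<Hit> shard : shardResults) {
            pq.addAll(shard);   // shard-local scores are discarded here
        }
        List<String> ids = new ArrayList<>();
        while (ids.size() < n && !pq.isEmpty()) {
            ids.add(pq.poll().id());
        }
        return ids;
    }
}
```

The point is only that the re-scoring happens once, over all shards' candidates together, so no global IDF is needed; a real implementation would feed the rescored queue back into the response.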
> >>>
> >>> The other way to do this would be a custom SearchHandler that works
> >>> differently - it performs the query, retrieves all results deemed
> >>> relevant by another engine, adds them to the Lucene index, and then
> >>> performs the query again in the standard way.  This would be quite
> >>> slow, but perhaps useful as a way to evaluate my method.
> >>>
> >>> I still welcome any suggestions on how such a SearchHandler could be
> >>> implemented.
> >>>
> >>
> >>
>
>
