Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

Tom Burton-West Wed, 22 Aug 2012 08:56:19 -0700

Hi Lance,

I don't understand enough about how field collapsing is implemented, but I
thought it worked with distributed search.  Are you saying it only works if
all the documents that need to be collapsed together are on the same shard?


Tom
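
Lance's suggestion quoted below (keep each "collapse group" on a single shard) can be illustrated with a minimal sketch. It assumes documents are routed at index time by a stable hash of the grouping key. The field name `book_id` and the helper names are hypothetical and not from the thread; only the shard count of 12 comes from the original post. As far as the Solr documentation describes it, distributed grouping does work across shards, but options such as `group.ngroups` only return accurate counts when every document of a group sits on the same shard, which is the situation this kind of routing is meant to guarantee.

```python
import hashlib

NUM_SHARDS = 12  # the thread describes an index split across 12 shards


def shard_for_group(group_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a grouping key to a shard number with a stable hash.

    md5 is used only because it is stable across processes and runs,
    unlike Python's built-in hash() for strings.
    """
    digest = hashlib.md5(group_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(docs, group_field="book_id", num_shards=NUM_SHARDS):
    """Assign each document to a shard based on its grouping key.

    `group_field` is a hypothetical field name; every document with the
    same value in that field ends up in the same shard's list.
    """
    shards = {i: [] for i in range(num_shards)}
    for doc in docs:
        shards[shard_for_group(doc[group_field], num_shards)].append(doc)
    return shards


# Example: two pages of the same (hypothetical) book are co-located.
pages = [
    {"id": "bookA_p1", "book_id": "bookA"},
    {"id": "bookA_p2", "book_id": "bookA"},
    {"id": "bookB_p1", "book_id": "bookB"},
]
placement = route(pages)
```

The trade-off is that shard sizes are then driven by group sizes, so a few very large books or journals could leave the shards unevenly loaded.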

On Wed, Aug 22, 2012 at 2:41 AM, Lance Norskog <goks...@gmail.com> wrote:

> How do you separate the documents among the shards? Can you set up the
> shards such that each "collapse group" lives on a single shard? That
> way you never have to do distributed grouping.
>
> On Tue, Aug 21, 2012 at 4:10 PM, Tirthankar Chatterjee
> <tchatter...@commvault.com> wrote:
> > This won't work, see my thread on Solr 3.6 Field collapsing
> > Thanks,
> > Tirthankar
> >
> > -----Original Message-----
> > From: Tom Burton-West <tburt...@umich.edu>
> > Date: Tue, 21 Aug 2012 18:39:25
> > To: solr-user@lucene.apache.org<solr-user@lucene.apache.org>
> > Reply-To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> > Cc: William Dueber<dueb...@umich.edu>; Phillip Farber<pfar...@umich.edu>
> > Subject: Scalability of Solr Result Grouping/Field Collapsing:
> >  Millions/Billions of documents?
> >
> > Hello all,
> >
> > We are thinking about using Solr Field Collapsing on a rather large scale
> > and wonder if anyone has experience with performance when doing Field
> > Collapsing on millions or billions of documents (details below).  Are
> > there performance issues with grouping large result sets?
> >
> > Details:
> > We have a collection of the full text of 10 million books/journals.  This
> > is spread across 12 shards with each shard holding about 800,000
> > documents.  When a query matches a journal article, we would like to group
> > all the matching articles from the same journal together (there is a
> > unique id field identifying the journal).  Similarly, when there is a match
> > in multiple copies of the same book, we would like to group all results for
> > the same book together (again we have a unique id field we can group on).
> > Sometimes a short query against the OCR field will result in over one
> > million hits.  Are there known performance issues when field collapsing
> > result sets containing a million hits?
> >
> > We currently index the entire book as one Solr document.  We would like to
> > investigate the feasibility of indexing each page as a Solr document with a
> > field indicating the book id.  We could then offer our users the choice of
> > a list of the most relevant pages, or a list of the books containing the
> > most relevant pages.  We have approximately 3 billion pages.  Does anyone
> > have experience using field collapsing on this sort of scale?
> >
> > Tom
> >
> > Tom Burton-West
> > Information Retrieval Programmer
> > Digital Library Production Service
> > University of Michigan Library
> > http://www.hathitrust.org/blogs/large-scale-search
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>
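
For reference, the kind of request discussed in the original post (group page-level hits by book) can be sketched as a Solr result-grouping query. This is only an illustration under assumptions: the base URL, the `ocr` and `book_id` field names, and the helper name are hypothetical, while `group`, `group.field`, and `group.limit` are standard Solr result-grouping parameters. With grouping enabled, `rows` counts groups rather than individual documents.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def grouping_url(base_url, query, group_field, pages_per_group=3, rows=10):
    """Build a Solr select URL that groups results on one field.

    base_url and group_field are supplied by the caller; nothing here
    contacts a server, it only assembles the query string.
    """
    params = {
        "q": query,
        "rows": rows,                    # number of groups to return
        "group": "true",                 # turn on result grouping
        "group.field": group_field,      # field to collapse on
        "group.limit": pages_per_group,  # documents returned per group
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)


# Hypothetical host and field names, for illustration only.
url = grouping_url("http://localhost:8983/solr/select", "ocr:whale", "book_id")
params_sent = parse_qs(urlsplit(url).query)
```

Whether such a query performs acceptably on result sets of a million hits or more is exactly the open question in this thread; the sketch shows only the shape of the request, not an answer to that.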
