It scales well to tens of millions of documents or more with enough
RAM, provided you have a mechanism for expiring the cache when a reader
is no longer in use and for pooling cached items for re-use so your
process doesn't incur massive GC sweeps when the cache is expired.
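A rough Python sketch of that expire-and-pool scheme (the class and names are invented for illustration; a real implementation would be C# wrapping Lucene.Net readers):

```python
# Sketch: a per-reader vector cache whose entries are recycled through a
# free pool when a reader is retired, instead of being left for the GC.
class VectorCache:
    def __init__(self):
        self._by_reader = {}   # reader_id -> {doc_id: vector}
        self._pool = []        # recycled vector buffers

    def _new_vector(self):
        # Reuse a pooled buffer when one is available.
        return self._pool.pop() if self._pool else []

    def put(self, reader_id, doc_id, terms):
        vec = self._new_vector()
        vec.clear()
        vec.extend(terms)
        self._by_reader.setdefault(reader_id, {})[doc_id] = vec

    def get(self, reader_id, doc_id):
        return self._by_reader.get(reader_id, {}).get(doc_id)

    def expire_reader(self, reader_id):
        # Return the retired reader's buffers to the pool for re-use.
        for vec in self._by_reader.pop(reader_id, {}).values():
            self._pool.append(vec)
```

The point of the pool is that expiring a reader recycles its buffers instead of dropping millions of short-lived objects on the collector at once.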

Michael

-----Original Message-----
From: Matt Honeycutt [mailto:mbhoneyc...@gmail.com] 
Sent: Thursday, November 12, 2009 12:35 PM
To: lucene-net-user@incubator.apache.org
Subject: Re: FieldLookup for field with multiple values

I shudder to think about what the higher-ups will say about the cost of
faster storage.  They're very stingy on things that make our lives
easier.
:)

I like the idea of caching the term vectors somehow, though I wonder how
that would scale with millions of documents.  I will add that to the
list of
things to prototype for our long-term solution.

Thanks for the feedback,

---Matt

On Thu, Nov 12, 2009 at 12:14 PM, Michael Garski
<mgar...@myspace-inc.com> wrote:

> Matt,
>
> Metadata can be collected during the course of a search in a
> HitCollector (note that 2.9 will deprecate the current HitCollector in
> favor of the more flexible Collector).  One approach to collecting
> facets is to index the field with the desired facet metadata with
> TermVectors enabled.  During collection you can retrieve the
> TermFreqVector for a document with Reader.GetTermFreqVector for each
> document and collect/aggregate metadata however you like.
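A sketch of that collection loop in Python, with `get_term_freq_vector` standing in for `IndexReader.GetTermFreqVector` (the helper and the data shapes are stand-ins, not the Lucene API):

```python
from collections import Counter

# Sketch: aggregate facet counts while collecting hits.  Each hit's term
# vector for the facet field is a list of (term, frequency) pairs.
def collect_facets(hit_doc_ids, get_term_freq_vector, field="orgId"):
    counts = Counter()
    for doc_id in hit_doc_ids:
        vector = get_term_freq_vector(doc_id, field)  # [(term, freq), ...]
        if vector:
            for term, freq in vector:
                counts[term] += freq
    return counts
```

In the real collector this aggregation would happen inside `Collect`, one document at a time, rather than over a finished hit list.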
>
> Making the call to get the TermFreqVector for each hit will be a
> performance hit due to IO and deserialization (even though it's very
> fast, 50K times in one search is a lot).  At that point you can add
> faster IO (Fusion IO rules this space) or, during an index warmup,
> create an in-memory structure with lookups to the vectors keyed off the
> document id.
>
> Hope that helps.
>
> Michael
>
> -----Original Message-----
> From: Matt Honeycutt [mailto:mbhoneyc...@gmail.com]
> Sent: Thursday, November 12, 2009 7:13 AM
> To: lucene-net-user@incubator.apache.org
> Subject: Re: FieldLookup for field with multiple values
>
> I can elaborate a little on what our *planned* approach for utilizing
> SQL Server is.  I don't know if this will work, but I've done similar
> things with SQL CLR and haven't had it explode (yet), so I'm hopeful.
> Anyway:
>
> Our system needs two types of output, basically: the full-text reports
> (I'll call them documents to reduce ambiguity), and then statistical
> reports built on those documents.  The documents can easily be
> retrieved from Lucene, so the challenge is building the reports.  For
> that, my plan is to submit the same query string to SQL Server that I
> sent to Lucene.  Internally, SQL Server would then pass the query back
> to Lucene and retrieve a list of document IDs that matched.  The
> communication may be over WCF or something similar, and will be
> compressed during transit to reduce IO overhead.  Once SQL Server has
> the IDs, they will be loaded into a temporary table (with indexes) or a
> table variable, which will then be used to filter the metadata that the
> statistical reports are built from.
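A sketch of the temp-table join using SQLite as a stand-in for SQL Server (table and column names are invented; the real system would use SQL CLR and WCF as described above):

```python
import sqlite3

# Sketch: load the doc IDs matched by Lucene into an indexed temp table,
# then join it against the metadata used for the statistical reports.
def report_totals(conn, matched_ids):
    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE matched (doc_id INTEGER PRIMARY KEY)")
    cur.executemany("INSERT INTO matched VALUES (?)",
                    [(i,) for i in matched_ids])
    cur.execute("""
        SELECT m.org, SUM(m.funding)
        FROM metadata m JOIN matched t ON t.doc_id = m.doc_id
        GROUP BY m.org""")
    rows = cur.fetchall()
    cur.execute("DROP TABLE matched")
    return dict(rows)
```

The primary key on the temp table gives the join an index, which is the part that matters as the matched-ID list grows.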
>
> I have no idea how such a system would perform.  I do hope to do some
> feasibility tests sometime Real Soon (like in the next few weeks), and
> I'll
> post my results if I manage to get it working.
>
> If anyone has any other suggestions, please do share.
>
> On Thu, Nov 12, 2009 at 8:43 AM, Moray McConnachie <
> mmcco...@oxford-analytica.com> wrote:
>
> > >While we're discussing this, anyone have any advice or suggestions
> > >for a better solution?  We've considered a few things for our
> > >long-term solution.
> >
> > I'd be very interested to hear thoughts on intersecting SQL and Lucene
> > too, as in our case we have very large lists of organisations which
> > have different permissions (stored in SQL) for different documents
> > stored in Lucene. Showing in the search results only those documents
> > to which the organisation has permission is quite expensive for
> > queries with lots of results. Storing it in the documents is not
> > manageable because they need to be updated frequently across multiple
> > documents. Currently we precompute a list for each organisation, cache
> > that in memory, and recache it every time that organisation is
> > updated. However this too is costly.
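The precompute-and-cache approach might look roughly like this in Python (class and callback names are invented; the loader would query SQL):

```python
# Sketch: one allowed-doc set per organisation, cached lazily and
# invalidated when that organisation's permissions change in SQL.
class PermissionFilter:
    def __init__(self, load_allowed_ids):
        self._load = load_allowed_ids   # org_id -> iterable of doc ids
        self._cache = {}

    def allowed(self, org_id):
        if org_id not in self._cache:
            self._cache[org_id] = frozenset(self._load(org_id))
        return self._cache[org_id]

    def invalidate(self, org_id):
        # Call when the organisation is updated, forcing a recache.
        self._cache.pop(org_id, None)

    def filter_hits(self, org_id, hit_doc_ids):
        allowed = self.allowed(org_id)
        return [d for d in hit_doc_ids if d in allowed]
```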
> >
> > Storing the Lucene document nos in SQL during indexing, and then
> > passing the list of Lucene document nos matching a search to SQL for
> > filtering seems the right way to go. But Matt is right, the problem is
> > with scaling this to searches returning many thousands of documents.
> >
> > Yours,
> > M.
> > -------------------------------------
> > Moray McConnachie
> > Director of IT    +44 1865 261 600
> > Oxford Analytica  http://www.oxan.com
> >
> > -----Original Message-----
> > From: Matt Honeycutt [mailto:mbhoneyc...@gmail.com]
> > Sent: 12 November 2009 13:53
> > To: lucene-net-user@incubator.apache.org
> > Subject: Re: FieldLookup for field with multiple values
> >
> > Yeah, it is sort of like your standard faceting scenario, except
> > there are about 20,000 facets (organizations), and there are complex
> > relationships among the facets.
> >
> > The reports we're dealing with only occasionally break the funding up
> > by organization, so we decided (for now) to just store a single
> > funding value, then break it up after-the-fact by dividing it by the
> > number of organizations.  So no, the funding is only stored once.
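A tiny sketch of that even split (the function name is invented):

```python
# Sketch: a single stored funding value split evenly across the
# organizations listed on the report.
def funding_per_org(total_funding, org_ids):
    if not org_ids:
        raise ValueError("report has no organizations")
    share = total_funding / len(org_ids)
    return {org: share for org in org_ids}
```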
> >
> > While we're discussing this, anyone have any advice or suggestions
> > for a better solution?  We've considered a few things for our
> > long-term solution.  One is to put this metadata in a SQL Server
> > instance, and use SQL CLR to build a temporary table based on document
> > IDs from a Lucene index (hosted over WCF or something similar), then
> > do the reporting within SQL Server.  We plan to compress the list of
> > IDs going back from Lucene to SQL Server to cut down on IO overhead,
> > but we're still concerned that approach won't scale as we go from
> > hundreds of thousands to millions of reports.
> >
> > Another option we've discussed is to precompute data cubes and use
> > these to calculate reporting information.  The concern here is the
> > high dimensionality of the data (we have about 20,000 distinct
> > organizations now, but fully expect that to increase by an order of
> > magnitude) as well as the accuracy of the generated reports, since
> > there's (probably) not a good way to divide the cube based on
> > arbitrary Lucene queries.
> >
> > On Thu, Nov 12, 2009 at 1:03 AM, Michael Garski
> > <mgar...@myspace-inc.com> wrote:
> >
> > > Sounds like a full-text search with the results simply being facets
> > > on the organizations sorted by the funding amount?
> > >
> > > You mentioned adding the org ID once for each document.  Do you do
> > > the same for the funding, with the funding for each corresponding
> > > organization?
> > >
> > > Michael
> > >
> > >
> > > -----Original Message-----
> > > From: Matt Honeycutt [mailto:mbhoneyc...@gmail.com]
> > > Sent: Wed 11/11/2009 10:17 PM
> > > To: lucene-net-user@incubator.apache.org
> > > Subject: Re: FieldLookup for field with multiple values
> > >
> > > Well, let me prefix what I'm about to describe by saying that I
> > > know that I'm doing something with Lucene that it wasn't meant to
> > > do.  This is for a "proof of concept" system that I'm helping put
> > > together on a tight schedule with very limited resources, and we're
> > > trying to get to a mostly-working state as quickly as possible.
> > >
> > > That said, we are basically storing reports in Lucene.  The reports
> > > are fairly standard documents for the most part: they have a title,
> > > body, abstract, etc., all of which we index and search with Lucene.
> > > However, they also have a few fields that aren't standard, including
> > > a list of involved organizations as well as a dollar amount for each
> > > report.  The organizations are stored as IDs, and we add the org ID
> > > field multiple times, once for each organization involved in the
> > > report.  The funding is also stored as a non-indexed field on the
> > > Lucene document.
> > >
> > > What I'm trying to do is build a quick-and-dirty org-by-dollar
> > > report off of the reports that match the user's query.  So, a query
> > > for "aerospace" might match 50,000 documents, and I want to show the
> > > user the top 5 organizations in terms of dollars.  Again, I know
> > > reporting like this isn't what Lucene was meant for, and we do have
> > > some ideas on how to handle it long-term, but for now, I'm trying to
> > > get it working as well as I can using Lucene alone, and Lucene does
> > > do a great job of finding the relevant set of documents to build a
> > > report from.
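A Python sketch of that org-by-dollar rollup (the dict lookups stand in for reading the multi-valued org ID field and the stored funding field; all names are invented):

```python
from collections import defaultdict

# Sketch: sum each org's share of funding over the matching docs,
# then take the top 5.
def top_orgs(matching_doc_ids, doc_orgs, doc_funding, n=5):
    totals = defaultdict(float)
    for doc_id in matching_doc_ids:
        orgs = doc_orgs[doc_id]
        share = doc_funding[doc_id] / len(orgs)  # even split, as above
        for org in orgs:
            totals[org] += share
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```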
> > >
> > > On Wed, Nov 11, 2009 at 8:56 PM, Michael Garski
> > > <mgar...@myspace-inc.com> wrote:
> > >
> > > > Matt,
> > > >
> > > > StringIndex is for use when a field has only one value in it for
> > > > the purposes of sorting results, not for tokenized fields with
> > > > multiple values.  TermVectors might be a better approach, but for
> > > > 50K docs, you'll encounter an IO hit on reading them.
> > > >
> > > > I'm curious why you are looking to grab all of the terms for a
> > > > ScoreDoc...  can you shed some light on that?
> > > >
> > > > Michael
> > > >
> > > > -----Original Message-----
> > > > From: Matt Honeycutt [mailto:mbhoneyc...@gmail.com]
> > > > Sent: Wednesday, November 11, 2009 4:57 PM
> > > > To: lucene-net-user@incubator.apache.org
> > > > Subject: FieldLookup for field with multiple values
> > > >
> > > > It seems that the StringIndex returned by
> > > > FieldCache.Fields.Default.GetStringIndex() only indexes one value
> > > > for a document even when the document has multiple values for the
> > > > field.  Is there a performant way to get all the values for a
> > > > particular field in a ScoreDoc?  I'm having to do this across the
> > > > entire result set of ScoreDocs (up to 50,000), and retrieving the
> > > > values through LuceneDocument.GetFields is not going to cut it.
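One sketch of a multi-value alternative: walk the term postings for the field once at warmup and build a doc-id-to-values map (Python stand-in; `postings` approximates iterating a TermEnum/TermDocs pair):

```python
# Sketch: a one-time pass that maps doc id -> all values of a field,
# built from (value, [doc ids]) postings for that field.
def build_multi_value_cache(postings):
    cache = {}
    for value, doc_ids in postings:
        for doc_id in doc_ids:
            cache.setdefault(doc_id, []).append(value)
    return cache
```

Unlike StringIndex, this keeps every value per document, at the cost of building and holding the whole map in memory per reader.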
> > > >
