I would fetch the term vectors for the top N documents and add them up myself. 
You could even scale the term counts by the relevance score for the document. 
That would avoid problems with analyzing ten documents where only the first 
three were really good matches.
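
Here's a rough, untested sketch of that approach against Solr's
TermVectorComponent (the stock /tvrh handler in the 4.x example config).
The query, the field names, and the exact JSON shape (the "doc-N" keys,
"uniqueKey", "tf") are assumptions on my part; adjust them to whatever
your instance actually returns:

    from collections import Counter
    import json, urllib2

    # Fetch term vectors (with term frequencies) for the top 10 hits,
    # along with each document's relevance score.
    url = ("http://localhost:8983/solr/tvrh"
           "?q=story_text:iraq&rows=10&fl=id,score"
           "&tv=true&tv.tf=true&wt=json&json.nl=map")
    resp = json.load(urllib2.urlopen(url))

    scores = dict((d["id"], d["score"]) for d in resp["response"]["docs"])

    # Sum each term's tf across documents, scaled by relevance score,
    # so marginal matches contribute less to the final counts.
    totals = Counter()
    for key, doc in resp["termVectors"].items():
        if not key.startswith("doc-"):
            continue  # skip the uniqueKeyFieldName / warnings entries
        weight = scores.get(doc.get("uniqueKey"), 1.0)
        for term, info in doc.get("story_text", {}).items():
            totals[term] += info["tf"] * weight

    print totals.most_common(50)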

I did something similar in a different engine for a kNN classifier.

wunder

On May 22, 2013, at 8:12 PM, Otis Gospodnetic wrote:

> Here's a possibility:
> 
> At index time, extract important terms (and/or phrases) from the
> story_text field and store the top N of them in a separate field (which
> will be much smaller/shorter).  Then facet on that.  Or just retrieve it
> and manually parse and count in the client if that turns out to be
> faster.
> I did this in the previous decade before Solr was available and it
> worked well.  I limited my counting to top N (200?) hits.
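> 
> Something like this for the retrieve-and-count variant (untested; the
> top_terms field name is hypothetical -- it's whatever field you store
> the extracted terms in, assumed here to be a space-separated string):
> 
>     from collections import Counter
>     import json, urllib2
> 
>     # Fetch only the stored top-terms field for the top 200 hits.
>     url = ("http://localhost:8983/solr/query"
>            "?q=story_text:iraq&rows=200&fl=top_terms&wt=json")
>     docs = json.load(urllib2.urlopen(url))["response"]["docs"]
> 
>     # Count terms in the client instead of faceting server-side.
>     counts = Counter()
>     for doc in docs:
>         counts.update(doc.get("top_terms", "").split())
> 
>     print counts.most_common(100)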
> 
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
> 
> On Wed, May 22, 2013 at 10:54 PM, David Larochelle
> <dlaroche...@cyber.law.harvard.edu> wrote:
>> The goal of the system is to obtain data that can be used to generate word
>> clouds so that users can quickly get a sense of the aggregate contents of
>> all documents matching a particular query. For example, a user might want
>> to see a word cloud of all documents discussing 'Iraq' in a particular
>> newspaper.
>> 
>> Faceting on story_text gives counts of individual words rather than
>> entire text strings. I think this is because of the tokenization that
>> happens automatically as part of the text_general field type. I'm happy
>> to look at alternatives to faceting, but I wasn't able to find one that
>> provides aggregate word counts for just the documents matching a
>> particular query, rather than for individual documents or the entire
>> index.
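>> 
>> For example, a plain field facet over story_text returns per-word counts
>> across the matching documents (a sketch of the kind of query I mean; the
>> 'Iraq' query is just illustrative):
>> 
>> http://localhost:8983/solr/query?q=story_text:iraq&rows=0&facet=true&facet.field=story_text&facet.limit=100&facet.method=enum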
>> 
>> --
>> 
>> David
>> 
>> 
>> On Wed, May 22, 2013 at 10:32 PM, Brendan Grainger <
>> brendan.grain...@gmail.com> wrote:
>> 
>>> Hi David,
>>> 
>>> Out of interest, what are you trying to accomplish by faceting over the
>>> story_text field? Is it generally the case that the story_text field will
>>> contain values that are repeated or categorize your documents somehow?
>>> From your description: "story_text is used to store free form text
>>> obtained by crawling newspapers and blogs", it doesn't seem that way, so
>>> I'm not sure faceting is what you want in this situation.
>>> 
>>> Cheers,
>>> Brendan
>>> 
>>> 
>>> On Wed, May 22, 2013 at 9:49 PM, David Larochelle <
>>> dlaroche...@cyber.law.harvard.edu> wrote:
>>> 
>>>> I'm trying to quickly obtain cumulative word frequency counts over all
>>>> documents matching a particular query.
>>>> 
>>>> I'm running Solr 4.3.0 on a machine with 16GB of RAM. My index is
>>>> 2.5 GB and has around 350,000 documents.
>>>> 
>>>> My schema includes the following fields:
>>>> 
>>>> <field name="id" type="string" indexed="true" stored="true"
>>> required="true"
>>>> multiValued="false" />
>>>> <field name="media_id" type="int" indexed="true" stored="true"
>>>> required="true" multiValued="false" />
>>>> <field name="story_text"  type="text_general" indexed="true"
>>> stored="true"
>>>> termVectors="true" termPositions="true" termOffsets="true" />
>>>> 
>>>> 
>>>> story_text is used to store free form text obtained by crawling
>>>> newspapers and blogs.
>>>> 
>>>> Running faceted searches with the fc or fcs methods fails with the error
>>>> "Too many values for UnInvertedField faceting on field story_text"
>>>> 
>>>> 
>>>> http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs
>>>> 
>>>> Running faceted search with the 'enum' method succeeds but takes a very
>>>> long time.
>>>> 
>>>> 
>>>> http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
>>>> 
>>>> The frustrating thing is that even if the query only returns a few
>>>> hundred documents, it still takes 10 minutes or longer to get the
>>>> cumulative word count results.
>>>> 
>>>> Eventually we're hoping to build a system that will return results in a
>>>> few seconds and scale to hundreds of millions of documents.
>>>> Is there any way to get this level of performance out of Solr/Lucene?
>>>> 
>>>> Thanks,
>>>> 
>>>> David
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Brendan Grainger
>>> www.kuripai.com
>>> 

--
Walter Underwood
wun...@wunderwood.org


