Either you index those words in a facet or calculate the top 10 words 
on the fly. The latter approach can be effective enough if you are able to read 
those documents quickly. Computing the top 10 words is pretty cheap in terms 
of memory and CPU, because there is in fact no need for a full sort 
(see https://github.com/addthis/stream-lib).
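To illustrate the "no full sort" point, here is a minimal sketch of the idea using only the JDK (a plain HashMap plus a bounded min-heap; stream-lib's sketches achieve the same with bounded memory for huge streams). The class and method names are my own for illustration, not stream-lib's API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopWords {

    // Returns the k most frequent words without fully sorting the counts:
    // a min-heap bounded to size k keeps only the current top k entries,
    // so the cost is O(n log k) time and O(k) space beyond the count map.
    static List<Map.Entry<String, Long>> topK(Iterable<String> words, int k) {
        Map<String, Long> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w.toLowerCase(), 1L, Long::sum);
        }
        PriorityQueue<Map.Entry<String, Long>> heap =
            new PriorityQueue<>(Comparator.comparingLong(Map.Entry::getValue));
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll(); // evict the entry with the smallest count
            }
        }
        // Only the final k entries are sorted, which is negligible.
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        result.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return result;
    }

    public static void main(String[] args) {
        List<String> words = List.of("lucene", "index", "lucene", "facet",
                                     "index", "lucene");
        topK(words, 2).forEach(e ->
            System.out.println(e.getKey() + " " + e.getValue()));
    }
}
```

For 100,000 summaries this keeps memory proportional to the vocabulary size, not the document count; if even the count map is too large, that is exactly where stream-lib's approximate counters come in.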

> On Feb 14, 2015, at 04:34, Maisnam Ns <maisnam...@gmail.com> wrote:
> 
> Hi Jigar,
> 
> Thanks for the clustering algorithm; I will see if it can be applied.
> 
> These are not known fields, as these documents are coming from some other
> search engine. Every time the user changes his search string the documents
> will vary, but I am assuming the worst-case scenario here, say about 100,000
> documents. For faceted search we would also need to know the facets in advance.
> 
> You search for a string and it returns a bunch of documents, each containing
> a summary of the document, and all I have to do is quickly find the top 10
> words in those summaries, which will vary depending on the search query.
> Response time is the problem: it has to be just a few seconds, and memory is
> also an issue here.
> 
> Again, thanks for that link; I will look into it. If you find a solution,
> please let me know.
> 
> Thanks
> 
> On Fri, Feb 13, 2015 at 11:12 PM, Jigar Shah <jigaronl...@gmail.com> wrote:
> 
>> If those are known fields in the documents, you may extract words while
>> indexing and create facets. Lucene supports faceted search, which can give
>> you the top-n counts for such fields and is much more efficient.
>> 
>> Another option is to apply a clustering algorithm to the results, which can
>> provide the top n words; you can refer to http://search.carrot2.org
>> 
>> 
>> 
>> 
>> On Fri, Feb 13, 2015 at 10:13 PM, Maisnam Ns <maisnam...@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> Can someone help me with this use case:
>>> 
>>> 1. I have to search for a string, and let's say the search engine (it is
>>> not Lucene) finds this string in 100,000 documents. I need to find the
>>> top 10 words occurring in these 100,000 documents. As the document set
>>> is large, how do I further index these documents and find the top 10
>>> words?
>>> 
>>> 2. I am thinking of using a Lucene RAMDirectory, or in-memory indexing,
>>> to find the top 10 most frequently occurring words.
>>> 3. Is this the right approach? Indexing and writing to the disk would be
>>> almost overkill, and a user can search any number of times.
>>> 
>>> Thanks in advance.
>>> 
>> 

---
Denis Bazhenov <dot...@gmail.com>






---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org