Hi Denis,

Looks good, and thanks for the links. One more request: once I have found the top ten words, say 'Lucene' - 1000 and 'search' - 789, I need to run a quick span query on 'Lucene' to pull out phrases containing it, e.g. 'Companies use Lucene for searching'. I tried the approach from http://sujitpal.blogspot.in/2009/02/summarization-with-lucene.html, but it does not work in memory. Is there a Java library where I can run a quick span query like this?
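Something along these lines is roughly what I am after. This is only a rough, untested sketch using Lucene's MemoryIndex; the "summary" field name is a placeholder and the exact constructors and package locations vary a bit between Lucene versions:

// Rough, untested sketch: hold one document summary in a MemoryIndex and
// check it with a span query. Older Lucene versions need a Version argument
// for StandardAnalyzer; the field name "summary" is just a placeholder.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanCheck {
    public static void main(String[] args) {
        Analyzer analyzer = new StandardAnalyzer();

        // One MemoryIndex per document summary; cheap to build and to query.
        MemoryIndex index = new MemoryIndex();
        index.addField("summary", "Companies use Lucene for searching", analyzer);

        // "lucene" within 3 positions of "searching", in any order.
        SpanQuery near = new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("summary", "lucene")),
                new SpanTermQuery(new Term("summary", "searching"))
        }, 3, false);

        // search() returns a relevance score; greater than 0 means the spans matched.
        float score = index.search(near);
        System.out.println(score > 0 ? "phrase-like match" : "no match");
    }
}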
Thanks

On Mon, Feb 16, 2015 at 5:57 AM, Denis Bazhenov <dot...@gmail.com> wrote:

> Either you have to index those words in a facet or calculate the top 10
> words on the fly. The latter approach could be efficient enough if you are
> able to read those documents quickly. Calculating the top 10 words can be
> done quite cheaply in terms of memory and CPU, because there is in fact no
> need to sort (see https://github.com/addthis/stream-lib).
>
>
> On Feb 14, 2015, at 04:34, Maisnam Ns <maisnam...@gmail.com> wrote:
> >
> > Hi Jigar,
> >
> > Thanks for the clustering algorithm; I will see if it can be applied.
> >
> > These are not known fields, as these documents come from another search
> > engine. Every time the user changes the search string the documents will
> > vary, but I am assuming the worst case here, say about 100,000 documents.
> > Faceted search also requires knowing the facets in advance.
> >
> > You search for a string and it returns a bunch of documents, each with a
> > summary, and all I have to do is quickly find the top 10 words in those
> > summaries, which vary depending on the search query. The response time
> > is the problem: it has to be just a few seconds, and memory is the issue
> > here.
> >
> > Again, thanks for that link; I will look into it. If you find a solution,
> > please let me know.
> >
> > Thanks
> >
> > On Fri, Feb 13, 2015 at 11:12 PM, Jigar Shah <jigaronl...@gmail.com> wrote:
> >
> >> If those are known fields in the documents, you could extract the words
> >> while indexing and create facets. Lucene supports faceted search, which
> >> can give you the top-n counts of such fields and is much more efficient.
> >>
> >> Another option is to apply a clustering algorithm to the results, which
> >> can provide the top-n words; you can refer to http://search.carrot2.org
> >>
> >> On Fri, Feb 13, 2015 at 10:13 PM, Maisnam Ns <maisnam...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> Can someone help me with this use case:
> >>>
> >>> 1. I have to search for a string, and let's say the search engine (it
> >>> is not Lucene) finds this string in 100,000 documents. I need to find
> >>> the top 10 words occurring in these 100,000 documents. As the documents
> >>> are large, how do I further index them and find the top 10 words?
> >>>
> >>> 2. I am thinking of using Lucene's RAMDirectory, or memory indexing,
> >>> to find the top 10 most frequent words. Is this the right approach?
> >>> Indexing and writing to disk would be almost overkill, and a user can
> >>> search any number of times.
> >>>
> >>> Thanks in advance.
> >>
> >
> ---
> Denis Bazhenov <dot...@gmail.com>
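PS: for the top-ten counting itself, the stream-lib library Denis linked above looks like it should fit. A rough, untested sketch of how I imagine using its StreamSummary (Space-Saving) class on the document summaries (the tokenization here is just a placeholder):

// Rough, untested sketch: count words from document summaries with
// stream-lib's StreamSummary and read off the top 10. The sample summaries
// and the simple split-based tokenization are placeholders.
import com.clearspring.analytics.stream.Counter;
import com.clearspring.analytics.stream.StreamSummary;

import java.util.List;

public class TopWords {
    public static void main(String[] args) {
        // Capacity bounds memory; it only needs to be comfortably larger
        // than the k we want (k = 10 here), not the whole vocabulary size.
        StreamSummary<String> counts = new StreamSummary<>(1000);

        String[] summaries = {
                "Companies use Lucene for searching",
                "Lucene is a search library"
        };
        for (String text : summaries) {
            for (String token : text.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    counts.offer(token);
                }
            }
        }

        // Print the (approximate) top 10 words with their counts.
        List<Counter<String>> top = counts.topK(10);
        for (Counter<String> c : top) {
            System.out.println(c.getItem() + " - " + c.getCount());
        }
    }
}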