Hi Jigar,

The link you shared, http://search.carrot2.org, is really nice; many of its features actually match my requirements.
Thanks for sharing it.

On Mon, Feb 16, 2015 at 9:20 AM, Maisnam Ns <maisnam...@gmail.com> wrote:

> Hi Denis,
>
> Looks good, and thanks for the links. One more question: once I have found
> the top ten, say 'Lucene' - 1000, 'search' - 789, I need to run a quick
> span query on 'Lucene' to pull out phrases containing it, e.g. 'Companies
> use Lucene for searching'. I tried
> http://sujitpal.blogspot.in/2009/02/summarization-with-lucene.html, but
> that does not work in memory. Is there a Java library where I can run a
> quick span query?
>
> Thanks
>
> On Mon, Feb 16, 2015 at 5:57 AM, Denis Bazhenov <dot...@gmail.com> wrote:
>
>> Either you index those words as a facet or you calculate the top 10 words
>> on the fly. The latter approach can be effective enough if you are able
>> to read those documents quickly. Calculating the top 10 words is cheap in
>> terms of memory and CPU, because no full sort is actually needed (see
>> https://github.com/addthis/stream-lib).
>>
>>> On Feb 14, 2015, at 04:34, Maisnam Ns <maisnam...@gmail.com> wrote:
>>>
>>> Hi Jigar,
>>>
>>> Thanks for the clustering algorithm; I will see if it can be applied.
>>>
>>> These are not known fields, as the documents come from another search
>>> engine. Every time the user changes the search string the documents
>>> will vary, but I am assuming the worst case here, say about 100,000
>>> documents. For faceted search we would also need to know the facets in
>>> advance.
>>>
>>> You search for a string and it returns a bunch of documents, each with
>>> a summary, and all I have to do is quickly find the top 10 words in
>>> those summaries, which vary depending on the search query. Response
>>> time is the problem: it has to be within a few seconds, and memory is
>>> the constraint.
>>>
>>> Again, thanks for that link; I will look into it.
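Denis's point that no full sort is needed can be sketched in plain Java: count tokens in a HashMap, then keep only the k most frequent entries with a small min-heap, which is O(n log k). The names below are illustrative; for true bounded-memory streaming, stream-lib's StreamSummary (Space-Saving algorithm) does the same job without holding the full vocabulary.

```java
import java.util.*;

public class TopWords {

    // Return the top-k words by frequency across the given summaries.
    // A size-k min-heap avoids sorting the whole vocabulary.
    static List<Map.Entry<String, Long>> topWords(List<String> summaries, int k) {
        Map<String, Long> counts = new HashMap<>();
        for (String s : summaries) {
            for (String token : s.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1L, Long::sum);
                }
            }
        }
        // Min-heap ordered by count; evict the smallest once size exceeds k.
        PriorityQueue<Map.Entry<String, Long>> heap =
            new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll();
            }
        }
        List<Map.Entry<String, Long>> top = new ArrayList<>(heap);
        top.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return top;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "Companies use Lucene for searching",
            "Lucene search is fast",
            "Search engines love Lucene");
        for (Map.Entry<String, Long> e : topWords(docs, 2)) {
            System.out.println(e.getKey() + " - " + e.getValue());
        }
        // prints:
        // lucene - 3
        // search - 2
    }
}
```

Only the final heap contents (at most k entries) are sorted, so the cost stays small even when the 100,000 summaries produce a large vocabulary.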
>>> If you find a solution, please let me know.
>>>
>>> Thanks
>>>
>>> On Fri, Feb 13, 2015 at 11:12 PM, Jigar Shah <jigaronl...@gmail.com> wrote:
>>>
>>>> If those are known fields in the documents, you can extract words
>>>> while indexing and create facets. Lucene supports faceted search,
>>>> which can give you the top-n counts for such fields and is much more
>>>> efficient.
>>>>
>>>> Another option is to apply a clustering algorithm to the results,
>>>> which can provide the top-n words; you can refer to
>>>> http://search.carrot2.org
>>>>
>>>> On Fri, Feb 13, 2015 at 10:13 PM, Maisnam Ns <maisnam...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Can someone help me with this use case?
>>>>>
>>>>> 1. I have to search for a string, and let's say the search engine (it
>>>>> is not Lucene) finds this string in 100,000 documents. I need to find
>>>>> the top 10 words occurring in those 100,000 documents. As the
>>>>> document set is large, how do I further index these documents and
>>>>> find the top 10 words?
>>>>> 2. I am thinking of using a Lucene RAMDirectory, or memory indexing,
>>>>> to find the top 10 most frequent words.
>>>>> 3. Is this the right approach? Indexing and writing to disk would be
>>>>> almost overkill, and a user can search any number of times.
>>>>>
>>>>> Thanks in advance.

---
Denis Bazhenov <dot...@gmail.com>
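On the in-memory span-query question upthread: Lucene's MemoryIndex (the lucene-memory module) indexes a single document entirely in RAM, with no disk writes, and can be probed with span queries. That fits the "index a summary, query it, throw it away" pattern. A minimal sketch, assuming a Lucene 4.x/5.x-era API; the field name "summary" and the sample text are illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanDemo {
    public static void main(String[] args) {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // MemoryIndex holds exactly one document in RAM.
        MemoryIndex index = new MemoryIndex();
        index.addField("summary", "Companies use Lucene for searching", analyzer);

        // Match "use" followed by "lucene" within 2 positions, in order.
        SpanQuery query = new SpanNearQuery(new SpanQuery[] {
            new SpanTermQuery(new Term("summary", "use")),
            new SpanTermQuery(new Term("summary", "lucene"))
        }, 2, true);

        // search() returns a relevance score; > 0 means the spans matched.
        float score = index.search(query);
        System.out.println(score > 0 ? "match" : "no match");
    }
}
```

For each summary you would build a fresh MemoryIndex, run the span query, and keep the summaries that score above zero as candidate phrases; nothing is ever written to disk, which addresses the memory and latency constraints described above.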