Hi Denis,

Looks good, and thanks for the links. One more request: once I have found the top ten words, say 'Lucene' - 1000 and 'search' - 789, I need to run a quick span query on 'Lucene' to pull out phrases containing it, e.g. 'Companies use Lucene for searching'. I tried the approach from http://sujitpal.blogspot.in/2009/02/summarization-with-lucene.html, but it does not work in memory. Is there a Java library where I can run a quick span query like this?
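Something along these lines is roughly what I am after. This is only a rough, untested sketch using Lucene's MemoryIndex; the "summary" field name is a placeholder and the exact constructors and package locations vary a bit between Lucene versions:

// Rough, untested sketch: hold one document summary in a MemoryIndex and
// check it with a span query. Older Lucene versions need a Version argument
// for StandardAnalyzer; the field name "summary" is just a placeholder.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanCheck {
    public static void main(String[] args) {
        Analyzer analyzer = new StandardAnalyzer();

        // One MemoryIndex per document summary; cheap to build and to query.
        MemoryIndex index = new MemoryIndex();
        index.addField("summary", "Companies use Lucene for searching", analyzer);

        // "lucene" within 3 positions of "searching", in any order.
        SpanQuery near = new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("summary", "lucene")),
                new SpanTermQuery(new Term("summary", "searching"))
        }, 3, false);

        // search() returns a relevance score; greater than 0 means the spans matched.
        float score = index.search(near);
        System.out.println(score > 0 ? "phrase-like match" : "no match");
    }
}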
Thanks

On Mon, Feb 16, 2015 at 5:57 AM, Denis Bazhenov <dot...@gmail.com> wrote:

> Either you have to index those words in a facet or calculate the top 10
> words on the fly. The latter approach could be efficient enough if you are
> able to read those documents quickly. Calculating the top 10 words can be
> done quite cheaply in terms of memory and CPU, because there is in fact no
> need to sort (see https://github.com/addthis/stream-lib).
>
>
> On Feb 14, 2015, at 04:34, Maisnam Ns <maisnam...@gmail.com> wrote:
> >
> > Hi Jigar,
> >
> > Thanks for the clustering algorithm; I will see if it can be applied.
> >
> > These are not known fields, as these documents come from another search
> > engine. Every time the user changes the search string the documents will
> > vary, but I am assuming the worst case here, say about 100,000 documents.
> > Faceted search also requires knowing the facets in advance.
> >
> > You search for a string and it returns a bunch of documents, each with a
> > summary, and all I have to do is quickly find the top 10 words in those
> > summaries, which vary depending on the search query. The response time
> > is the problem: it has to be just a few seconds, and memory is the issue
> > here.
> >
> > Again, thanks for that link; I will look into it. If you find a solution,
> > please let me know.
> >
> > Thanks
> >
> > On Fri, Feb 13, 2015 at 11:12 PM, Jigar Shah <jigaronl...@gmail.com> wrote:
> >
> >> If those are known fields in the documents, you could extract the words
> >> while indexing and create facets. Lucene supports faceted search, which
> >> can give you the top-n counts of such fields and is much more efficient.
> >>
> >> Another option is to apply a clustering algorithm to the results, which
> >> can provide the top-n words; you can refer to http://search.carrot2.org
> >>
> >> On Fri, Feb 13, 2015 at 10:13 PM, Maisnam Ns <maisnam...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> Can someone help me with this use case:
> >>>
> >>> 1. I have to search for a string, and let's say the search engine (it
> >>> is not Lucene) finds this string in 100,000 documents. I need to find
> >>> the top 10 words occurring in these 100,000 documents. As the documents
> >>> are large, how do I further index them and find the top 10 words?
> >>>
> >>> 2. I am thinking of using Lucene's RAMDirectory, or memory indexing,
> >>> to find the top 10 most frequent words. Is this the right approach?
> >>> Indexing and writing to disk would be almost overkill, and a user can
> >>> search any number of times.
> >>>
> >>> Thanks in advance.
> >>
> >
> ---
> Denis Bazhenov <dot...@gmail.com>
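PS: for the top-ten counting itself, the stream-lib library Denis linked above looks like it should fit. A rough, untested sketch of how I imagine using its StreamSummary (Space-Saving) class on the document summaries (the tokenization here is just a placeholder):

// Rough, untested sketch: count words from document summaries with
// stream-lib's StreamSummary and read off the top 10. The sample summaries
// and the simple split-based tokenization are placeholders.
import com.clearspring.analytics.stream.Counter;
import com.clearspring.analytics.stream.StreamSummary;

import java.util.List;

public class TopWords {
    public static void main(String[] args) {
        // Capacity bounds memory; it only needs to be comfortably larger
        // than the k we want (k = 10 here), not the whole vocabulary size.
        StreamSummary<String> counts = new StreamSummary<>(1000);

        String[] summaries = {
                "Companies use Lucene for searching",
                "Lucene is a search library"
        };
        for (String text : summaries) {
            for (String token : text.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    counts.offer(token);
                }
            }
        }

        // Print the (approximate) top 10 words with their counts.
        List<Counter<String>> top = counts.topK(10);
        for (Counter<String> c : top) {
            System.out.println(c.getItem() + " - " + c.getCount());
        }
    }
}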