Hi Jigar,

The link you shared, http://search.carrot2.org, is really nice; many of its features actually match my requirements.
Thanks for sharing it.

On Mon, Feb 16, 2015 at 9:20 AM, Maisnam Ns <maisnam...@gmail.com> wrote:

> Hi Denis,
>
> Looks good, and thanks for the links. One more question: once I have found
> the top ten, say 'Lucene' - 1000, 'search' - 789, I need to run a quick
> span query on 'Lucene' to pull out phrases containing it, e.g. 'Companies
> use Lucene for searching'. I tried
> http://sujitpal.blogspot.in/2009/02/summarization-with-lucene.html, but
> that does not work in memory. Is there a Java library where I can run a
> quick span query?
>
> Thanks
>
> On Mon, Feb 16, 2015 at 5:57 AM, Denis Bazhenov <dot...@gmail.com> wrote:
>
>> Either you index those words as a facet or you calculate the top 10 words
>> on the fly. The latter approach can be effective enough if you are able
>> to read those documents quickly. Calculating the top 10 words is cheap in
>> terms of memory and CPU, because no full sort is actually needed (see
>> https://github.com/addthis/stream-lib).
>>
>>> On Feb 14, 2015, at 04:34, Maisnam Ns <maisnam...@gmail.com> wrote:
>>>
>>> Hi Jigar,
>>>
>>> Thanks for the clustering algorithm; I will see if it can be applied.
>>>
>>> These are not known fields, as the documents come from another search
>>> engine. Every time the user changes the search string the documents
>>> will vary, but I am assuming the worst case here, say about 100,000
>>> documents. For faceted search we would also need to know the facets in
>>> advance.
>>>
>>> You search for a string and it returns a bunch of documents, each with
>>> a summary, and all I have to do is quickly find the top 10 words in
>>> those summaries, which vary depending on the search query. Response
>>> time is the problem: it has to be within a few seconds, and memory is
>>> the constraint.
>>>
>>> Again, thanks for that link; I will look into it.
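Denis's point that no full sort is needed can be sketched in plain Java: count tokens in a HashMap, then keep only the k most frequent entries with a small min-heap, which is O(n log k). The names below are illustrative; for true bounded-memory streaming, stream-lib's StreamSummary (Space-Saving algorithm) does the same job without holding the full vocabulary.

```java
import java.util.*;

public class TopWords {

    // Return the top-k words by frequency across the given summaries.
    // A size-k min-heap avoids sorting the whole vocabulary.
    static List<Map.Entry<String, Long>> topWords(List<String> summaries, int k) {
        Map<String, Long> counts = new HashMap<>();
        for (String s : summaries) {
            for (String token : s.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1L, Long::sum);
                }
            }
        }
        // Min-heap ordered by count; evict the smallest once size exceeds k.
        PriorityQueue<Map.Entry<String, Long>> heap =
            new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll();
            }
        }
        List<Map.Entry<String, Long>> top = new ArrayList<>(heap);
        top.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return top;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "Companies use Lucene for searching",
            "Lucene search is fast",
            "Search engines love Lucene");
        for (Map.Entry<String, Long> e : topWords(docs, 2)) {
            System.out.println(e.getKey() + " - " + e.getValue());
        }
        // prints:
        // lucene - 3
        // search - 2
    }
}
```

Only the final heap contents (at most k entries) are sorted, so the cost stays small even when the 100,000 summaries produce a large vocabulary.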
>>> If you find a solution, please let me know.
>>>
>>> Thanks
>>>
>>> On Fri, Feb 13, 2015 at 11:12 PM, Jigar Shah <jigaronl...@gmail.com> wrote:
>>>
>>>> If those are known fields in the documents, you can extract words
>>>> while indexing and create facets. Lucene supports faceted search,
>>>> which can give you the top-n counts for such fields and is much more
>>>> efficient.
>>>>
>>>> Another option is to apply a clustering algorithm to the results,
>>>> which can provide the top-n words; you can refer to
>>>> http://search.carrot2.org
>>>>
>>>> On Fri, Feb 13, 2015 at 10:13 PM, Maisnam Ns <maisnam...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Can someone help me with this use case?
>>>>>
>>>>> 1. I have to search for a string, and let's say the search engine (it
>>>>> is not Lucene) finds this string in 100,000 documents. I need to find
>>>>> the top 10 words occurring in those 100,000 documents. As the
>>>>> document set is large, how do I further index these documents and
>>>>> find the top 10 words?
>>>>> 2. I am thinking of using a Lucene RAMDirectory, or memory indexing,
>>>>> to find the top 10 most frequent words.
>>>>> 3. Is this the right approach? Indexing and writing to disk would be
>>>>> almost overkill, and a user can search any number of times.
>>>>>
>>>>> Thanks in advance.

---
Denis Bazhenov <dot...@gmail.com>
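On the in-memory span-query question upthread: Lucene's MemoryIndex (the lucene-memory module) indexes a single document entirely in RAM, with no disk writes, and can be probed with span queries. That fits the "index a summary, query it, throw it away" pattern. A minimal sketch, assuming a Lucene 4.x/5.x-era API; the field name "summary" and the sample text are illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanDemo {
    public static void main(String[] args) {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // MemoryIndex holds exactly one document in RAM.
        MemoryIndex index = new MemoryIndex();
        index.addField("summary", "Companies use Lucene for searching", analyzer);

        // Match "use" followed by "lucene" within 2 positions, in order.
        SpanQuery query = new SpanNearQuery(new SpanQuery[] {
            new SpanTermQuery(new Term("summary", "use")),
            new SpanTermQuery(new Term("summary", "lucene"))
        }, 2, true);

        // search() returns a relevance score; > 0 means the spans matched.
        float score = index.search(query);
        System.out.println(score > 0 ? "match" : "no match");
    }
}
```

For each summary you would build a fresh MemoryIndex, run the span query, and keep the summaries that score above zero as candidate phrases; nothing is ever written to disk, which addresses the memory and latency constraints described above.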