Re: Keyword extraction

Aleksander M. Stensby Thu, 27 Nov 2008 04:59:04 -0800

Hi again Patrick.

Glad to hear that we can help you guys. That's what this mailing list is for :)


First of all, I think you use the wrong parameter to get your terms.

Take a look at http://lucene.apache.org/solr/api/org/apache/solr/common/params/MoreLikeThisParams.html to see the supported params. In your string you use mlt.displayTerms=list, which I believe should be mlt.interestingTerms=list.


If that doesn't work:

One thing you should know is that, from what I can tell, you are using the StandardRequestHandler for your querying. The StandardRequestHandler supports a simplified handling of MoreLikeThis queries, namely: "This method returns similar documents for each document in the response set." It supports the common mlt parameters, needs mlt=true (as you have done), and supports an mlt.count parameter to specify the number of similar documents returned for each matching doc from your query.

If you want to get the "top keywords" etc. (and, in essence, for your mlt.interestingTerms=list parameter to have any effect at all, if I'm not completely wrong), you will need to configure a MoreLikeThisHandler in your solrconfig.xml and then map your query to it.


From the sample configuration file:

Incoming queries will be dispatched to the correct handler based on the path or the qt (query type) param. Names starting with a '/' are accessed with a path equal to the registered name. Names without a leading '/' are accessed with: http://host/app/select?qt=name. If no qt is defined, the requestHandler that declares default="true" will be used.

You can read about the MoreLikeThisHandler here: http://wiki.apache.org/solr/MoreLikeThisHandler


Once you have it configured properly your query would be something like:

http://localhost:8983/solr/mlt?q=amsterdam&mlt.fl=text&mlt.interestingTerms=list&mlt=true
(I don't think you need the mlt=true here, though...)

or
http://localhost:8983/solr/select?qt=mlt&q=amsterdam&mlt.fl=text&mlt.interestingTerms=list&mlt=true
(in the last example I use qt=mlt)
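
As a rough illustration, the two URL styles above could be assembled programmatically like this. It is only a sketch in Python: the helper name build_mlt_url is made up, and it assumes the handler is registered as "/mlt" (first form) or as "mlt" and selected via qt (second form), which depends on your solrconfig.xml.

```python
from urllib.parse import urlencode


def build_mlt_url(base, query, field="text", use_qt=False):
    """Build a MoreLikeThis request URL in one of the two styles shown above.

    `base` is the Solr root, e.g. http://localhost:8983/solr (an assumption;
    adjust host, port and handler name to your own setup).
    """
    params = [
        ("q", query),
        ("mlt.fl", field),
        ("mlt.interestingTerms", "list"),
        ("mlt", "true"),  # mirrored from the URLs above; possibly not needed
    ]
    if use_qt:
        # Go through /select and pick the handler with qt=mlt.
        return base + "/select?" + urlencode([("qt", "mlt")] + params)
    # Address the handler directly by its registered path.
    return base + "/mlt?" + urlencode(params)


print(build_mlt_url("http://localhost:8983/solr", "amsterdam"))
print(build_mlt_url("http://localhost:8983/solr", "amsterdam", use_qt=True))
```

Which of the two forms works depends on whether the handler name in solrconfig.xml starts with a '/'.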

Hope this helps.
Regards,
 Aleksander

On Thu, 27 Nov 2008 11:49:30 +0100, Plaatje, Patrick <[EMAIL PROTECTED]> wrote:

Hi Aleksander,

With all the help from you and the other comments, we're now at a point where a MoreLikeThis list is returned, showing 10 related records. However, no keywords whatsoever are returned for the query we execute. Is the querystring still wrong, or is something else required?


The querystring we're currently executing is:

http://suempnr3:8080/solr/select/?q=amsterdam&mlt.fl=text&mlt.displayTerms=list&mlt=true


Best,

Patrick

-----Original Message-----
From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 26 November 2008 15:07
To: solr-user@lucene.apache.org
Subject: Re: Keyword extraction

Ah, yes, that is important. In Lucene, MLT checks whether the term vector is stored; if it is not, it can still perform the query, but in a much, much less efficient way. Lucene will analyze the document (and the variable DEFAULT_MAX_NUM_TOKENS_PARSED will be used to limit the number of tokens that will be parsed). (I don't want to go into details on this since I haven't really dug through the code :p) But when the field isn't stored either, it is rather difficult to re-analyze the document ;)

On a general note, if you want to "really" understand how MLT works, take a look at the wiki or read this thorough blog post:

http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/

Regards,
  Aleksander

On Wed, 26 Nov 2008 14:41:52 +0100, Plaatje, Patrick <[EMAIL PROTECTED]> wrote:

Hi Aleksander,

This was a typo on my end; the original query included a colon
instead of an equals sign. But I think it has to do with my field not
being stored and not being defined with termVectors="true". I'm
recreating the index now, and will see if this fixes the problem.

Best,

patrick

-----Original Message-----
From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 26 November 2008 14:37
To: solr-user@lucene.apache.org
Subject: Re: Keyword extraction

Hi there!
Well, first of all, I think you have an error in your query, if I'm not
mistaken.
You say http://localhost:8080/solr/select/?q=id=18477975...
but since you are referring to the field called "id", you must say:
http://localhost:8080/solr/select/?q=id:18477975...
(use a colon instead of the equals sign).
I think that will do the trick.
If not, try adding &debugQuery=on at the end of your request URL
to see debug output on how the query is parsed and if/how any
documents are matched against your query.
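
A small sketch of the corrected request, in Python. The helper name build_debug_url is hypothetical, and the host and port are taken from the URL above; it only shows that the field and value are joined with a colon inside q, while '=' separates URL parameter names from their values.

```python
from urllib.parse import urlencode


def build_debug_url(base, field, value):
    """Build a field query (field:value) with debug output switched on."""
    params = {
        "q": f"{field}:{value}",  # colon, not '=', between field and value
        "debugQuery": "on",       # ask Solr to explain how the query was parsed
    }
    # urlencode percent-encodes the colon as %3A; the server decodes it again.
    return base + "/select/?" + urlencode(params)


print(build_debug_url("http://localhost:8080/solr", "id", "18477975"))
```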
Hope this helps.

Cheers,
  Aleksander



On Wed, 26 Nov 2008 13:08:30 +0100, Plaatje, Patrick
<[EMAIL PROTECTED]> wrote:

Hi Aleksander,

Thanks for clearing this up. I am confident that this is a route worth
exploring for me, as I'm just starting to grasp the matter. Do you know
why I'm not getting any results with the query posted earlier, then?
It gives me only the following:

<lst name="moreLikeThis">
        <result name="18477975" numFound="0" start="0"/>
</lst>

Instead of delivering details of the interestingTerms.

Thanks in advance

Patrick


-----Original Message-----
From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 26 November 2008 13:03
To: solr-user@lucene.apache.org
Subject: Re: Keyword extraction

I do not agree with you at all. The concept of MoreLikeThis is based
on the fundamental idea of TF-IDF weighting, and not term frequency
alone.
Please take a look at:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/MoreLikeThis.html
As you can see, it is possible to use cut-off thresholds to
significantly reduce the number of unimportant terms, and to generate
highly suitable queries based on the tf-idf weight of each term. As you
point out, high-frequency terms alone tend to be useless for querying,
but taking the document frequency into account drastically increases
the importance of the term!

In Solr, use parameters to manipulate your desired results:
http://wiki.apache.org/solr/MoreLikeThis#head-6460069f297626f2a982f1e22ec5d1519c456b2c
For instance:
mlt.mintf - Minimum Term Frequency - the frequency below which terms
will be ignored in the source doc.
mlt.mindf - Minimum Document Frequency - words that do not occur in
at least this many docs will be ignored.
You can also set thresholds for term length etc.
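
To make the idea concrete, here is a minimal Python sketch of ranking a document's terms by tf * idf with cut-offs in the spirit of mlt.mintf and mlt.mindf. It is an illustration only, not Lucene's or Solr's actual code: the function name, the toy corpus and the idf formula (log(N / (df + 1)) + 1) are all assumptions chosen for the example.

```python
import math
from collections import Counter


def interesting_terms(doc_tokens, corpus, min_tf=2, min_df=1, top_n=5):
    """Rank the terms of one document by tf * idf, applying simple cut-offs."""
    num_docs = len(corpus)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    # Term frequency within the source document.
    tf = Counter(doc_tokens)
    scored = []
    for term, freq in tf.items():
        # Skip terms that are too rare in the doc or in the collection.
        if freq < min_tf or df[term] < min_df:
            continue
        idf = math.log(num_docs / (df[term] + 1)) + 1.0
        scored.append((term, freq * idf))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Toy, already-tokenized corpus (made up for this example).
corpus = [
    ["the", "canals", "of", "amsterdam", "the", "amsterdam", "museum"],
    ["the", "weather", "in", "the", "north"],
    ["the", "history", "of", "the", "city"],
    ["the", "trains", "and", "the", "trams"],
]
print(interesting_terms(corpus[0], corpus))
```

In this toy corpus "the" and "amsterdam" both occur twice in the first document, but "the" appears in every document, so its idf is low and "amsterdam" ranks above it.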

Hope this gives you a better idea of things.
- Aleks

On Wed, 26 Nov 2008 12:38:38 +0100, Scurtu Vitalie
<[EMAIL PROTECTED]>
wrote:

Dear Patrick, I had the same problem with the MoreLikeThis function.

After briefly reading and analyzing the source code of the MoreLikeThis
function in Solr, I concluded:

MoreLikeThis uses term vectors to rank all the terms from a
document by their frequency. According to this ranking, it will start
to generate queries, artificially, and search for documents.

So, moreLikeThis will retrieve related documents by artificially
generating queries based on most frequent terms.

There's a big problem with the "most frequent terms" from documents.
The most frequent words are usually meaningless: they are so-called
function words, or, as people from Information Retrieval like to call
them, stopwords.
However, even ignoring the technical problems of implementing the
MoreLikeThis function, this approach is very dangerous, since
queries are generated artificially based on a given document.
Writing queries for retrieving a document is a human task, and it
assumes some knowledge (the user knows what document he wants).

I advise using other approaches, depending on your expectations.
For example, you can extract similar documents just by searching for
documents with similar title (more like this doesn't work in this
case).

I hope it helps,
Best Regards,
Vitalie Scurtu
--- On Wed, 11/26/08, Plaatje, Patrick
<[EMAIL PROTECTED]>
wrote:
From: Plaatje, Patrick <[EMAIL PROTECTED]>
Subject: RE:  Keyword extraction
To: solr-user@lucene.apache.org
Date: Wednesday, November 26, 2008, 10:52 AM

Hi All,
as an addition to my previous post, no interestingTerms are returned
when I execute the following URL:
http://localhost:8080/solr/select/?q=id=18477975&mlt.fl=text&mlt.interestingTerms=list&mlt=true&mlt.match.include=true
I get a moreLikeThis list though, any thoughts?
Best,
Patrick
