Re: Keyword extraction

Jeff Newburn Wed, 26 Nov 2008 06:10:02 -0800

Unfortunately, as it stands, neither interestingTerms nor debugQuery
explains why Solr chose the matches it did for moreLikeThis.  There is
currently a task in JIRA to add that information to debugQuery.


The ticket can be found here: https://issues.apache.org/jira/browse/SOLR-860

-Jeff


On 11/26/08 5:41 AM, "Plaatje, Patrick" <[EMAIL PROTECTED]>
wrote:

> Hi Aleksander,
> 
> This was a typo on my end; the original query used a colon, not an equals
> sign. But I think it has to do with my field not being stored and not being
> indexed with termVectors="true". I'm recreating the index now and will see
> if this fixes the problem.
> 
> Best,
> 
> patrick
> 
> -----Original Message-----
> From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, 26 November 2008 14:37
> To: solr-user@lucene.apache.org
> Subject: Re: Keyword extraction
> 
> Hi there!
> Well, first of all I think you have an error in your query, if I'm not
> mistaken.
> You say http://localhost:8080/solr/select/?q=id=18477975...
> but since you are referring to the field called "id", you must say:
> http://localhost:8080/solr/select/?q=id:18477975...
> (use colon instead of the equals sign).
> I think that will do the trick.
> If not, try adding &debugQuery=on to the end of your request URL to see
> debug output on how the query is parsed and if/how any documents are matched
> against your query.
> Hope this helps.
> 
> Cheers,
>   Aleksander
> 
> 
> 
> On Wed, 26 Nov 2008 13:08:30 +0100, Plaatje, Patrick
> <[EMAIL PROTECTED]> wrote:
> 
>> Hi Aleksander,
>> 
>> Thanks for clearing this up. I am confident this is an avenue worth
>> exploring, as I'm just starting to grasp the matter. Do you know why
>> I'm not getting any results with the query posted earlier, then? It
>> gives me only the following:
>> 
>> <lst name="moreLikeThis">
>>   <result name="18477975" numFound="0" start="0"/>
>> </lst>
>> 
>> Instead of delivering details of the interestingTerms.
>> 
>> Thanks in advance
>> 
>> Patrick
>> 
>> 
>> -----Original Message-----
>> From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED]
>> Sent: Wednesday, 26 November 2008 13:03
>> To: solr-user@lucene.apache.org
>> Subject: Re: Keyword extraction
>> 
>> I do not agree with you at all. The concept of MoreLikeThis is based
>> on the fundamental idea of TF-IDF weighting, and not term frequency alone.
>> Please take a look at:
>> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/MoreLikeThis.html
>> As you can see, it is possible to use cut-off thresholds to
>> significantly reduce the number of unimportant terms and to generate
>> highly suitable queries based on the TF-IDF weight of each term. As
>> you point out, high-frequency terms alone tend to be useless for
>> querying, but taking the document frequency into account drastically
>> improves how term importance is judged!
>> 
>> In Solr, use parameters to tune the results:
>> http://wiki.apache.org/solr/MoreLikeThis#head-6460069f297626f2a982f1e22ec5d1519c456b2c
>> For instance:
>> mlt.mintf - Minimum Term Frequency - the frequency below which terms
>> will be ignored in the source doc.
>> mlt.mindf - Minimum Document Frequency - words that do not occur in at
>> least this many docs will be ignored.
>> You can also set thresholds for term length etc.
>> 
>> Hope this gives you a better idea of things.
>> - Aleks
>> 
>> On Wed, 26 Nov 2008 12:38:38 +0100, Scurtu Vitalie <[EMAIL PROTECTED]>
>> wrote:
>> 
>>> Dear Patrick, I had the same problem with the MoreLikeThis function.
>>> 
>>> After briefly reading and analyzing the source code of the moreLikeThis
>>> function in Solr, I concluded:
>>> 
>>> MoreLikeThis uses term vectors to rank all the terms from a document
>>> by frequency. According to this ranking, it artificially generates
>>> queries and searches for documents.
>>> 
>>> So, moreLikeThis will retrieve related documents by artificially
>>> generating queries based on the most frequent terms.
>>> 
>>> There's a big problem with the "most frequent terms" from documents.
>>> The most frequent words are usually meaningless: so-called function
>>> words, or, as people in Information Retrieval like to call them, stopwords.
>>> However, ignoring the technical problems of implementing the
>>> moreLikeThis function, this approach is very dangerous, since queries
>>> are generated artificially based on a given document.
>>> Writing queries for retrieving a document is a human task, and it
>>> assumes some knowledge (the user knows what document he wants).
>>> 
>>> I advise using other approaches, depending on your expectations. For
>>> example, you can extract similar documents just by searching for
>>> documents with a similar title (more like this doesn't work in this case).
>>> 
>>> I hope it helps,
>>> Best Regards,
>>> Vitalie Scurtu
>>> --- On Wed, 11/26/08, Plaatje, Patrick
>>> <[EMAIL PROTECTED]>
>>> wrote:
>>> From: Plaatje, Patrick <[EMAIL PROTECTED]>
>>> Subject: RE:  Keyword extraction
>>> To: solr-user@lucene.apache.org
>>> Date: Wednesday, November 26, 2008, 10:52 AM
>>> 
>>> Hi All,
>>> as an addition to my previous post, no interestingTerms are returned
>>> when I execute the following URL:
>>> http://localhost:8080/solr/select/?q=id=18477975&mlt.fl=text&mlt.interestingTerms=list&mlt=true&mlt.match.include=true
>>> I get a moreLikeThis list though, any thoughts?
>>> Best,
>>> Patrick
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> --
>> Aleksander M. Stensby
>> Senior software developer
>> Integrasco A/S
>> www.integrasco.no
>> 
> 
> 
> 
> --
> Aleksander M. Stensby
> Senior software developer
> Integrasco A/S
> www.integrasco.no
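
For anyone wanting to script the request discussed in this thread, here is a
minimal sketch in Python that builds the URL with the colon form of the query
(id:18477975) plus debugQuery=on. The helper name build_mlt_url and the
mlt.mintf / mlt.mindf values of 1 are illustrative assumptions, not settings
anyone in the thread recommended; the host, path, field name and the other
parameters are copied from the URLs quoted above.

```python
from urllib.parse import urlencode


def build_mlt_url(base, doc_id, field="text", debug=True):
    """Build a MoreLikeThis request URL (hypothetical helper)."""
    params = [
        ("q", f"id:{doc_id}"),          # colon, not "=", between field and value
        ("mlt", "true"),
        ("mlt.fl", field),
        ("mlt.interestingTerms", "list"),
        ("mlt.match.include", "true"),
        ("mlt.mintf", "1"),             # illustrative threshold, an assumption
        ("mlt.mindf", "1"),             # illustrative threshold, an assumption
    ]
    if debug:
        params.append(("debugQuery", "on"))
    # urlencode percent-encodes the colon in the q value as %3A
    return base + "?" + urlencode(params)


url = build_mlt_url("http://localhost:8080/solr/select/", 18477975)
print(url)
```

Whether interestingTerms actually come back still depends on the schema and
the request handler; as noted in the thread, the field may need to be
re-indexed with termVectors="true".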

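
To make the term-frequency versus TF-IDF point in this thread concrete, here
is a small Python sketch. It is not the Lucene code: the function names, the
sample counts, and the idf form 1 + ln(numDocs / (docFreq + 1)) are
assumptions chosen for illustration. The min_tf and min_df arguments mimic
the role of mlt.mintf and mlt.mindf as described above.

```python
import math


def idf(doc_freq, num_docs):
    # Assumed idf form for illustration: 1 + ln(N / (df + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def rank_terms(term_freqs, doc_freqs, num_docs, min_tf=1, min_df=1):
    """Score each term by tf * idf, skipping terms below the thresholds."""
    scored = []
    for term, tf in term_freqs.items():
        df = doc_freqs.get(term, 0)
        if tf < min_tf or df < min_df:
            continue  # same role as mlt.mintf / mlt.mindf
        scored.append((term, tf * idf(df, num_docs)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up counts: "the" is the most frequent term in the source document,
# but it also occurs in almost every document of the index.
term_freqs = {"the": 12, "solr": 6, "termvector": 3}
doc_freqs = {"the": 9990, "solr": 120, "termvector": 15}

by_tf = sorted(term_freqs, key=term_freqs.get, reverse=True)
by_tfidf = [term for term, _ in rank_terms(term_freqs, doc_freqs, num_docs=10000)]

print(by_tf)     # ranking by raw term frequency
print(by_tfidf)  # ranking by tf * idf
```

With these made-up numbers, raw frequency puts "the" first, while the tf * idf
score pushes it to the bottom, which is the effect of taking document
frequency into account.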