Re: Keyword extraction
Hi again Patrick,

Glad to hear that we can help you guys. That's what this mailing list is for :)

First of all, I think you are using the wrong parameter to get your terms. Take a look at http://lucene.apache.org/solr/api/org/apache/solr/common/params/MoreLikeThisParams.html to see the supported params. In your query string you use mlt.displayTerms=list, which I believe should be mlt.interestingTerms=list.

If that doesn't work: from what I can tell, you are using the StandardRequestHandler for your queries. The StandardRequestHandler supports a simplified handling of MoreLikeThis queries, namely: "This method returns similar documents for each document in the response set." It supports the common mlt parameters, needs mlt=true (as you have done) and supports an mlt.count parameter to specify the number of similar documents returned for each matching doc from your query.

If you want to get the "top keywords" etc. (and, in essence, for your mlt.interestingTerms=list parameter to have any effect at all, if I'm not completely wrong), you will need to configure a MoreLikeThisHandler in your solrconfig.xml and then map that to your query.

From the sample configuration file: incoming queries will be dispatched to the correct handler based on the path or the qt (query type) param. Names starting with a '/' are accessed with a path equal to the registered name. Names without a leading '/' are accessed with: http://host/app/select?qt=name If no qt is defined, the requestHandler that declares default="true" will be used.

You can read about the MoreLikeThisHandler here: http://wiki.apache.org/solr/MoreLikeThisHandler

Once you have it configured properly, your query would be something like:
http://localhost:8983/solr/mlt?q=amsterdam&mlt.fl=text&mlt.interestingTerms=list&mlt=true (I don't think you need the mlt=true here, though...)
or
http://localhost:8983/solr/select?qt=mlt&q=amsterdam&mlt.fl=text&mlt.interestingTerms=list&mlt=true (in the last example I use qt=mlt)

Hope this helps.

Regards,
Aleksander

On Thu, 27 Nov 2008 11:49:30 +0100, Plaatje, Patrick <[EMAIL PROTECTED]> wrote:

> Hi Aleksander,
>
> With all the help of you and the other comments, we're now at a point where a MoreLikeThis list is returned, and shows 10 related records. However on the query executed there are no keywords whatsoever being returned. Is the querystring still wrong or is something else required?
>
> The querystring we're currently executing is:
> http://suempnr3:8080/solr/select/?q=amsterdam&mlt.fl=text&mlt.displayTerms=list&mlt=true
>
> Best,
> Patrick
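The two request forms described in the reply above (handler addressed by path, or through /select with qt) can be sketched as follows. This is only an illustration: the helper name `mlt_request_url` is hypothetical, the host and port are copied from the example URLs, and it assumes a MoreLikeThisHandler has actually been registered under the name "/mlt" or "mlt" in solrconfig.xml.

```python
from urllib.parse import urlencode

# Host and port copied from the example URLs in the message above.
BASE = "http://localhost:8983/solr"


def mlt_request_url(query, field, by_path=True):
    """Build a MoreLikeThis request URL (hypothetical helper).

    by_path=True  -> handler addressed by its registered path (/mlt)
    by_path=False -> handler addressed through /select with qt=mlt
    Both forms assume the handler name chosen in solrconfig.xml is "mlt".
    """
    params = [
        ("q", query),
        ("mlt.fl", field),
        ("mlt.interestingTerms", "list"),
    ]
    if by_path:
        return f"{BASE}/mlt?{urlencode(params)}"
    return f"{BASE}/select?{urlencode([('qt', 'mlt')] + params)}"


if __name__ == "__main__":
    print(mlt_request_url("amsterdam", "text"))
    print(mlt_request_url("amsterdam", "text", by_path=False))
```

The sketch leaves out mlt=true, following the remark that it is probably not needed when the dedicated handler is used.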
RE: Keyword extraction
Hi Aleksander,

With all the help from you and the other commenters, we're now at a point where a MoreLikeThis list is returned and shows 10 related records. However, the query returns no keywords whatsoever. Is the query string still wrong, or is something else required?

The query string we're currently executing is:
http://suempnr3:8080/solr/select/?q=amsterdam&mlt.fl=text&mlt.displayTerms=list&mlt=true

Best,
Patrick

-Original Message-
From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 26 November 2008 15:07
To: solr-user@lucene.apache.org
Subject: Re: Keyword extraction

Ah, yes, that is important. In Lucene, MLT checks whether the term vector is stored; if it is not, it can still perform the query, but in a much, much less efficient way: Lucene will analyze the document (and the variable DEFAULT_MAX_NUM_TOKENS_PARSED will be used to limit the number of tokens that will be parsed). (I don't want to go into details on this since I haven't really dug through the code :p) But when the field isn't stored either, it is rather difficult to re-analyze the document ;)

On a general note, if you want to "really" understand how MLT works, take a look at the wiki or read this thorough blog post: http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/

Regards,
Aleksander
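A quick way to spot a misspelled MoreLikeThis parameter in a query string like the one in this message is to compare the mlt.* names against a known list. The sketch below is an assumption-laden illustration: `KNOWN_MLT_PARAMS` contains only the parameter names that come up in this thread, not the full set documented in MoreLikeThisParams, and `unknown_mlt_params` is a hypothetical helper.

```python
from urllib.parse import urlsplit, parse_qsl

# Partial list: only the MoreLikeThis parameters mentioned in this thread.
# The authoritative list is in the MoreLikeThisParams documentation.
KNOWN_MLT_PARAMS = {
    "mlt",
    "mlt.fl",
    "mlt.count",
    "mlt.mintf",
    "mlt.mindf",
    "mlt.maxqt",
    "mlt.interestingTerms",
    "mlt.match.include",
}


def unknown_mlt_params(url):
    """Return the mlt.* parameter names in the URL that are not in the known list."""
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    names = {name for name, _ in pairs}
    return sorted(
        name
        for name in names
        if (name == "mlt" or name.startswith("mlt.")) and name not in KNOWN_MLT_PARAMS
    )


if __name__ == "__main__":
    url = ("http://suempnr3:8080/solr/select/"
           "?q=amsterdam&mlt.fl=text&mlt.displayTerms=list&mlt=true")
    print(unknown_mlt_params(url))
```

For the query string quoted in this message, the only name flagged is mlt.displayTerms.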
Re: Keyword extraction
Sorry for not writing clearly.

Yes, it works well for its purpose, and I didn't mean to say that the MoreLikeThis component does not work at all. At the same time, it's good to know the limitations and problems of the MoreLikeThis function.

What I want to point out is that query generation is one of the fundamental problems in Information Retrieval, and independent of how the MoreLikeThis function is implemented, it can give inappropriate results.

Best Wishes,
Vitalie Scurtu

--- On Wed, 11/26/08, Aleksander M. Stensby <[EMAIL PROTECTED]> wrote:
From: Aleksander M. Stensby <[EMAIL PROTECTED]>
Subject: Re: Keyword extraction
To: solr-user@lucene.apache.org
Date: Wednesday, November 26, 2008, 2:43 PM

I'm sure that for certain problems and cases you will need to do quite a bit of tweaking to make it work (to suit your needs), but I responded to your statement because you made it sound like the MoreLikeThis component does not work at all for its purpose, while it actually does work as intended and can be of great aid in constructing queries to retrieve same-topic documents etc.

- Aleksander
Re: Keyword extraction
Unfortunately, as it stands, interestingTerms and debugQuery do not explain why Solr chose the matches it did for MoreLikeThis. There is currently a task in JIRA to add this information to debugQuery. The ticket can be found here: https://issues.apache.org/jira/browse/SOLR-860

-Jeff

On 11/26/08 5:41 AM, "Plaatje, Patrick" <[EMAIL PROTECTED]> wrote:

> Hi Aleksander,
>
> This was a typo on my end, the original query included a semicolon instead of
> an equal sign. But I think it has to do with my field not being stored and not
> being identified as termVectors="true". I'm recreating the index now, and see
> if this fixes the problem.
>
> Best,
>
> patrick
Re: Keyword extraction
Ah, yes, that is important. In Lucene, MLT checks whether the term vector is stored; if it is not, it can still perform the query, but in a much, much less efficient way: Lucene will analyze the document (and the variable DEFAULT_MAX_NUM_TOKENS_PARSED will be used to limit the number of tokens that will be parsed). (I don't want to go into details on this since I haven't really dug through the code :p) But when the field isn't stored either, it is rather difficult to re-analyze the document ;)

On a general note, if you want to "really" understand how MLT works, take a look at the wiki or read this thorough blog post: http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/

Regards,
Aleksander

On Wed, 26 Nov 2008 14:41:52 +0100, Plaatje, Patrick <[EMAIL PROTECTED]> wrote:

> Hi Aleksander,
>
> This was a typo on my end, the original query included a semicolon instead of an equal sign. But I think it has to do with my field not being stored and not being identified as termVectors="true". I'm recreating the index now, and see if this fixes the problem.
>
> Best,
> patrick
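The field requirements discussed in this message (stored and/or termVectors="true" on the field named in mlt.fl) can be checked against a schema with a few lines of Python. This is a minimal sketch under stated assumptions: `SCHEMA_FRAGMENT` is a made-up excerpt in the style of a Solr schema.xml, the type names "string" and "text" are placeholders, and `mlt_field_status` is a hypothetical helper. Treating a missing attribute as false is a simplification, since real defaults can come from the field type.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema.xml excerpt; field and type names are illustrative only.
SCHEMA_FRAGMENT = """
<fields>
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="text" type="text" indexed="true" stored="true" termVectors="true"/>
</fields>
"""


def mlt_field_status(schema_xml, field_name):
    """Report whether a field is stored and whether it keeps term vectors.

    Returns None if the field is not declared. A missing attribute is
    treated as "false" here, which is a simplification.
    """
    root = ET.fromstring(schema_xml)
    for field in root.iter("field"):
        if field.get("name") == field_name:
            return {
                "stored": field.get("stored", "false") == "true",
                "termVectors": field.get("termVectors", "false") == "true",
            }
    return None


if __name__ == "__main__":
    for name in ("text", "id", "title"):
        print(name, mlt_field_status(SCHEMA_FRAGMENT, name))
```

A field that reports neither stored nor termVectors is the case described above where the document cannot be re-analyzed, so it is worth checking before reindexing.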
Re: Keyword extraction
I'm sure that for certain problems and cases you will need to do quite a bit of tweaking to make it work (to suit your needs), but I responded to your statement because you made it sound like the MoreLikeThis component does not work at all for its purpose, while it actually does work as intended and can be of great aid in constructing queries to retrieve same-topic documents etc.

- Aleksander

On Wed, 26 Nov 2008 14:10:57 +0100, Scurtu Vitalie <[EMAIL PROTECTED]> wrote:

> Yes, I totally understand, and agree.
>
> MoreLikeThis uses TF-IDF to rank terms, then it generates queries based on the top-ranked terms. In any case, I wasn't able to make it work after many attempts.
>
> Finally, I used a different method for query generation, and it works better, or at least gives some results, while with MoreLikeThis the results were poor or there were none at all.
>
> I should mention that my index was composed of short documents, therefore the intersection between the top-ranked terms by TF-IDF was the empty set. MoreLikeThis works better when you have long documents.
>
> Yes, I've changed the thresholds for min TFIDF and max TFIDF, and other parameters.
>
> I've also used the "mlt.maxqt" parameter to increase the number of terms used in query generation, but it still didn't work well, since generating queries from the terms with the highest TF-IDF score doesn't produce a representative query for the document. I wasn't able to tune it. For a low value such as mlt.maxqt=3,4, results were poor, while for mlt.maxqt=5,6>>> it gave too many and irrelevant results.
>
> Thank you,
> Best Wishes,
> Vitalie Scurtu
>
> --- On Wed, 11/26/08, Aleksander M. Stensby <[EMAIL PROTECTED]> wrote:
> From: Aleksander M. Stensby
> Subject: Re: Keyword extraction
> To: solr-user@lucene.apache.org
> Date: Wednesday, November 26, 2008, 1:03 PM
>
>> I do not agree with you at all. The concept of MoreLikeThis is based on the fundamental idea of TF-IDF weighting, and not term frequency alone.
>> Please take a look at:
>> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/MoreLikeThis.html
>> As you can see, it is possible to use cut-off thresholds to significantly reduce the number of unimportant terms, and generate highly suitable queries based on the tf-idf frequency of the term, since, as you point out, high frequency terms alone tend to be useless for querying, but taking the document frequency into account drastically increases the importance of the term!
>>
>> In Solr, use parameters to manipulate your desired results:
>> http://wiki.apache.org/solr/MoreLikeThis#head-6460069f297626f2a982f1e22ec5d1519c456b2c
>> For instance:
>> mlt.mintf - Minimum Term Frequency - the frequency below which terms will be ignored in the source doc.
>> mlt.mindf - Minimum Document Frequency - the frequency at which words will be ignored which do not occur in at least this many docs.
>> You can also set thresholds for term length etc.
>>
>> Hope this gives you a better idea of things.
>> - Aleks
>>
>> On Wed, 26 Nov 2008 12:38:38 +0100, Scurtu Vitalie <[EMAIL PROTECTED]> wrote:
>>
>>> Dear Patrick, I had the same problem with the MoreLikeThis function.
>>>
>>> After briefly reading and analyzing the source code of the moreLikeThis function in Solr, I concluded:
>>>
>>> MoreLikeThis uses term vectors to rank all the terms from a document by their frequency. According to this ranking, it will start to generate queries, artificially, and search for documents.
>>>
>>> So, moreLikeThis will retrieve related documents by artificially generating queries based on the most frequent terms.
>>>
>>> There's a big problem with "most frequent terms" from documents. The most frequent words are usually meaningless, so-called function words, or, as people from Information Retrieval like to call them, stopwords. However, ignoring technical problems in the implementation of the moreLikeThis function, this approach is very dangerous, since queries are generated artificially based on a given document. Writing queries for retrieving a document is a human task, and it assumes some knowledge (the user knows what document he wants).
>>>
>>> I advise using other approaches, depending on your expectations. For example, you can extract similar documents just by searching for documents with a similar title (more like this doesn't work in this case).
>>>
>>> I hope it helps,
>>> Best Regards,
>>> Vitalie Scurtu
>>>
>>> --- On Wed, 11/26/08, Plaatje, Patrick <[EMAIL PROTECTED]> wrote:
>>> From: Plaatje, Patrick <[EMAIL PROTECTED]>
>>> Subject: RE: Keyword extraction
>>> To: solr-user@lucene.apache.org
>>> Date: Wednesday, November 26, 2008, 10:52 AM
>>>
>>>> Hi All,
>>>>
>>>> as an addition to my previous post, no interestingTerms are returned when I execute the following url:
>>>> http://localhost:8080/solr/select/?q=id=18477975&mlt.fl=text&mlt.interestingTerms=list&mlt=true&mlt.match.include=true
>>>>
>>>> I get a moreLikeThis list though, any thoughts?
>>>>
>>>> Best,
>>>> Patrick

--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no
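The disagreement in this thread (term frequency alone versus TF-IDF with cut-off thresholds) can be made concrete with a toy ranking. The sketch below is not Lucene's MoreLikeThis code. It assumes raw term frequency multiplied by an idf of the form log(N / (df + 1)) + 1, and the `min_tf` and `min_df` arguments only imitate what mlt.mintf and mlt.mindf are described as doing above. The function name and the sample corpus are made up.

```python
import math
from collections import Counter


def interesting_terms(doc_tokens, corpus, min_tf=1, min_df=1, max_terms=5):
    """Rank the terms of one document by tf * idf (toy model, not Lucene code).

    min_tf drops terms that occur fewer than min_tf times in the document;
    min_df drops terms that occur in fewer than min_df corpus documents.
    """
    n_docs = len(corpus)
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))  # count each term once per document

    tf = Counter(doc_tokens)
    scored = []
    for term, freq in tf.items():
        if freq < min_tf or df[term] < min_df:
            continue
        idf = math.log(n_docs / (df[term] + 1)) + 1.0  # assumed idf form
        scored.append((freq * idf, term))

    # Highest score first; ties broken alphabetically for a stable result.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_terms]]


CORPUS = [
    "the amsterdam canals the amsterdam tour".split(),
    "the weather report".split(),
    "the museum guide".split(),
    "the harbour".split(),
]

if __name__ == "__main__":
    print(interesting_terms(CORPUS[0], CORPUS, max_terms=3))
```

In this made-up corpus "the" is as frequent as "amsterdam" in the first document, but because it occurs in every document its idf is low and it falls out of the top three. In very short documents every term occurs about once, so the ranking collapses to idf alone, which may be related to the poor results reported for short documents.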
RE: Keyword extraction
Hi Aleksander,

This was a typo on my end, the original query included a semicolon instead of an equal sign. But I think it has to do with my field not being stored and not being identified as termVectors="true". I'm recreating the index now, and will see if this fixes the problem.

Best,
patrick

-Original Message-
From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 26 November 2008 14:37
To: solr-user@lucene.apache.org
Subject: Re: Keyword extraction

Hi there!

Well, first of all, I think you have an error in your query, if I'm not mistaken. You say
http://localhost:8080/solr/select/?q=id=18477975...
but since you are referring to the field called "id", you must say:
http://localhost:8080/solr/select/?q=id:18477975...
(use a colon instead of the equals sign). I think that will do the trick.

If not, try adding &debugQuery=on at the end of your request URL to see debug output on how the query is parsed and if/how any documents are matched against your query.

Hope this helps.

Cheers,
Aleksander
Re: Keyword extraction
Hi there!

Well, first of all I think you have an error in your query, if I'm not mistaken. You say http://localhost:8080/solr/select/?q=id=18477975... but since you are referring to the field called "id", you must say: http://localhost:8080/solr/select/?q=id:18477975... (use a colon instead of the equals sign). I think that will do the trick.

If not, try adding &debugQuery=on at the end of your request URL to see debug output on how the query is parsed and if/how any documents are matched against your query.

Hope this helps.

Cheers,
Aleksander

On Wed, 26 Nov 2008 13:08:30 +0100, Plaatje, Patrick <[EMAIL PROTECTED]> wrote:

> Hi Aleksander,
>
> Thanx for clearing this up. I am confident that this is a way to explore for me as I'm just starting to grasp the matter. Do you know why I'm not getting any results with the query posted earlier then? It gives me the following only:
>
> Instead of delivering details of the interestingTerms.
>
> Thanks in advance
>
> Patrick

--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no
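To make the colon-versus-equals point concrete, here is a minimal Python sketch that builds the corrected request URL. The host, port and parameter values are copied from the URLs in this thread; the helper name build_mlt_query is made up for illustration and is not part of Solr or any client library. Whether mlt.interestingTerms has any effect on the /select handler is a separate question discussed elsewhere in this thread.

```python
from urllib.parse import urlencode

# Host, port and path are taken from the thread; adjust for your own setup.
BASE_URL = "http://localhost:8080/solr/select/"


def build_mlt_query(doc_id, field="text", debug=False):
    """Build a /select URL asking for MoreLikeThis results for one document.

    Hypothetical helper for illustration only.
    """
    params = {
        "q": f"id:{doc_id}",  # field:value, with a colon rather than "="
        "mlt": "true",
        "mlt.fl": field,
        "mlt.interestingTerms": "list",
        "mlt.match.include": "true",
    }
    if debug:
        # Ask Solr to include debug output on how the query was parsed.
        params["debugQuery"] = "on"
    # urlencode percent-encodes the colon in the q value; Solr decodes it again.
    return BASE_URL + "?" + urlencode(params)


url = build_mlt_query("18477975", debug=True)
print(url)
```

Building the query string with urlencode also avoids the kind of hand-typed slip (a stray space or a wrong separator) that started this sub-thread.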
Re: Keyword extraction
Yes, I totally understand, and agree. MoreLikeThis uses TF-IDF to rank terms, then it generates queries based on the top-ranked terms.

In any case, I wasn't able to make it work after many attempts. In the end I used a different method for query generation, and it works better, or at least gives some results, while with MoreLikeThis the results were poor or there were no results at all.

I should mention that my index was composed of short documents, so the intersection between the top-ranked terms by TF-IDF was the empty set. MoreLikeThis works better when you have long documents.

Yes, I've changed the thresholds for minimum and maximum TF-IDF, and other parameters. I've also used the "mlt.maxqt" parameter to increase the number of terms used in query generation, but it still didn't work well, since generating queries from the terms with the highest TF-IDF score doesn't produce a representative query for the document. I wasn't able to tune it. For a low value such as mlt.maxqt=3 or 4, results were poor, while for mlt.maxqt=5, 6 and up it gave too many irrelevant results.

Thank you,
Best Wishes,
Vitalie Scurtu

--- On Wed, 11/26/08, Aleksander M. Stensby <[EMAIL PROTECTED]> wrote:
From: Aleksander M. Stensby
Subject: Re: Keyword extraction
To: solr-user@lucene.apache.org
Date: Wednesday, November 26, 2008, 1:03 PM

> I do not agree with you at all. The concept of MoreLikeThis is based on the fundamental idea of TF-IDF weighting, and not term frequency alone. Please take a look at: http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/MoreLikeThis.html
>
> As you can see, it is possible to use cut-off thresholds to significantly reduce the number of unimportant terms, and to generate highly suitable queries based on the TF-IDF score of each term. As you point out, high-frequency terms alone tend to be useless for querying, but taking the document frequency into account makes the selected terms far more meaningful!
>
> In Solr, use parameters to manipulate your desired results: http://wiki.apache.org/solr/MoreLikeThis#head-6460069f297626f2a982f1e22ec5d1519c456b2c
>
> For instance:
> mlt.mintf - Minimum Term Frequency - the frequency below which terms will be ignored in the source doc.
> mlt.mindf - Minimum Document Frequency - the frequency at which words will be ignored which do not occur in at least this many docs.
> You can also set thresholds for term length etc.
>
> Hope this gives you a better idea of things.
>
> - Aleks
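The mlt.maxqt trade-off described above can be illustrated with a toy Python sketch. It assumes the generated query combines its terms as optional (OR) clauses, so that any document containing at least one of the terms becomes a candidate; as far as I know that is how Lucene's MoreLikeThis builds its query, but treat it as an assumption here. The documents and the term ranking below are invented, and real scoring (which would still favour documents matching more terms) is ignored.

```python
def matches(query_terms, docs):
    """Return indices of docs containing at least one query term (OR semantics)."""
    wanted = set(query_terms)
    return [i for i, doc in enumerate(docs) if wanted & set(doc)]


# Hypothetical short documents, e.g. titles or one-line posts.
docs = [
    "cheap flights amsterdam".split(),
    "amsterdam canal tour".split(),
    "cheap hotel paris".split(),
    "canal boat rental".split(),
    "paris museum tour".split(),
]

# Terms of the source document, best first; the ranking itself is assumed.
ranked_terms = ["amsterdam", "canal", "tour", "cheap"]

for max_qt in (1, 2, 4):
    hits = matches(ranked_terms[:max_qt], docs)
    print(f"maxqt={max_qt}: {len(hits)} candidate docs {hits}")
```

With very short documents each extra query term pulls in documents that share only that one word, so the candidate set grows quickly while the overlap with the source document stays small. That is at least consistent with the "too few, then too many" behaviour reported above.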
Re: Keyword extraction
You might also be interested in http://wiki.apache.org/solr/TermVectorComponent

On Wed, Nov 26, 2008 at 12:39 AM, Plaatje, Patrick <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> Struggling with a question I recently got from a colleague: is it possible to extract keywords from indexed content?
>
> In my opinion it should be possible to find out on what words the ranking of the indexed content is the highest (Lucene or Solr), but have no clue where to begin. Anyone having suggestions?
>
> Best,
>
> Patrick

--
Regards,
Shalin Shekhar Mangar.
RE: Keyword extraction
Hi Aleksander,

Thanx for clearing this up. I am confident that this is a way to explore for me as I'm just starting to grasp the matter. Do you know why I'm not getting any results with the query posted earlier then? It gives me the following only:

Instead of delivering details of the interestingTerms.

Thanks in advance

Patrick

-----Original Message-----
From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 26 November 2008 13:03
To: solr-user@lucene.apache.org
Subject: Re: Keyword extraction

I do not agree with you at all. The concept of MoreLikeThis is based on the fundamental idea of TF-IDF weighting, and not term frequency alone. Please take a look at: http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/MoreLikeThis.html

As you can see, it is possible to use cut-off thresholds to significantly reduce the number of unimportant terms, and to generate highly suitable queries based on the TF-IDF score of each term. As you point out, high-frequency terms alone tend to be useless for querying, but taking the document frequency into account makes the selected terms far more meaningful!

In Solr, use parameters to manipulate your desired results: http://wiki.apache.org/solr/MoreLikeThis#head-6460069f297626f2a982f1e22ec5d1519c456b2c

For instance:
mlt.mintf - Minimum Term Frequency - the frequency below which terms will be ignored in the source doc.
mlt.mindf - Minimum Document Frequency - the frequency at which words will be ignored which do not occur in at least this many docs.
You can also set thresholds for term length etc.

Hope this gives you a better idea of things.

- Aleks
Re: Keyword extraction
I do not agree with you at all. The concept of MoreLikeThis is based on the fundamental idea of TF-IDF weighting, and not term frequency alone. Please take a look at: http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/MoreLikeThis.html

As you can see, it is possible to use cut-off thresholds to significantly reduce the number of unimportant terms, and to generate highly suitable queries based on the TF-IDF score of each term. As you point out, high-frequency terms alone tend to be useless for querying, but taking the document frequency into account makes the selected terms far more meaningful!

In Solr, use parameters to manipulate your desired results: http://wiki.apache.org/solr/MoreLikeThis#head-6460069f297626f2a982f1e22ec5d1519c456b2c

For instance:
mlt.mintf - Minimum Term Frequency - the frequency below which terms will be ignored in the source doc.
mlt.mindf - Minimum Document Frequency - the frequency at which words will be ignored which do not occur in at least this many docs.
You can also set thresholds for term length etc.

Hope this gives you a better idea of things.

- Aleks

On Wed, 26 Nov 2008 12:38:38 +0100, Scurtu Vitalie <[EMAIL PROTECTED]> wrote:

> Dear Patrick, I had the same problem with the MoreLikeThis function.
>
> After briefly reading and analyzing the source code of the moreLikeThis function in Solr, I concluded:
>
> MoreLikeThis uses term vectors to rank all the terms in a document by their frequency. According to this ranking, it will start to generate queries, artificially, and search for documents.
>
> So, moreLikeThis will retrieve related documents by artificially generating queries based on the most frequent terms.
>
> There's a big problem with the "most frequent terms" from documents. The most frequent words are usually meaningless: so-called function words, or, as people from Information Retrieval like to call them, stopwords. However, ignoring the technical problems of the moreLikeThis implementation, this approach is very dangerous, since queries are generated artificially based on a given document. Writing queries for retrieving a document is a human task, and it assumes some knowledge (the user knows what document he wants).
>
> I advise using other approaches, depending on your expectations. For example, you can extract similar documents just by searching for documents with a similar title (MoreLikeThis doesn't work in this case).
>
> I hope it helps,
> Best Regards,
> Vitalie Scurtu

--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no
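The difference between ranking by raw term frequency and ranking by TF-IDF with cut-offs can be sketched in a few lines of Python. This is a simplified illustration and not Lucene's actual scoring formula: the function interesting_terms, the idf definition (log(N/df) + 1) and the toy corpus are all assumptions made up for this example. The min_tf and min_df arguments only mimic the role that mlt.mintf and mlt.mindf play in Solr.

```python
import math
from collections import Counter


def interesting_terms(doc, corpus, min_tf=2, min_df=2, top_n=5):
    """Rank the terms of `doc` by a simple TF-IDF score after applying cut-offs.

    Simplified illustration only; not Lucene's exact formula.
    """
    tf = Counter(doc)
    n_docs = len(corpus)
    scored = []
    for term, freq in tf.items():
        df = sum(1 for d in corpus if term in d)  # document frequency
        if freq < min_tf or df < min_df:
            continue  # below the cut-off thresholds, ignore the term
        idf = math.log(n_docs / df) + 1.0
        scored.append((term, freq * idf))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Toy corpus, invented for this example.
corpus = [
    "the canals of amsterdam and the museums of amsterdam in the summer".split(),
    "the weather of the city and the forecast".split(),
    "the trains to amsterdam and the trams".split(),
    "the market and the prices of the market".split(),
]
doc = corpus[0]

most_frequent = Counter(doc).most_common(1)[0][0]
ranked = interesting_terms(doc, corpus)
print("most frequent term:", most_frequent)
print("top TF-IDF term:", ranked[0][0])
```

In this toy corpus the most frequent word in the source document is "the", but because it occurs in every document its idf is at the minimum, and "amsterdam" comes out on top once document frequency is taken into account.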
RE: Keyword extraction
Dear Patrick, I had the same problem with the MoreLikeThis function.

After briefly reading and analyzing the source code of the moreLikeThis function in Solr, I concluded:

MoreLikeThis uses term vectors to rank all the terms in a document by their frequency. According to this ranking, it will start to generate queries, artificially, and search for documents.

So, moreLikeThis will retrieve related documents by artificially generating queries based on the most frequent terms.

There's a big problem with the "most frequent terms" from documents. The most frequent words are usually meaningless: so-called function words, or, as people from Information Retrieval like to call them, stopwords. However, ignoring the technical problems of the moreLikeThis implementation, this approach is very dangerous, since queries are generated artificially based on a given document. Writing queries for retrieving a document is a human task, and it assumes some knowledge (the user knows what document he wants).

I advise using other approaches, depending on your expectations. For example, you can extract similar documents just by searching for documents with a similar title (MoreLikeThis doesn't work in this case).

I hope it helps,
Best Regards,
Vitalie Scurtu

--- On Wed, 11/26/08, Plaatje, Patrick <[EMAIL PROTECTED]> wrote:
From: Plaatje, Patrick <[EMAIL PROTECTED]>
Subject: RE: Keyword extraction
To: solr-user@lucene.apache.org
Date: Wednesday, November 26, 2008, 10:52 AM

> Hi All,
>
> as an addition to my previous post, no interestingTerms are returned when I execute the following URL:
>
> http://localhost:8080/solr/select/?q=id=18477975&mlt.fl=text&mlt.interestingTerms=list&mlt=true&mlt.match.include=true
>
> I get a moreLikeThis list though, any thoughts?
>
> Best,
>
> Patrick
RE: Keyword extraction
Hi All,

as an addition to my previous post, no interestingTerms are returned when I execute the following URL:

http://localhost:8080/solr/select/?q=id=18477975&mlt.fl=text&mlt.interestingTerms=list&mlt=true&mlt.match.include=true

I get a moreLikeThis list though, any thoughts?

Best,

Patrick
Re: Keyword extraction
Lots of approaches out there...

The easiest "off the shelf" method would be to use the MoreLikeThisHandler and get the top "interesting" terms: http://wiki.apache.org/solr/MoreLikeThisHandler

ryan

On Nov 25, 2008, at 2:09 PM, Plaatje, Patrick wrote:

> Hi all,
>
> Struggling with a question I recently got from a colleague: is it possible to extract keywords from indexed content?
>
> In my opinion it should be possible to find out on what words the ranking of the indexed content is the highest (Lucene or Solr), but have no clue where to begin. Anyone having suggestions?
>
> Best,
>
> Patrick