Re: processing documents in solr
I'll try reindexing the timestamp. The id-creation approach suggested by Erick sounds attractive, but the nutch/solr integration seems rather tight. I don't know where to break in to insert the id into solr. On Mon, Jul 29, 2013 at 4:11 AM, Erick Erickson erickerick...@gmail.com wrote: No, SolrJ doesn't provide this automatically. You'd be providing the counter by inserting it into the document as you created new docs. You could do this with any kind of document creation you are using. Best Erick On Mon, Jul 29, 2013 at 2:51 AM, Aditya findbestopensou...@gmail.com wrote: Hi, The easiest solution would be to have the timestamp indexed. Is there any issue in doing re-indexing? If you want to process records in batches then you need an ordered list and a bookmark. You require a field to sort on, and a counter / last id maintained as the bookmark. This is mandatory to solve your problem. If you don't want to re-index, then you need to maintain information about visited nodes. Have a database / solr core which maintains the list of IDs that have already been processed. Fetch records from Solr and, for each record, check the new DB to see whether the record has already been processed. Regards Aditya www.findbestopensource.com On Mon, Jul 29, 2013 at 10:26 AM, Joe Zhang smartag...@gmail.com wrote: Basically, I was thinking about running a range query like Shawn suggested on the tstamp field, but unfortunately it was not indexed. Range queries only work on indexed fields, right? On Sun, Jul 28, 2013 at 9:49 PM, Joe Zhang smartag...@gmail.com wrote: I've been thinking about the tstamp solution in the past few days, but too bad, the field is available but not indexed... I'm not familiar with SolrJ. Again, it sounds like SolrJ is providing the counter value. If yes, that would be equivalent to an autoincrement id. I'm indexing from Nutch though; I don't know how to feed in such a counter... On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson erickerick...@gmail.com wrote: Why wouldn't a simple timestamp work for the ordering?
Although I guess simple timestamp isn't really simple if the time settings change. So how about a simple counter field in your documents? Assuming you're indexing from SolrJ, your setup is to query q=*:*&sort=counter desc. Take the counter from the first document returned. Increment for each doc for the life of the indexing run. Now you've got, for all intents and purposes, an identity field, albeit manually maintained. Then use your counter field as Shawn suggests for pulling all the data out. FWIW, Erick On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara mcucchi...@apache.org wrote: In both cases, for better performance, first I'd load just all the IDs; after, during processing, I'd load each document. As far as the incremental requirement is concerned, it should not be difficult to write a hash function which maps a non-numerical id to a value. On Jul 27, 2013 7:03 AM, Joe Zhang smartag...@gmail.com wrote: Dear list: I have an ever-growing solr repository, and I need to process every single document to extract statistics. What would be a reasonable process that satisfies the following properties: - Exhaustive: I have to traverse every single document - Incremental: in other words, it has to allow me to divide and conquer --- if I have processed the first 20k docs, next time I can start with 20001. A simple *:* query would satisfy the 1st but not the 2nd property. In fact, given that the processing will take very long, and the repository keeps growing, it is not even clear that exhaustiveness is achieved. I'm running solr 3.6.2 in a single-machine setting; no hadoop capability yet. But I guess the same issues still hold even if I have the solr cloud environment, right, say in each shard? Any help would be greatly appreciated. Joe
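Erick's counter scheme can be sketched in a few lines. This is a hypothetical illustration, not SolrJ code: an in-memory list stands in for the Solr core, and `counter` is an assumed field name; the real version would run q=*:*&sort=counter desc&rows=1 to seed the counter and set the field in your document-creation code.

```python
# Sketch of Erick's manually maintained counter field (assumption:
# a "counter" field exists in the schema; a list of dicts stands in
# for a real Solr core).

def next_counter(index):
    """Equivalent of q=*:*&sort=counter desc&rows=1: read the highest
    counter already present, or start at 1 on an empty index."""
    if not index:
        return 1
    return max(doc["counter"] for doc in index) + 1

def index_batch(index, new_docs):
    """Assign monotonically increasing counters for the life of one
    indexing run, then add the documents."""
    counter = next_counter(index)
    for doc in new_docs:
        doc["counter"] = counter
        counter += 1
    index.extend(new_docs)

index = []
index_batch(index, [{"id": "a"}, {"id": "b"}])
index_batch(index, [{"id": "c"}])
# index now carries counters 1, 2, 3 -- an identity field in all but name
```

The key property is that each run resumes from the highest counter already indexed, so the field stays strictly increasing across runs.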
Re: processing documents in solr
I've been thinking about the tstamp solution in the past few days, but too bad, the field is available but not indexed... I'm not familiar with SolrJ. Again, it sounds like SolrJ is providing the counter value. If yes, that would be equivalent to an autoincrement id. I'm indexing from Nutch though; I don't know how to feed in such a counter... On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson erickerick...@gmail.com wrote: Why wouldn't a simple timestamp work for the ordering? Although I guess simple timestamp isn't really simple if the time settings change. So how about a simple counter field in your documents? Assuming you're indexing from SolrJ, your setup is to query q=*:*&sort=counter desc. Take the counter from the first document returned. Increment for each doc for the life of the indexing run. Now you've got, for all intents and purposes, an identity field, albeit manually maintained. Then use your counter field as Shawn suggests for pulling all the data out. FWIW, Erick On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara mcucchi...@apache.org wrote: In both cases, for better performance, first I'd load just all the IDs; after, during processing, I'd load each document. As far as the incremental requirement is concerned, it should not be difficult to write a hash function which maps a non-numerical id to a value. On Jul 27, 2013 7:03 AM, Joe Zhang smartag...@gmail.com wrote: Dear list: I have an ever-growing solr repository, and I need to process every single document to extract statistics. What would be a reasonable process that satisfies the following properties: - Exhaustive: I have to traverse every single document - Incremental: in other words, it has to allow me to divide and conquer --- if I have processed the first 20k docs, next time I can start with 20001. A simple *:* query would satisfy the 1st but not the 2nd property. In fact, given that the processing will take very long, and the repository keeps growing, it is not even clear that exhaustiveness is achieved.
I'm running solr 3.6.2 in a single-machine setting; no hadoop capability yet. But I guess the same issues still hold even if I have the solr cloud environment, right, say in each shard? Any help would be greatly appreciated. Joe
Re: processing documents in solr
Basically, I was thinking about running a range query like Shawn suggested on the tstamp field, but unfortunately it was not indexed. Range queries only work on indexed fields, right? On Sun, Jul 28, 2013 at 9:49 PM, Joe Zhang smartag...@gmail.com wrote: I've been thinking about the tstamp solution in the past few days, but too bad, the field is available but not indexed... I'm not familiar with SolrJ. Again, it sounds like SolrJ is providing the counter value. If yes, that would be equivalent to an autoincrement id. I'm indexing from Nutch though; I don't know how to feed in such a counter... On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson erickerick...@gmail.com wrote: Why wouldn't a simple timestamp work for the ordering? Although I guess simple timestamp isn't really simple if the time settings change. So how about a simple counter field in your documents? Assuming you're indexing from SolrJ, your setup is to query q=*:*&sort=counter desc. Take the counter from the first document returned. Increment for each doc for the life of the indexing run. Now you've got, for all intents and purposes, an identity field, albeit manually maintained. Then use your counter field as Shawn suggests for pulling all the data out. FWIW, Erick On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara mcucchi...@apache.org wrote: In both cases, for better performance, first I'd load just all the IDs; after, during processing, I'd load each document. As far as the incremental requirement is concerned, it should not be difficult to write a hash function which maps a non-numerical id to a value. On Jul 27, 2013 7:03 AM, Joe Zhang smartag...@gmail.com wrote: Dear list: I have an ever-growing solr repository, and I need to process every single document to extract statistics.
What would be a reasonable process that satisfies the following properties: - Exhaustive: I have to traverse every single document - Incremental: in other words, it has to allow me to divide and conquer --- if I have processed the first 20k docs, next time I can start with 20001. A simple *:* query would satisfy the 1st but not the 2nd property. In fact, given that the processing will take very long, and the repository keeps growing, it is not even clear that exhaustiveness is achieved. I'm running solr 3.6.2 in a single-machine setting; no hadoop capability yet. But I guess the same issues still hold even if I have the solr cloud environment, right, say in each shard? Any help would be greatly appreciated. Joe
Re: processing documents in solr
On a related note, inspired by what you said, Shawn, an auto increment id seems perfect here. Yet I found there is no such support in solr. The UUID only guarantees uniqueness. On Fri, Jul 26, 2013 at 10:50 PM, Joe Zhang smartag...@gmail.com wrote: Thanks for your kind reply, Shawn. On Fri, Jul 26, 2013 at 10:27 PM, Shawn Heisey s...@elyograg.org wrote: On 7/26/2013 11:02 PM, Joe Zhang wrote: I have an ever-growing solr repository, and I need to process every single document to extract statistics. What would be a reasonable process that satisfies the following properties: - Exhaustive: I have to traverse every single document - Incremental: in other words, it has to allow me to divide and conquer --- if I have processed the first 20k docs, next time I can start with 20001. If your index isn't very big, a *:* query with rows and start parameters is perfectly acceptable. Performance is terrible for this method when the index gets huge, though. == Essentially we are doing pagination here, right? If performance is not the concern, given that the index is dynamic, does the order of entries remain stable over time? If id is your uniqueKey field, here's how you can do it. If that's not your uniqueKey field, substitute your uniqueKey field for id. This method doesn't work properly if you don't use a field with values that are guaranteed to be unique. For the first query, send a query with these parameters, where NN is the number of docs you want to retrieve at once: q=*:*&rows=NN&sort=id asc For each subsequent query, use the following parameters, where XXX is the highest id value seen in the previous query: q={XXX TO *}&rows=NN&sort=id asc == This approach seems to require that the id field is numerical, right? I have a text-based id that is unique. == I'm not sure I understand the q={XXX TO *} part -- wouldn't the query be matched against the default search field, which could be content, for example? How would that do the job?
As soon as you see a numFound value less than NN, you will know that there's no more data. Generally speaking, you'd want to avoid updating the index while doing these queries. If you never replace existing documents and you can guarantee that the value in the uniqueKey field for new documents will always be higher than any previous value, then you could continue updating the index. A database autoincrement field would qualify for that condition. Thanks, Shawn
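Shawn's range-query walk amounts to cursor-style pagination on the uniqueKey. A rough simulation of the bookkeeping (a mock in-memory index instead of a Solr core; the fielded form of the range query, q=id:{XXX TO *}, is the one Shawn gives in his follow-up):

```python
# Simulation of Shawn's walk: sort by unique id, page NN at a time,
# and use the highest id seen as the exclusive lower bound of the
# next range query. The dict stands in for a Solr index.

def fetch_page(index, after_id, rows):
    """Mimic q=id:{after_id TO *}&rows=rows&sort=id asc; curly braces
    in Solr range syntax mean an exclusive bound."""
    ids = sorted(i for i in index if after_id is None or i > after_id)
    return ids[:rows]

def walk_all(index, rows=2):
    seen, last = [], None
    while True:
        page = fetch_page(index, last, rows)
        seen.extend(page)
        if len(page) < rows:      # numFound < NN: no more data
            return seen
        last = page[-1]           # highest id seen becomes the bookmark

docs = {k: {} for k in ["a", "c", "f", "g", "k"]}
assert walk_all(docs) == ["a", "c", "f", "g", "k"]
```

Note that the bookmark makes the process restartable: persisting `last` between runs gives exactly the divide-and-conquer property asked for, as long as new ids only ever sort after it.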
Re: processing documents in solr
On Fri, Jul 26, 2013 at 11:18 PM, Shawn Heisey s...@elyograg.org wrote: On 7/26/2013 11:50 PM, Joe Zhang wrote: == Essentially we are doing pagination here, right? If performance is not the concern, given that the index is dynamic, does the order of entries remain stable over time? Yes, it's pagination. Just like the other method that I've described in detail, you'd have to avoid updating the index while you were getting information. Unless you can come up with a sort parameter that's guaranteed to make sure that new documents are at the end, any changes to the index during the retrieval process will make it impossible to retrieve every document. == What I can guarantee is that there is no deletion, but I guess this is not equivalent to newly added docs being at the end, right? == I believe you are right about performance. The retrieved set becomes larger and larger. == This approach seems to require that the id field is numerical, right? I have a text-based id that is unique. StrField types work perfectly with range queries. As long as it's not a tokenized field, TextField works properly with range queries too. KeywordTokenizer is OK, as long as you don't use filters that create additional tokens. Some examples that create additional tokens are WordDelimiterFilter and EdgeNgramFilter. == so a url field would work fine? == I'm not sure I understand the q={XXX TO *} part -- wouldn't the query be matched against the default search field, which could be content, for example? How would that do the job? You are correct, I was too hasty in constructing the query. That should be: q=id:{XXX TO *}&rows=NN&sort=id asc You could speed things up if you don't need to see all stored fields in the response by using the fl parameter to only return the fields that you need. Responding to your additional message about an autoincrement field - that would only be possible if you are importing from a data source that supports autoincrement, like MySQL.
Solr itself has no support for autoincrement. Thanks, Shawn
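Putting Shawn's corrections together, one request in the loop might be assembled like this (a sketch: "abc123" is a made-up last-seen id and the fl list is an example, not a real schema):

```python
from urllib.parse import urlencode

# Hypothetical parameters for one iteration of the walk; the fielded
# range query with an exclusive lower bound, plus Shawn's fl tip to
# fetch only the fields you actually need.
params = {
    "q": "id:{abc123 TO *}",
    "rows": 100,
    "sort": "id asc",
    "fl": "id,tstamp",
}
query_string = urlencode(params)
# query_string can be appended to .../select? on the Solr core URL
```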
Re: processing documents in solr
Thanks. On Fri, Jul 26, 2013 at 11:34 PM, Shawn Heisey s...@elyograg.org wrote: On 7/27/2013 12:30 AM, Joe Zhang wrote: == so a url field would work fine? As long as it's guaranteed unique on every document (especially if it is your uniqueKey) and goes into the index as a single token, that should work just fine for the range queries I've described. Thanks, Shawn
Re: processing documents in solr
Thanks for sharing, Roman. I'll look into your code. One more thought on your suggestion, Shawn. In fact, for the id, we need more than unique and rangeable; we also need some sense of atomic values. Your approach might run into risk with a text-based id field, say: the id/key has values 'a', 'c', 'f', 'g', and our page size is 2. Your suggestion would work fine. But with newly added documents, there is no guarantee that they are not going to use the key value 'b'. And this new document would be missed by your algorithm, right? On Sat, Jul 27, 2013 at 5:32 AM, Roman Chyla roman.ch...@gmail.com wrote: Dear list, I've written a special processor exactly for this kind of operation https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs/src/java/org/apache/solr/handler/batch This is how we use it http://labs.adsabs.harvard.edu/trac/ads-invenio/wiki/SearchEngineBatch It is capable of processing an index of 200gb in a few minutes; copying/streaming large amounts of data is normal. If there is general interest, we can create a jira issue - but given my current workload, it will take longer, and also somebody else will *have to* invest their time and energy in testing it, reporting, etc. Of course, feel free to create the jira yourself or reuse the code - hopefully, you will improve it and let me know ;-) Roman On 27 Jul 2013 01:03, Joe Zhang smartag...@gmail.com wrote: Dear list: I have an ever-growing solr repository, and I need to process every single document to extract statistics. What would be a reasonable process that satisfies the following properties: - Exhaustive: I have to traverse every single document - Incremental: in other words, it has to allow me to divide and conquer --- if I have processed the first 20k docs, next time I can start with 20001. A simple *:* query would satisfy the 1st but not the 2nd property.
In fact, given that the processing will take very long, and the repository keeps growing, it is not even clear that the exhaustiveness is achieved. I'm running solr 3.6.2 in a single-machine setting; no hadoop capability yet. But I guess the same issues still hold even if I have the solr cloud environment, right, say in each shard? Any help would be greatly appreciated. Joe
Re: processing documents in solr
I have a constantly growing index, so not updating the index isn't practical... Going back to the beginning of this thread: when we use the vanilla *:* + pagination approach, would the ordering of documents remain stable? The index is dynamic: update/insertion only, no deletion. On Sat, Jul 27, 2013 at 10:28 AM, Shawn Heisey s...@elyograg.org wrote: On 7/27/2013 11:17 AM, Joe Zhang wrote: Thanks for sharing, Roman. I'll look into your code. One more thought on your suggestion, Shawn. In fact, for the id, we need more than unique and rangeable; we also need some sense of atomic values. Your approach might run into risk with a text-based id field, say: the id/key has values 'a', 'c', 'f', 'g', and our page size is 2. Your suggestion would work fine. But with newly added documents, there is no guarantee that they are not going to use the key value 'b'. And this new document would be missed by your algorithm, right? That's why I said that you would either have to not update the index or ensure that (in your example) a 'b' document never gets added. Because you can't make that kind of guarantee in most situations, not updating the index is safer. Thanks, Shawn
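Joe's 'b' scenario is easy to make concrete. A toy simulation under the same assumptions (keys 'a', 'c', 'f', 'g', page size 2, a set standing in for the index):

```python
# Demonstrates the hole Joe describes: once the bookmark has advanced
# past 'c', a concurrently inserted 'b' sorts below the bookmark and
# is never visited by the range-query walk.

def page_after(index, bookmark, rows=2):
    """Mock of q=id:{bookmark TO *}&rows=rows&sort=id asc."""
    return sorted(i for i in index if bookmark is None or i > bookmark)[:rows]

index = {"a", "c", "f", "g"}
first = page_after(index, None)          # ['a', 'c']; bookmark becomes 'c'
index.add("b")                           # concurrent insert below the bookmark
rest = page_after(index, first[-1]) + page_after(index, "g")
# 'b' < 'c', so it never appears in first or rest
```

This is exactly why Shawn's answer is to either freeze the index during the walk or guarantee new keys always sort after every existing key.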
processing documents in solr
Dear list: I have an ever-growing solr repository, and I need to process every single document to extract statistics. What would be a reasonable process that satisfies the following properties: - Exhaustive: I have to traverse every single document - Incremental: in other words, it has to allow me to divide and conquer --- if I have processed the first 20k docs, next time I can start with 20001. A simple *:* query would satisfy the 1st but not the 2nd property. In fact, given that the processing will take very long, and the repository keeps growing, it is not even clear that exhaustiveness is achieved. I'm running solr 3.6.2 in a single-machine setting; no hadoop capability yet. But I guess the same issues still hold even if I have the solr cloud environment, right, say in each shard? Any help would be greatly appreciated. Joe
Re: processing documents in solr
Thanks for your kind reply, Shawn. On Fri, Jul 26, 2013 at 10:27 PM, Shawn Heisey s...@elyograg.org wrote: On 7/26/2013 11:02 PM, Joe Zhang wrote: I have an ever-growing solr repository, and I need to process every single document to extract statistics. What would be a reasonable process that satisfies the following properties: - Exhaustive: I have to traverse every single document - Incremental: in other words, it has to allow me to divide and conquer --- if I have processed the first 20k docs, next time I can start with 20001. If your index isn't very big, a *:* query with rows and start parameters is perfectly acceptable. Performance is terrible for this method when the index gets huge, though. == Essentially we are doing pagination here, right? If performance is not the concern, given that the index is dynamic, does the order of entries remain stable over time? If id is your uniqueKey field, here's how you can do it. If that's not your uniqueKey field, substitute your uniqueKey field for id. This method doesn't work properly if you don't use a field with values that are guaranteed to be unique. For the first query, send a query with these parameters, where NN is the number of docs you want to retrieve at once: q=*:*&rows=NN&sort=id asc For each subsequent query, use the following parameters, where XXX is the highest id value seen in the previous query: q={XXX TO *}&rows=NN&sort=id asc == This approach seems to require that the id field is numerical, right? I have a text-based id that is unique. == I'm not sure I understand the q={XXX TO *} part -- wouldn't the query be matched against the default search field, which could be content, for example? How would that do the job? As soon as you see a numFound value less than NN, you will know that there's no more data. Generally speaking, you'd want to avoid updating the index while doing these queries.
If you never replace existing documents and you can guarantee that the value in the uniqueKey field for new documents will always be higher than any previous value, then you could continue updating the index. A database autoincrement field would qualify for that condition. Thanks, Shawn
Re: Question about field boost
I'm not sure I understand, Erick. I don't have a text field in my schema; title and content are both legal fields. On Tue, Jul 23, 2013 at 5:15 AM, Erick Erickson erickerick...@gmail.com wrote: this isn't doing what you think. title^10 content is actually parsed as text:title^100 text:content where text is my default search field, assuming title is a field. If you look a little farther up the debug output you'll see that. You probably want title:content^100 or some such? Erick On Tue, Jul 23, 2013 at 1:43 AM, Jack Krupansky j...@basetechnology.com wrote: That means that for that document, china occurs in the title vs. snowden, which is found in the document but not in the title. -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Tuesday, July 23, 2013 12:52 AM To: solr-user@lucene.apache.org Subject: Re: Question about field boost Is my reading correct that the boost is only applied on china but not snowden? How can that be? My query is: q=china+snowden&qf=title^10 content On Mon, Jul 22, 2013 at 9:43 PM, Joe Zhang smartag...@gmail.com wrote: Thanks for your hint, Jack. Here are the debug results, which I'm having a hard time deciphering (the two terms are china and snowden)...
0.26839527 = (MATCH) sum of:
  0.26839527 = (MATCH) sum of:
    0.26757246 = (MATCH) max of:
      7.9147343E-4 = (MATCH) weight(content:china in 249), product of:
        0.019873314 = queryWeight(content:china), product of:
          1.6649085 = idf(docFreq=46832, maxDocs=91058)
          0.01193658 = queryNorm
        0.039825942 = (MATCH) fieldWeight(content:china in 249), product of:
          4.8989797 = tf(termFreq(content:china)=24)
          1.6649085 = idf(docFreq=46832, maxDocs=91058)
          0.0048828125 = fieldNorm(field=content, doc=249)
      0.26757246 = (MATCH) weight(title:china^10.0 in 249), product of:
        0.5836803 = queryWeight(title:china^10.0), product of:
          10.0 = boost
          4.8898454 = idf(docFreq=1861, maxDocs=91058)
          0.01193658 = queryNorm
        0.45842302 = (MATCH) fieldWeight(title:china in 249), product of:
          1.0 = tf(termFreq(title:china)=1)
          4.8898454 = idf(docFreq=1861, maxDocs=91058)
          0.09375 = fieldNorm(field=title, doc=249)
    8.2282536E-4 = (MATCH) max of:
      8.2282536E-4 = (MATCH) weight(content:snowden in 249), product of:
        0.03407834 = queryWeight(content:snowden), product of:
          2.8549502 = idf(docFreq=14246, maxDocs=91058)
          0.01193658 = queryNorm
        0.024145111 = (MATCH) fieldWeight(content:snowden in 249), product of:
          1.7320508 = tf(termFreq(content:snowden)=3)
          2.8549502 = idf(docFreq=14246, maxDocs=91058)
          0.0048828125 = fieldNorm(field=content, doc=249)
On Mon, Jul 22, 2013 at 9:27 PM, Jack Krupansky j...@basetechnology.com wrote: Maybe you're not doing anything wrong - other than having an artificial expectation of what the true relevance of your data actually is. Many factors go into relevance scoring. You need to look at all aspects of your data. Maybe your terms don't occur in your titles the way you think they do. Maybe you need a boost of 500 or more... Lots of potential maybes. Relevancy tuning is an art and craft, hardly a science. Step one: Know your data, inside and out. Use the debugQuery=true parameter on your queries and see how much of the score is dominated by your query terms in the non-title fields.
-- Jack Krupansky -Original Message- From: Joe Zhang Sent: Monday, July 22, 2013 11:06 PM To: solr-user@lucene.apache.org Subject: Question about field boost Dear Solr experts: Here is my query: defType=dismax&q=term1+term2&qf=title^100 content Apparently (at least I thought) my intention is to boost the title field. While I'm getting some non-trivial results, I'm surprised that the documents with both term1 and term2 in the title (I know such docs do exist in my repository) were not returned (or maybe ranked very low). The situation does not change even when I use much larger boost factors. What am I doing wrong?
Question about field boost
Dear Solr experts: Here is my query: defType=dismax&q=term1+term2&qf=title^100 content Apparently (at least I thought) my intention is to boost the title field. While I'm getting some non-trivial results, I'm surprised that the documents with both term1 and term2 in the title (I know such docs do exist in my repository) were not returned (or maybe ranked very low). The situation does not change even when I use much larger boost factors. What am I doing wrong?
Re: Question about field boost
Thanks for your hint, Jack. Here are the debug results, which I'm having a hard time deciphering (the two terms are china and snowden)...
0.26839527 = (MATCH) sum of:
  0.26839527 = (MATCH) sum of:
    0.26757246 = (MATCH) max of:
      7.9147343E-4 = (MATCH) weight(content:china in 249), product of:
        0.019873314 = queryWeight(content:china), product of:
          1.6649085 = idf(docFreq=46832, maxDocs=91058)
          0.01193658 = queryNorm
        0.039825942 = (MATCH) fieldWeight(content:china in 249), product of:
          4.8989797 = tf(termFreq(content:china)=24)
          1.6649085 = idf(docFreq=46832, maxDocs=91058)
          0.0048828125 = fieldNorm(field=content, doc=249)
      0.26757246 = (MATCH) weight(title:china^10.0 in 249), product of:
        0.5836803 = queryWeight(title:china^10.0), product of:
          10.0 = boost
          4.8898454 = idf(docFreq=1861, maxDocs=91058)
          0.01193658 = queryNorm
        0.45842302 = (MATCH) fieldWeight(title:china in 249), product of:
          1.0 = tf(termFreq(title:china)=1)
          4.8898454 = idf(docFreq=1861, maxDocs=91058)
          0.09375 = fieldNorm(field=title, doc=249)
    8.2282536E-4 = (MATCH) max of:
      8.2282536E-4 = (MATCH) weight(content:snowden in 249), product of:
        0.03407834 = queryWeight(content:snowden), product of:
          2.8549502 = idf(docFreq=14246, maxDocs=91058)
          0.01193658 = queryNorm
        0.024145111 = (MATCH) fieldWeight(content:snowden in 249), product of:
          1.7320508 = tf(termFreq(content:snowden)=3)
          2.8549502 = idf(docFreq=14246, maxDocs=91058)
          0.0048828125 = fieldNorm(field=content, doc=249)
On Mon, Jul 22, 2013 at 9:27 PM, Jack Krupansky j...@basetechnology.com wrote: Maybe you're not doing anything wrong - other than having an artificial expectation of what the true relevance of your data actually is. Many factors go into relevance scoring. You need to look at all aspects of your data. Maybe your terms don't occur in your titles the way you think they do. Maybe you need a boost of 500 or more... Lots of potential maybes. Relevancy tuning is an art and craft, hardly a science. Step one: Know your data, inside and out.
Use the debugQuery=true parameter on your queries and see how much of the score is dominated by your query terms in the non-title fields. -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Monday, July 22, 2013 11:06 PM To: solr-user@lucene.apache.org Subject: Question about field boost Dear Solr experts: Here is my query: defType=dismax&q=term1+term2&qf=title^100 content Apparently (at least I thought) my intention is to boost the title field. While I'm getting some non-trivial results, I'm surprised that the documents with both term1 and term2 in the title (I know such docs do exist in my repository) were not returned (or maybe ranked very low). The situation does not change even when I use much larger boost factors. What am I doing wrong?
Re: Question about field boost
Is my reading correct that the boost is only applied on china but not snowden? How can that be? My query is: q=china+snowden&qf=title^10 content On Mon, Jul 22, 2013 at 9:43 PM, Joe Zhang smartag...@gmail.com wrote: Thanks for your hint, Jack. Here are the debug results, which I'm having a hard time deciphering (the two terms are china and snowden)...
0.26839527 = (MATCH) sum of:
  0.26839527 = (MATCH) sum of:
    0.26757246 = (MATCH) max of:
      7.9147343E-4 = (MATCH) weight(content:china in 249), product of:
        0.019873314 = queryWeight(content:china), product of:
          1.6649085 = idf(docFreq=46832, maxDocs=91058)
          0.01193658 = queryNorm
        0.039825942 = (MATCH) fieldWeight(content:china in 249), product of:
          4.8989797 = tf(termFreq(content:china)=24)
          1.6649085 = idf(docFreq=46832, maxDocs=91058)
          0.0048828125 = fieldNorm(field=content, doc=249)
      0.26757246 = (MATCH) weight(title:china^10.0 in 249), product of:
        0.5836803 = queryWeight(title:china^10.0), product of:
          10.0 = boost
          4.8898454 = idf(docFreq=1861, maxDocs=91058)
          0.01193658 = queryNorm
        0.45842302 = (MATCH) fieldWeight(title:china in 249), product of:
          1.0 = tf(termFreq(title:china)=1)
          4.8898454 = idf(docFreq=1861, maxDocs=91058)
          0.09375 = fieldNorm(field=title, doc=249)
    8.2282536E-4 = (MATCH) max of:
      8.2282536E-4 = (MATCH) weight(content:snowden in 249), product of:
        0.03407834 = queryWeight(content:snowden), product of:
          2.8549502 = idf(docFreq=14246, maxDocs=91058)
          0.01193658 = queryNorm
        0.024145111 = (MATCH) fieldWeight(content:snowden in 249), product of:
          1.7320508 = tf(termFreq(content:snowden)=3)
          2.8549502 = idf(docFreq=14246, maxDocs=91058)
          0.0048828125 = fieldNorm(field=content, doc=249)
On Mon, Jul 22, 2013 at 9:27 PM, Jack Krupansky j...@basetechnology.com wrote: Maybe you're not doing anything wrong - other than having an artificial expectation of what the true relevance of your data actually is. Many factors go into relevance scoring. You need to look at all aspects of your data.
Maybe your terms don't occur in your titles the way you think they do. Maybe you need a boost of 500 or more... Lots of potential maybes. Relevancy tuning is an art and craft, hardly a science. Step one: Know your data, inside and out. Use the debugQuery=true parameter on your queries and see how much of the score is dominated by your query terms in the non-title fields. -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Monday, July 22, 2013 11:06 PM To: solr-user@lucene.apache.org Subject: Question about field boost Dear Solr experts: Here is my query: defType=dismax&q=term1+term2&qf=title^100 content Apparently (at least I thought) my intention is to boost the title field. While I'm getting some non-trivial results, I'm surprised that the documents with both term1 and term2 in the title (I know such docs do exist in my repository) were not returned (or maybe ranked very low). The situation does not change even when I use much larger boost factors. What am I doing wrong?
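The explain output makes more sense once you see dismax's shape: per query term, take the max over the qf fields; then sum the per-term maxima. "snowden" has no title match in doc 249, so its max falls back to the unboosted content clause, which is why the boost appears to touch only "china". A toy re-computation (illustrative numbers, not the real Lucene formula):

```python
# Toy model of dismax scoring for doc 249: max over fields per term,
# then sum over terms. Scores below are rounded stand-ins, chosen only
# to mirror the structure of the explain dump in this thread.

def dismax_score(term_field_scores, boosts):
    total = 0.0
    for fields in term_field_scores.values():
        # per-term disjunction max over the qf fields, boost applied
        total += max(boosts[f] * s for f, s in fields.items())
    return total

boosts = {"title": 10.0, "content": 1.0}   # qf=title^10 content
doc249 = {
    "china":   {"title": 0.0268, "content": 0.0008},  # boosted title wins
    "snowden": {"content": 0.0008},                   # no title match
}
score = dismax_score(doc249, boosts)
```

So the boost is applied to both terms; it just has nothing to multiply for "snowden" in the title field.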
zero-valued retrieval scores
when I search a keyword (such as apple), most of the docs carry 0.0 as score. Here is an example from explain: <str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html">
0.0 = (MATCH) fieldWeight(content:appl in 51), product of:
  1.0 = tf(termFreq(content:appl)=1)
  2.096877 = idf(docFreq=5190, maxDocs=15546)
  0.0 = fieldNorm(field=content, doc=51)
Can somebody help me understand why fieldNorm is 0? What exactly is the formula for computing fieldNorm? Thanks!
Re: zero-valued retrieval scores
Yes, you are right, the boost on these documents is 0. I didn't provide it, though. I suppose the boost values come from Nutch (yes, my solr indexes crawled web docs). What could be wrong? Again, what exactly is the formula for fieldNorm? On Fri, Jul 12, 2013 at 8:46 PM, Jack Krupansky j...@basetechnology.com wrote: Did you put a boost of 0.0 on the documents, as opposed to the default of 1.0? x * 0.0 = 0.0 -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Friday, July 12, 2013 10:31 PM To: solr-user@lucene.apache.org Subject: zero-valued retrieval scores when I search a keyword (such as apple), most of the docs carry 0.0 as score. Here is an example from explain: <str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html">
0.0 = (MATCH) fieldWeight(content:appl in 51), product of:
  1.0 = tf(termFreq(content:appl)=1)
  2.096877 = idf(docFreq=5190, maxDocs=15546)
  0.0 = fieldNorm(field=content, doc=51)
Can somebody help me understand why fieldNorm is 0? What exactly is the formula for computing fieldNorm? Thanks!
Re: zero-valued retrieval scores
Thanks, Jack! On Fri, Jul 12, 2013 at 9:37 PM, Jack Krupansky j...@basetechnology.com wrote: For the calculation of norm, see note number 6: http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html You would need to talk to the Nutch guys to see why THEY are setting the document boost to 0.0. -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Friday, July 12, 2013 11:57 PM To: solr-user@lucene.apache.org Subject: Re: zero-valued retrieval scores Yes, you are right, the boost on these documents is 0. I didn't provide it, though. I suppose the boost values come from Nutch (yes, my solr indexes crawled web docs). What could be wrong? Again, what exactly is the formula for fieldNorm? On Fri, Jul 12, 2013 at 8:46 PM, Jack Krupansky j...@basetechnology.com wrote: Did you put a boost of 0.0 on the documents, as opposed to the default of 1.0? x * 0.0 = 0.0 -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Friday, July 12, 2013 10:31 PM To: solr-user@lucene.apache.org Subject: zero-valued retrieval scores when I search a keyword (such as apple), most of the docs carry 0.0 as score. Here is an example from explain: <str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html">
0.0 = (MATCH) fieldWeight(content:appl in 51), product of:
  1.0 = tf(termFreq(content:appl)=1)
  2.096877 = idf(docFreq=5190, maxDocs=15546)
  0.0 = fieldNorm(field=content, doc=51)
Can somebody help me understand why fieldNorm is 0? What exactly is the formula for computing fieldNorm? Thanks!
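For reference, the classic Lucene norm Jack points to multiplies the document boost, the field boost, and a length normalization of 1/sqrt(numTerms), so a 0.0 document boost zeroes fieldNorm and with it the whole score. A quick sketch (the field length of 1024 terms is a made-up example):

```python
import math

# Classic Lucene/Solr 3.x norm, per TFIDFSimilarity note 6:
#   norm = doc_boost * field_boost * lengthNorm
#   lengthNorm = 1 / sqrt(numTerms)

def field_norm(doc_boost, field_boost, num_terms):
    return doc_boost * field_boost * (1.0 / math.sqrt(num_terms))

normal = field_norm(1.0, 1.0, 1024)  # 1/sqrt(1024) = 0.03125
zeroed = field_norm(0.0, 1.0, 1024)  # a Nutch-style 0.0 doc boost kills it
```

(The real norm is also quantized to a single byte at index time, which is why explain output shows coarse values like 0.0048828125.)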
Re: document id in nutch/solr
Can somebody help with this one, please?

On Fri, Jun 21, 2013 at 10:36 PM, Joe Zhang smartag...@gmail.com wrote: A quite standard configuration of nutch seems to automatically map url to id. Two questions:

- Where is such mapping defined? I can't find it anywhere in nutch-site.xml or schema.xml. The latter does define the id field as well as its uniqueness, but not the mapping.
- Given that nutch has already defined such an id, can I ask solr to redefine id as UUID? <field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>
- This leads to a related question: do solr and nutch have to have IDENTICAL schema.xml?
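For reference, a UUID id along the lines the question sketches needs both the field and a uuid field type declared in schema.xml (a sketch for Solr 3.x-era schemas; whether Nutch's indexing job tolerates a Solr-generated id, rather than the url it maps in itself, is exactly the open question here):

```xml
<!-- schema.xml: declare the uuid type, then use it for id.
     default="NEW" asks Solr to generate a fresh UUID when none is supplied. -->
<fieldType name="uuid" class="solr.UUIDField" indexed="true"/>

<field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>

<uniqueKey>id</uniqueKey>
```

Note that later Solr 4.x releases move this toward an update-processor-based approach instead of `default="NEW"`, so check the docs for your version.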
Re: what does a zero score mean?
So, the reason is that I'm getting zero values on fieldNorm. The documentation tells me that there are 3 factors in play here:

- lengthNorm -- can this be zero?
- index-time boost -- is this the boost value we get from nutch?
- field boost -- none specified.

Can somebody help here?

On Tue, Jun 18, 2013 at 7:22 AM, Upayavira u...@odoko.co.uk wrote: debugQuery=true adds an extra block of XML to the bottom that will give you extra info. Alternatively, add fl=*,[explain] to your URL. That'll give you an extra field in your output. Then, view the source to see it structured properly. Upayavira

On Tue, Jun 18, 2013, at 02:52 PM, Joe Zhang wrote: I did include debugQuery=on in the query, but nothing extra showed up in the response.

On Mon, Jun 17, 2013 at 10:29 PM, Gora Mohanty g...@mimirtech.com wrote: On 18 June 2013 10:49, Joe Zhang smartag...@gmail.com wrote: I issued a simple query (apple) to my collection and got 201 documents back, all of which are scored 0. What does this mean? --- The documents do contain the query words. My guess is that the float-valued score is getting converted to an integer. You could also try your query with the parameter debugQuery=on to get an explanation of the scoring: http://wiki.apache.org/solr/CommonQueryParameters#debugQuery Regards, Gora
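Upayavira's two options look like this on the request URL (host and core name are hypothetical):

```
Option 1: append a debug block to the bottom of the response
  http://localhost:8983/solr/collection1/select?q=apple&debugQuery=true

Option 2: inline the explanation as a pseudo-field on each returned document
  http://localhost:8983/solr/collection1/select?q=apple&fl=*,[explain]
```

The `[explain]` form is handy when you want the per-document score breakdown next to the document itself rather than in a separate debug section.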
document id in nutch/solr
A quite standard configuration of nutch seems to automatically map url to id. Two questions:

- Where is such mapping defined? I can't find it anywhere in nutch-site.xml or schema.xml. The latter does define the id field as well as its uniqueness, but not the mapping.
- Given that nutch has already defined such an id, can I ask solr to redefine id as UUID? <field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>
- This leads to a related question: do solr and nutch have to have IDENTICAL schema.xml?
Re: what does a zero score mean?
I did include debugQuery=on in the query, but nothing extra showed up in the response. On Mon, Jun 17, 2013 at 10:29 PM, Gora Mohanty g...@mimirtech.com wrote: On 18 June 2013 10:49, Joe Zhang smartag...@gmail.com wrote: I issued a simple query (apple) to my collection and got 201 documents back, all of which are scored 0. What does this mean? --- The documents do contain the query words. My guess is that the float-valued score is getting converted to an integer. You could also try your query with the parameter debugQuery=on to get an explanation of the scoring: http://wiki.apache.org/solr/CommonQueryParameters#debugQuery Regards, Gora
what does a zero score mean?
I issued a simple query (apple) to my collection and got 201 documents back, all of which are scored 0. What does this mean? --- The documents do contain the query words.
Re: Internal statistics in Solr index?
Thank you very much! This is a good starting point!

On Fri, Dec 21, 2012 at 6:15 AM, Erick Erickson erickerick...@gmail.com wrote: Have you seen the functions here: http://wiki.apache.org/solr/FunctionQuery#Relevance_Functions Best Erick

On Thu, Dec 20, 2012 at 1:18 PM, Joe Zhang smartag...@gmail.com wrote: Dear list, Is there any way to access things such as word frequency, doc frequency in solr index? Thanks!
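The relevance functions on that wiki page expose the raw index statistics as pseudo-fields in the field list. A sketch, assuming Solr 4.x and a field named content (host, core, and field names are illustrative):

```
Per-document term frequency and collection-wide document frequency of "apple":
  http://localhost:8983/solr/collection1/select?q=*:*&fl=id,termfreq(content,'apple'),docfreq(content,'apple')
```

Each returned document then carries the computed values alongside its id, which is often easier than parsing debugQuery output when you only need the statistics.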
Re: behavior of solr.KeepWordFilterFactory
Across-the-board case-sensitive indexing is not what I want... Let me make sure I understand your suggestion:

<fieldType name="text1" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text2" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

And define content1 as text1, content2 as text2?

On Mon, Dec 3, 2012 at 1:09 AM, Xi Shen davidshe...@gmail.com wrote: Solr index is case-sensitive by default, unless you use the lower case filter. I remember I saw this topic on Solr, and the solution is simple: copy the field; use a new analyzer/tokenizer to process this field and do not use the lower case filter; when you query, make sure both fields are included.

On Mon, Dec 3, 2012 at 3:04 PM, Joe Zhang smartag...@gmail.com wrote: In other words, what I wanted to achieve is case-sensitive indexing on a small set of words. Can anybody help?

On Sun, Dec 2, 2012 at 11:56 PM, Joe Zhang smartag...@gmail.com wrote: To be more specific, this is the data type I was using:

<fieldType name="textspecial" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="tickers.txt" ignoreCase="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

On Sun, Dec 2, 2012 at 11:51 PM, Joe Zhang smartag...@gmail.com wrote: yes, that is the correct behavior.
But how do I achieve my goal, i.e., special treatment on a list of uppercase/special words, normal treatment on everything else?

On Sun, Dec 2, 2012 at 11:46 PM, Xi Shen davidshe...@gmail.com wrote: By the definition on https://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/KeepWordFilter.html , I am pretty sure it is the correct behavior of this filter :) I guess you are trying to use this filter to index some special words in Chinese?

On Mon, Dec 3, 2012 at 1:54 PM, Joe Zhang smartag...@gmail.com wrote: I defined the following data type in my solr schema.xml:

<fieldtype name="testkeep" class="solr.TextField">
  <analyzer>
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="false"/>
  </analyzer>
</fieldtype>

When I use the type testkeep to index a text field, my true expectation was to make sure solr indexes the uppercase form of a small list of words in the file, AND TREAT EVERY OTHER WORD AS USUAL. The goal of securing the closed list is achieved, but NO OTHER WORD outside the list is indexed! Can anybody help? Thanks in advance! Joe -- Regards, David Shen http://about.me/davidshen https://twitter.com/#!/davidshen84
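The copy-the-field approach discussed in this thread can be wired up in schema.xml with a copyField (a sketch with hypothetical field names; content uses a lowercasing type like text1, content_cs a case-preserving type like text2):

```xml
<!-- schema.xml: index the same text twice, once lowercased, once as-is -->
<field name="content"    type="text1" indexed="true" stored="true"/>
<field name="content_cs" type="text2" indexed="true" stored="false"/>
<copyField source="content" dest="content_cs"/>
```

At query time you then search both fields (e.g. `q=content:apple OR content_cs:AAPL`, or via dismax qf), so ordinary terms match case-insensitively while the special uppercase words can be matched exactly against content_cs.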
search behavior on a case-sensitive field
I have a field type like this:

<fieldType name="text_cs" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <!-- <filter class="solr.LowerCaseFilterFactory"/> -->
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

When I query COST, it gives reasonable results (n1); when I query CoSt, however, it gives me n2 (> n1) results, and I can't locate actual occurrences of CoSt in the docs at all. Can anybody advise?
Re: search behavior on a case-sensitive field
haha, makes perfect sense! Thanks a lot!

On Mon, Dec 3, 2012 at 9:25 PM, Jack Krupansky j...@basetechnology.com wrote: CoSt was split into two terms and the query parser generated an OR of them. Adding the autoGeneratePhraseQueries="true" attribute to your field type should fix the problem. You can also change splitOnCaseChange="1" to splitOnCaseChange="0" to avoid the term splitting issue. Be sure to completely reindex in either case. -- Jack Krupansky

-Original Message- From: Joe Zhang Sent: Monday, December 03, 2012 11:10 PM To: solr-user@lucene.apache.org Subject: search behavior on a case-sensitive field

I have a field type like this:

<fieldType name="text_cs" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <!-- <filter class="solr.LowerCaseFilterFactory"/> -->
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

When I query COST, it gives reasonable results (n1); when I query CoSt, however, it gives me n2 (> n1) results, and I can't locate actual occurrences of CoSt in the docs at all. Can anybody advise?
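Jack's first fix amounts to one attribute on the field type (a sketch; either change requires a full reindex, as he notes):

```xml
<!-- With autoGeneratePhraseQueries="true", the pieces that
     WordDelimiterFilter splits "CoSt" into are searched as a phrase
     instead of being OR'ed together by the query parser. -->
<fieldType name="text_cs" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" splitOnCaseChange="1"/>
  </analyzer>
</fieldType>
```

The alternative, `splitOnCaseChange="0"`, avoids the split entirely, so "CoSt" stays a single term at both index and query time.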
Re: multiple indexes?
This is very helpful. Thanks a lot, Shawn and Dikchant! So in the default single-core situation, the index would live in data/index, correct?

On Fri, Nov 30, 2012 at 11:02 PM, Shawn Heisey s...@elyograg.org wrote: On 11/30/2012 10:11 PM, Joe Zhang wrote: May I ask: how to set up multiple indexes, and specify which index to send the docs to at indexing time, and later on, how to specify which index to work with? A related question: what is the storage location and structure of solr indexes?

When you index or query data, you'll use a base URL specific to the index (core). Everything goes through that base URL, which includes the name of the core: http://server:port/solr/corename The file called solr.xml tells Solr about multiple cores. Each core has an instanceDir and a dataDir. http://wiki.apache.org/solr/CoreAdmin In the dataDir, Solr will create an index dir, which contains the Lucene index. Here are the file formats for recent versions: http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html http://lucene.apache.org/core/3_6_1/fileformats.html http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html Thanks, Shawn
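A minimal solr.xml for two cores, in the legacy format these Solr versions use, might look like this (a sketch; the core names are hypothetical):

```xml
<!-- solr.xml: one <core> entry per index; each has its own
     instanceDir (conf/schema.xml, solrconfig.xml) and data directory -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0"/>
    <core name="core1" instanceDir="core1"/>
  </cores>
</solr>
```

You then pick the index per request through the base URL, e.g. index via http://server:port/solr/core0/update and query via http://server:port/solr/core0/select.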
behavior of solr.KeepWordFilterFactory
I defined the following data type in my solr schema.xml:

<fieldtype name="testkeep" class="solr.TextField">
  <analyzer>
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="false"/>
  </analyzer>
</fieldtype>

When I use the type testkeep to index a text field, my true expectation was to make sure solr indexes the uppercase form of a small list of words in the file, AND TREAT EVERY OTHER WORD AS USUAL. The goal of securing the closed list is achieved, but NO OTHER WORD outside the list is indexed! Can anybody help? Thanks in advance! Joe
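For reference, KeepWordFilter is a whitelist: whatever the tokenizer emits that is not in the word file is discarded, which is exactly the behavior reported above. A sketch of the quoted type with a tokenizer made explicit (a TextField analyzer normally declares one):

```xml
<fieldtype name="testkeep" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- keeps ONLY tokens listed in keepwords.txt; all other tokens
         are dropped from the token stream, hence never indexed -->
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="false"/>
  </analyzer>
</fieldtype>
```

To index the special words this way while still indexing everything else normally, the usual pattern is two fields: one of this type and one with an ordinary analyzer, populated from the same source via copyField.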
Re: duplicated URL sent from Nutch to solr index
Thanks!

On Sun, Dec 2, 2012 at 11:20 PM, Xi Shen davidshe...@gmail.com wrote: If the value for the id field is the same, the old entry will be updated; if it is new, a new entry will be created and indexed. This is my experience. :)

On Mon, Dec 3, 2012 at 1:45 PM, Joe Zhang smartag...@gmail.com wrote: Dear list, I just want to confirm an expected behavior of solr: assuming we have <uniqueKey>id</uniqueKey> in schema.xml for solr, when we send the same URL from nutch to solr multiple times, would there be ONLY ONE entry for that URL, but the content (if changed) and timestamp would be updated? Thanks! Joe -- Regards, David Shen http://about.me/davidshen https://twitter.com/#!/davidshen84
Re: duplicated URL sent from Nutch to solr index
Sorry I didn't make it perfectly clear. The id field is URL.

On Sun, Dec 2, 2012 at 11:33 PM, Joe Zhang smartag...@gmail.com wrote: Thanks!

On Sun, Dec 2, 2012 at 11:20 PM, Xi Shen davidshe...@gmail.com wrote: If the value for the id field is the same, the old entry will be updated; if it is new, a new entry will be created and indexed. This is my experience. :)

On Mon, Dec 3, 2012 at 1:45 PM, Joe Zhang smartag...@gmail.com wrote: Dear list, I just want to confirm an expected behavior of solr: assuming we have <uniqueKey>id</uniqueKey> in schema.xml for solr, when we send the same URL from nutch to solr multiple times, would there be ONLY ONE entry for that URL, but the content (if changed) and timestamp would be updated? Thanks! Joe -- Regards, David Shen http://about.me/davidshen https://twitter.com/#!/davidshen84
Re: behavior of solr.KeepWordFilterFactory
yes, that is the correct behavior. But how do I achieve my goal, i.e., special treatment on a list of uppercase/special words, normal treatment on everything else?

On Sun, Dec 2, 2012 at 11:46 PM, Xi Shen davidshe...@gmail.com wrote: By the definition on https://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/KeepWordFilter.html , I am pretty sure it is the correct behavior of this filter :) I guess you are trying to use this filter to index some special words in Chinese?

On Mon, Dec 3, 2012 at 1:54 PM, Joe Zhang smartag...@gmail.com wrote: I defined the following data type in my solr schema.xml:

<fieldtype name="testkeep" class="solr.TextField">
  <analyzer>
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="false"/>
  </analyzer>
</fieldtype>

When I use the type testkeep to index a text field, my true expectation was to make sure solr indexes the uppercase form of a small list of words in the file, AND TREAT EVERY OTHER WORD AS USUAL. The goal of securing the closed list is achieved, but NO OTHER WORD outside the list is indexed! Can anybody help? Thanks in advance! Joe -- Regards, David Shen http://about.me/davidshen https://twitter.com/#!/davidshen84
Re: behavior of solr.KeepWordFilterFactory
To be more specific, this is the data type I was using:

<fieldType name="textspecial" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="tickers.txt" ignoreCase="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

On Sun, Dec 2, 2012 at 11:51 PM, Joe Zhang smartag...@gmail.com wrote: yes, that is the correct behavior. But how do I achieve my goal, i.e., special treatment on a list of uppercase/special words, normal treatment on everything else?

On Sun, Dec 2, 2012 at 11:46 PM, Xi Shen davidshe...@gmail.com wrote: By the definition on https://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/KeepWordFilter.html , I am pretty sure it is the correct behavior of this filter :) I guess you are trying to use this filter to index some special words in Chinese?

On Mon, Dec 3, 2012 at 1:54 PM, Joe Zhang smartag...@gmail.com wrote: I defined the following data type in my solr schema.xml:

<fieldtype name="testkeep" class="solr.TextField">
  <analyzer>
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="false"/>
  </analyzer>
</fieldtype>

When I use the type testkeep to index a text field, my true expectation was to make sure solr indexes the uppercase form of a small list of words in the file, AND TREAT EVERY OTHER WORD AS USUAL. The goal of securing the closed list is achieved, but NO OTHER WORD outside the list is indexed! Can anybody help? Thanks in advance! Joe -- Regards, David Shen http://about.me/davidshen https://twitter.com/#!/davidshen84
Re: behavior of solr.KeepWordFilterFactory
In other words, what I wanted to achieve is case-sensitive indexing on a small set of words. Can anybody help?

On Sun, Dec 2, 2012 at 11:56 PM, Joe Zhang smartag...@gmail.com wrote: To be more specific, this is the data type I was using:

<fieldType name="textspecial" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="tickers.txt" ignoreCase="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

On Sun, Dec 2, 2012 at 11:51 PM, Joe Zhang smartag...@gmail.com wrote: yes, that is the correct behavior. But how do I achieve my goal, i.e., special treatment on a list of uppercase/special words, normal treatment on everything else?

On Sun, Dec 2, 2012 at 11:46 PM, Xi Shen davidshe...@gmail.com wrote: By the definition on https://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/KeepWordFilter.html , I am pretty sure it is the correct behavior of this filter :) I guess you are trying to use this filter to index some special words in Chinese?

On Mon, Dec 3, 2012 at 1:54 PM, Joe Zhang smartag...@gmail.com wrote: I defined the following data type in my solr schema.xml:

<fieldtype name="testkeep" class="solr.TextField">
  <analyzer>
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="false"/>
  </analyzer>
</fieldtype>

When I use the type testkeep to index a text field, my true expectation was to make sure solr indexes the uppercase form of a small list of words in the file, AND TREAT EVERY OTHER WORD AS USUAL. The goal of securing the closed list is achieved, but NO OTHER WORD outside the list is indexed! Can anybody help?
Thanks in advance! Joe -- Regards, David Shen http://about.me/davidshen https://twitter.com/#!/davidshen84
Re: stopwords in solr
That is really strange. So basic stopwords such as 'a' and 'the' are not eliminated from the index?

On Tue, Nov 27, 2012 at 11:16 PM, 曹霖 cao...@babytree-inc.com wrote: just no stopwords are considered in that case

2012/11/28 Joe Zhang smartag...@gmail.com t no stopwords are considered in this case
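For anyone following this thread: stopword elimination only happens if the field's analyzer chain actually includes a stop filter pointing at a word list that contains those words. A minimal sketch:

```xml
<fieldType name="text_stop" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- drops tokens listed in stopwords.txt (e.g. "a", "the"),
         one word per line, at both index and query time -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```

If "a" and "the" are showing up in the index, check that the field in question uses a type with this filter and that stopwords.txt is populated; an empty or missing word list silently keeps everything.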