Re: facets and stopwords
hossman wrote:
: but are you sure that example would actually cause a problem?
:
: i suspect if you index that exact sentence as is you wouldn't see the facet count for si or que increase at all.
:
: If you do a query for {!raw field=content}que you bypass the query parser (which is respecting your stopwords file) and see all docs that contain the raw term que in the content field.
:
: if you look at some of the docs that match, and paste their content field into the analysis tool, i think you'll see that the problem comes from using the whitespace tokenizer, and is masked by using the WDF after the stop filter ... things like Que? are getting ignored by the stopfilter, but ultimately winding up in your index as que
:
: -Hoss

Yes, you are right: que? que, que... I need to change the analyzer. They are not caught by the stop filter because I use the whitespace tokenizer. I will use the StandardTokenizer.

Thanks, Hoss

--
View this message in context: http://www.nabble.com/facets-and-stopwords-tp23952823p24390157.html
Sent from the Solr - User mailing list archive at Nabble.com.
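For readers following the thread, the fix agreed on above can be sketched in a few lines of Python. This is only an illustration, not Solr code: the regex tokenizer is a rough stand-in for a tokenizer that drops punctuation (the real StandardTokenizer has its own rules), and STOPWORDS is an assumed subset of the poster's stopwords.txt. The point it shows is that once punctuation is removed before the stop filter runs, que? and que... are recognized as stopwords.

```python
import re

# Assumed subset of stopwords.txt, for illustration only.
STOPWORDS = {"que", "si", "no", "la"}

def punctuation_aware_tokenize(text):
    # Rough stand-in for a tokenizer that discards punctuation.
    # It is NOT an implementation of Solr's StandardTokenizer.
    return re.findall(r"\w+", text)

def analyze(text):
    # Tokenize first, then drop stopwords case-insensitively, then lowercase.
    tokens = punctuation_aware_tokenize(text)
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

print(analyze("Que? que, que... la casa"))  # ['casa']
```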
Re: facets and stopwords
: http://projecte01.development.barcelonamedia.org/fonetic/
: you will see a Top Words list (in Spanish and stemmed) in the list there
: is the word si which is in 20649 documents.
: If you click at this word, the system will perform the query
: (x) content:si, with no answers at all
: The same for la it is in 17881 documents, but the query content:la will
: give no answers at all
...
: To see what's going on on the index I have tested with the analyzer
: http://projecte01.development.barcelonamedia.org/solr/admin/analysis.jsp
...
: las cosas que si no pasan la proxima vez si que no veràs

but are you sure that example would actually cause a problem?

i suspect if you index that exact sentence as is you wouldn't see the facet count for si or que increase at all.

If you do a query for {!raw field=content}que you bypass the query parser (which is respecting your stopwords file) and see all docs that contain the raw term que in the content field.

if you look at some of the docs that match, and paste their content field into the analysis tool, i think you'll see that the problem comes from using the whitespace tokenizer, and is masked by using the WDF (WordDelimiterFilter) after the stop filter ... things like Que? are getting ignored by the stopfilter, but ultimately winding up in your index as que

-Hoss
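The ordering problem described in this message can be reproduced with a small Python sketch. It is a simplified model under stated assumptions, not Solr's implementation: STOPWORDS is an assumed subset of stopwords.txt, strip_delimiters is a crude stand-in for the WordDelimiterFilter, and the stemming and duplicate-removal filters are left out. It runs the filters in the same order as the content field: whitespace tokenizer, stop filter, delimiter handling, lowercase.

```python
import re

# Assumed subset of stopwords.txt, for illustration only.
STOPWORDS = {"que", "si", "no", "la"}

def whitespace_tokenize(text):
    # Splits on whitespace only, so punctuation stays attached ("Que?").
    return text.split()

def stop_filter(tokens):
    # Case-insensitive match, like ignoreCase=true; "que?" is not "que".
    return [t for t in tokens if t.lower() not in STOPWORDS]

def strip_delimiters(tokens):
    # Crude stand-in for the WordDelimiterFilter: keep runs of word characters.
    out = []
    for tok in tokens:
        out.extend(re.findall(r"\w+", tok))
    return out

def index_terms(text):
    # Same order as the field's chain: tokenize, stop, delimiters, lowercase.
    tokens = strip_delimiters(stop_filter(whitespace_tokenize(text)))
    return [t.lower() for t in tokens]

# Clean input: the stopwords are removed as the analysis page suggests.
print(index_terms("las cosas que si no pasan la proxima vez si que no veràs"))
# ['las', 'cosas', 'pasan', 'proxima', 'vez', 'veràs']

# Input with punctuation: the stopwords slip past the stop filter.
print(index_terms("Que? si, no... la casa"))
# ['que', 'si', 'no', 'casa']
```

In this model, only the bare token la is removed from the second input; Que?, si, and no... pass the stop filter with their punctuation attached and are indexed as que, si and no afterwards.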
Re: facets and stopwords
Sorry, I was too cryptic.

If you follow this link
http://projecte01.development.barcelonamedia.org/fonetic/
you will see a Top Words list (in Spanish and stemmed). In the list there is the word si, which is in 20649 documents. If you click on this word, the system will perform the query (x) content:si, with no answers at all. The same goes for la: it is in 17881 documents, but the query content:la gives no answers at all.

The facets list is generated by the query
http://projecte01.development.barcelonamedia.org/solr/select/?rows=0&start=0&q=*:*&facet=true&facet.limit=-1&facet.field=content&facet.field=entities_misc&wt=json&json.wrf=jsonp1246437157825&jsoncallback=jsonp1246437157825&_=1246437158023

But the question is: why are these two words (among others) there if they are stop words?

To see what's going on in the index I have tested with the analyzer
http://projecte01.development.barcelonamedia.org/solr/admin/analysis.jsp
If I select the field content and enter the text
las cosas que si no pasan la proxima vez si que no veràs
I get the following tokens at the end of the analyzer
las cosapasan proxima vez sí verà
where que, si, no and la are removed because they are treated as stop words.

But in the schema browser
http://projecte01.development.barcelonamedia.org/solr/admin/schema.jsp
for the field content, que is the 3rd term, no is the 4th, and si and la are among the top 40 terms.

The analyzer for the content field can be seen on that page and is configured as follows:

Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
Filters:
1. org.apache.solr.analysis.StopFilterFactory args:{enablePositionIncrements: true words: stopwords.txt ignoreCase: true }
2. org.apache.solr.analysis.WordDelimiterFilterFactory args:{catenateWords: 1 catenateNumbers: 1 splitOnCaseChange: 1 catenateAll: 0 generateNumberParts: 1 generateWordParts: 1 }
3. org.apache.solr.analysis.LowerCaseFilterFactory args:{}
4. org.apache.solr.analysis.SnowballPorterFilterFactory args:{languange: Spanish }
5. org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory args:{}

The field is indexed, tokenized and stored, and term vectors are stored.

So, why are the stopwords in the index?
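As a side note on the facet request quoted in this message, here is a minimal Python sketch of how such a URL can be assembled from its parameters. The parameter names and values are the ones visible in the message (the JSONP callback and cache-busting parameters are omitted); the helper name build_facet_url is hypothetical. A list of pairs is used rather than a dict because facet.field appears twice.

```python
from urllib.parse import urlencode

# Host and path taken from the message above.
BASE_URL = "http://projecte01.development.barcelonamedia.org/solr/select/"

# A list of pairs, not a dict, so that facet.field can be repeated.
FACET_PARAMS = [
    ("rows", 0),
    ("start", 0),
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.limit", -1),
    ("facet.field", "content"),
    ("facet.field", "entities_misc"),
    ("wt", "json"),
]

def build_facet_url(base_url, params):
    # Hypothetical helper: joins the parameters with "&" and percent-encodes them.
    return base_url + "?" + urlencode(params)

facet_url = build_facet_url(BASE_URL, FACET_PARAMS)
print(facet_url)
```

With rows=0 the response carries no documents, only the facet counts, and facet.limit=-1 asks for every term in each faceted field.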
Re: facets and stopwords
: Date: Tue, 9 Jun 2009 16:04:03 -0700 (PDT)
: From: JCodina
: Subject: facets and stopwords
:
: I have a text field from where I remove stop words, as a first approximation
: I use facets to see the most common words in the text, but.. stopwords are
: there, and if I search documents having the stopwords, then , there are no
: documents in the answer.

JCodina: I'm not sure if you've already figured out the solution to your problem, but in the future it would help if you could include the relevant sections from your schema.xml and solrconfig.xml when asking questions.

in this case, showing us the field/fieldtype you are faceting on as well as how your request handler is declared, and what URL you are using would help people understand your problem.

Even looking at the dev port URL you sent, i had no idea which field name i should be looking at to try and understand the problem.

-Hoss
facets and stopwords
I have a text field from which I remove stop words. As a first approximation, I use facets to see the most common words in the text, but the stopwords are there, and if I search for documents containing those stopwords, no documents are returned.

You can test it at this address (using solrjs; the texts are in Spanish, but you can check in Top Words that que or en are there). If you click on them to perform the search, no results are given:
http://projecte01.development.barcelonamedia.org/fonetic/
Or use the admin pages at
http://projecte01.development.barcelonamedia.org/solr/admin
so you can check what's going on in the content field.

I use the DataImportHandler to import the data, and the Solr analyzer shows me that the stopwords are removed from both the query and the indexed text, so why do the facets show me these words?