Re: facets and stopwords

2009-07-08 Thread JCodina



hossman wrote:
 
 
 but are you sure that example would actually cause a problem?
 i suspect if you index that exact sentence as is you wouldn't see the 
 facet count for si or que increase at all.
 
 If you do a query for {!raw field=content}que you bypass the query 
 parsers (which is respecting your stopwords file) and see all docs that 
 contain the raw term que in the content field.
 
 if you look at some of the docs that match, and paste their content field 
 into the analysis tool, i think you'll see that the problem comes from 
 using the whitespace tokenizer, and is masked by using the WDF 
 after the stop filter ... things like Que? are getting ignored by the 
 stopfilter, but ultimately winding up in your index as que
 
 
 -Hoss
 
 

Yes, you are right: que? que, que... I need to change the analyzer. They are
not detected by the stop filter because I use the whitespace
tokenizer. I will use the StandardTokenizer instead.
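
A minimal sketch of why a punctuation-aware tokenizer helps, in Python rather than Solr itself. The regex is only a rough stand-in for StandardTokenizer, and the stop list is an assumed excerpt of stopwords.txt, not the real file.

```python
import re

# Assumed excerpt of stopwords.txt (the real file is not shown in this thread).
STOPWORDS = {"que", "si", "no", "la"}

def word_tokenize(text):
    # Rough stand-in for a punctuation-aware tokenizer:
    # keep runs of letters/digits, drop punctuation (Unicode-aware).
    return re.findall(r"[^\W_]+", text)

def stop_filter(tokens):
    # Mirrors ignoreCase=true: compare the lowercased token to the stop list.
    return [t for t in tokens if t.lower() not in STOPWORDS]

only_stopwords = stop_filter(word_tokenize("que? que, que..."))
mixed = stop_filter(word_tokenize("Que? no veràs"))
print(only_stopwords)
print(mixed)
```

Because the punctuation is gone before the stop filter runs, every variant of que reaches the filter as a bare word and is dropped.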

Thanks Hoss

-- 
View this message in context: 
http://www.nabble.com/facets-and-stopwords-tp23952823p24390157.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: facets and stopwords

2009-07-07 Thread Chris Hostetter

: http://projecte01.development.barcelonamedia.org/fonetic/
: you will see a Top Words list (in Spanish and stemmed) in the list there
: is the word si which is in  20649 documents.
: If you click at this word, the system will perform the query 
:   (x) content:si, with no answers at all
: The same for la it is in 17881 documents, but the query  content:la will
: give no answers at all
...
: To see what's going on on the index I have tested with the analyzer
: http://projecte01.development.barcelonamedia.org/solr/admin/analysis.jsp
...
: las cosas que si no pasan la proxima vez si que no veràs

but are you sure that example would actually cause a problem?
i suspect if you index that exact sentence as is you wouldn't see the 
facet count for si or que increase at all.

If you do a query for {!raw field=content}que you bypass the query 
parsers (which is respecting your stopwords file) and see all docs that 
contain the raw term que in the content field.
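
A minimal sketch of building such a request in Python. The base URL is a hypothetical local Solr instance; only the `{!raw field=content}que` query itself comes from this message.

```python
from urllib.parse import urlencode

# Hypothetical base URL; substitute the select handler of your own Solr instance.
SOLR_SELECT = "http://localhost:8983/solr/select"

def raw_term_url(field, term, rows=10):
    """Build a select URL that looks up `term` in `field` without query-time analysis."""
    # {!raw} hands the term to the index untouched, so the stop filter
    # in the query analyzer never gets a chance to remove it.
    params = {"q": "{!raw field=%s}%s" % (field, term), "rows": rows}
    return SOLR_SELECT + "?" + urlencode(params)

url = raw_term_url("content", "que")
print(url)
```

The braces, exclamation mark and space must be URL-encoded, which `urlencode` takes care of.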

if you look at some of the docs that match, and paste their content field 
into the analysis tool, i think you'll see that the problem comes from 
using the whitespace tokenizer, and is masked by using the WDF 
after the stop filter ... things like Que? are getting ignored by the 
stopfilter, but ultimately winding up in your index as que
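
A rough simulation of that filter order in Python (not Solr code). The stop list is an assumed excerpt, the punctuation stripping is only a crude stand-in for the WDF, and stemming is left out.

```python
import re

# Assumed excerpt of stopwords.txt (the real file is not shown in this thread).
STOPWORDS = {"que", "si", "no", "la"}

def whitespace_tokenize(text):
    # Splits on whitespace only, so punctuation stays glued to the word.
    return text.split()

def stop_filter(tokens):
    # Mirrors ignoreCase=true: compare the lowercased token to the stop list.
    return [t for t in tokens if t.lower() not in STOPWORDS]

def strip_punctuation(tokens):
    # Crude stand-in for the WDF: keep only runs of letters/digits.
    out = []
    for t in tokens:
        out.extend(re.findall(r"[^\W_]+", t))
    return out

def index_chain(text):
    tokens = whitespace_tokenize(text)
    tokens = stop_filter(tokens)        # "Que?" is not in the stop list, so it survives
    tokens = strip_punctuation(tokens)  # ...and only now loses its "?"
    return [t.lower() for t in tokens]

print(index_chain("que pasa"))   # ['pasa']
print(index_chain("Que? pasa"))  # ['que', 'pasa']
```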


-Hoss


Re: facets and stopwords

2009-07-01 Thread JCodina

Sorry, I was too cryptic.

If you follow this link 

http://projecte01.development.barcelonamedia.org/fonetic/
you will see a Top Words list (in Spanish and stemmed). In the list there
is the word si, which is in 20649 documents.
If you click on this word, the system will perform the query 
  (x) content:si, with no results at all.
The same goes for la: it is in 17881 documents, but the query content:la
gives no results at all.

the facets list is generated by the query 
http://projecte01.development.barcelonamedia.org/solr/select/?rows=0&start=0&q=*:*&facet=true&facet.limit=-1&facet.field=content&facet.field=entities_misc&wt=json&json.wrf=jsonp1246437157825&jsoncallback=jsonp1246437157825&_=1246437158023
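
Spelled out as individual Solr parameters, the facet request might be assembled as in this Python sketch. The JSONP callback parameters added by solrjs are left out, and the code only builds the URL string without sending anything.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://projecte01.development.barcelonamedia.org/solr/select/"

# A list of pairs (not a dict) because facet.field appears twice.
params = [
    ("rows", 0),              # no documents wanted, only facet counts
    ("start", 0),
    ("q", "*:*"),             # match every document
    ("facet", "true"),
    ("facet.limit", -1),      # no cap on the number of facet values returned
    ("facet.field", "content"),
    ("facet.field", "entities_misc"),
    ("wt", "json"),
]

facet_url = SOLR_SELECT + "?" + urlencode(params)
print(facet_url)
```

Faceting on content counts the terms actually stored in the index, so anything that slipped past the stop filter at index time shows up here.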

but the question is why these two words (among others) are there if they are
stop words?

To see what's going on in the index I have tested with the analyzer
http://projecte01.development.barcelonamedia.org/solr/admin/analysis.jsp

If I select the field content and I write the text

las cosas que si no pasan la proxima vez si que no veràs
 
I get the following tokens at the end of the analyzer

las cosa pasan proxima vez sí verà

where que, si, no and la are removed because they are treated as stop words.

but... in the schema browser
http://projecte01.development.barcelonamedia.org/solr/admin/schema.jsp
for the field content, que is the 3rd term, no is the 4th, and si and la are
among the top 40 terms...

The analyzer for the content field can be seen on this page and has the
following components:


Tokenizer Class:  org.apache.solr.analysis.WhitespaceTokenizerFactory

Filters:

   1. org.apache.solr.analysis.StopFilterFactory
      args:{enablePositionIncrements: true words: stopwords.txt ignoreCase: true }
   2. org.apache.solr.analysis.WordDelimiterFilterFactory
      args:{catenateWords: 1 catenateNumbers: 1 splitOnCaseChange: 1
            catenateAll: 0 generateNumberParts: 1 generateWordParts: 1 }
   3. org.apache.solr.analysis.LowerCaseFilterFactory
      args:{}
   4. org.apache.solr.analysis.SnowballPorterFilterFactory
      args:{languange: Spanish }
   5. org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
      args:{}

The field is indexed, tokenized, stored and termvectors are stored.

So, why are the stopwords in the index?








Re: facets and stopwords

2009-06-30 Thread Chris Hostetter
: Date: Tue, 9 Jun 2009 16:04:03 -0700 (PDT)
: From: JCodina 
: Subject: facets and stopwords

: I have a text field from where I remove stop words, as a first approximation
: I use facets to see the most common words in the text, but.. stopwords are
: there, and if I search documents having the stopwords, then , there are no
: documents in the answer. 

JCodina: I'm not sure if you've already figured out the solution to your 
problem, but in the future it would help if you could include the relevant 
sections from your schema.xml and solrconfig.xml when asking questions.  
in this case, showing us the field/fieldtype you are faceting on as well 
as how your request handler is declared, and what URL you are using would 
help people understand your problem.

Even looking at the dev port URL you sent, i had no idea which field name 
i should be looking at to try and understand the problem.


-Hoss



facets and stopwords

2009-06-09 Thread JCodina

I have a text field from which I remove stop words. As a first approximation
I use facets to see the most common words in the text, but the stopwords are
there, and if I search for documents containing the stopwords, there are no
documents in the answer. 
You can test it at this address (using solrjs; the texts are in Spanish, but
you can check in Top Words that que or en are there). If you click on
them to perform the search, no results are given:
http://projecte01.development.barcelonamedia.org/fonetic/
or the administrator at
http://projecte01.development.barcelonamedia.org/solr/admin
so you can check what's going on in the content field.
I use the DataImportHandler to import the data, and the
Solr analyzer shows me that the stopwords are removed from both the query
and the indexed text, so why do facets show me these words? 
