Re: How does KeywordRepeatFilterFactory help giving a higher score to an original term vs a stemmed term

2014-09-25 Thread Diego Fernandez
The difference comes from the fact that when you query the same form, it matches 2 
tokens, including the less common one.  When you query a different form, you only 
match on the more common form.  So really you're getting the "boost" from both 
the tiny difference in TF*IDF and the extra token that you match on.

However, I agree that adding a payload might be a better solution.
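A rough sketch of what such a filter could look like as a Lucene TokenFilter
(hypothetical code, not an existing Solr filter -- the class name and weight
parameter are made up, and you'd still need a payload-aware similarity or
function query for the payload to actually affect scoring):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

/**
 * Hypothetical filter: place it after KeywordRepeatFilter and the stemmer.
 * Originals keep the keyword attribute and get no payload; the stemmed
 * copies get a payload < 1 so they can be down-weighted at query time.
 */
public final class StemPayloadFilter extends TokenFilter {
  private final KeywordAttribute keywordAtt = addAttribute(KeywordAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final BytesRef stemmedPayload;

  public StemPayloadFilter(TokenStream input, float stemmedWeight) {
    super(input);
    this.stemmedPayload = new BytesRef(PayloadHelper.encodeFloat(stemmedWeight));
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    if (!keywordAtt.isKeyword()) {
      // Stemmed (non-keyword) copy: attach the reduced weight.
      payloadAtt.setPayload(stemmedPayload);
    }
    return true;
  }
}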

- Original Message -
> Hi - but this makes no sense; they are scored as equals, except for tiny
> differences in TF and IDF. What you would need is something like a stemmer
> that preserves the original token and gives a < 1 payload to the stemmed
> token. The same goes for filters like decompounders and accent folders that
> change the meaning of words.
>  
>  
> -Original message-
> > From:Diego Fernandez 
> > Sent: Wednesday 17th September 2014 23:37
> > To: solr-user@lucene.apache.org
> > Subject: Re: How does KeywordRepeatFilterFactory help giving a higher score
> > to an original term vs a stemmed term
> > 
> > I'm not 100% on this, but I imagine this is what happens:
> > 
> > (using -> to mean "tokenized to")
> > 
> > Suppose that you index:
> > 
> > "I am running home" -> "am run running home"
> > 
> > If you then query "running home" -> "run running home", it matches the extra
> > token and thus gives a higher score than if you query "runs home" -> "run runs home".
> > 
> > 
> > - Original Message -
> > > The Solr wiki says   "A repeated question is "how can I have the
> > > original term contribute
> > > more to the score than the stemmed version"? In Solr 4.3, the
> > > KeywordRepeatFilterFactory has been added to assist this
> > > functionality. "
> > > 
> > > https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming
> > > 
> > > (Full section reproduced below.)
> > > I can see how, in the example from the wiki reproduced below, both
> > > the stemmed and the original term get indexed, but I don't see how the
> > > original term gets more weight than the stemmed term.  Wouldn't this
> > > require a filter that gives terms with the keyword attribute more
> > > weight?
> > > 
> > > What am I missing?
> > > 
> > > Tom
> > > 
> > > 
> > > 
> > > -
> > > "A repeated question is "how can I have the original term contribute
> > > more to the score than the stemmed version"? In Solr 4.3, the
> > > KeywordRepeatFilterFactory has been added to assist this
> > > functionality. This filter emits two tokens for each input token, one
> > > of them is marked with the Keyword attribute. Stemmers that respect
> > > keyword attributes will pass through the token so marked without
> > > change. So the effect of this filter would be to index both the
> > > original word and the stemmed version. The 4 stemmers listed above all
> > > respect the keyword attribute.
> > > 
> > > For terms that are not changed by stemming, this will result in
> > > duplicate, identical tokens in the document. This can be alleviated by
> > > adding the RemoveDuplicatesTokenFilterFactory.
> > > 
> > > <fieldtype name="..." class="solr.TextField" positionIncrementGap="100">
> > >   <analyzer>
> > >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >     <filter class="solr.KeywordRepeatFilterFactory"/>
> > >     <filter class="solr.PorterStemFilterFactory"/>
> > >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > >   </analyzer>
> > > </fieldtype>
> > > "
> > > 
> > 
> > --
> > Diego Fernandez - 爱国
> > Software Engineer
> > GSS - Diagnostics
> > 
> > 
> 

-- 
Diego Fernandez - 爱国
Software Engineer
GSS - Diagnostics

IRC: aiguofer on #gss and #customer-platform


Re: How does KeywordRepeatFilterFactory help giving a higher score to an original term vs a stemmed term

2014-09-17 Thread Diego Fernandez
I'm not 100% on this, but I imagine this is what happens:

(using -> to mean "tokenized to")

Suppose that you index:

"I am running home" -> "am run running home"

If you then query "running home" -> "run running home", it matches the extra 
token and thus gives a higher score than if you query "runs home" -> "run runs home".


- Original Message -
> The Solr wiki says   "A repeated question is "how can I have the
> original term contribute
> more to the score than the stemmed version"? In Solr 4.3, the
> KeywordRepeatFilterFactory has been added to assist this
> functionality. "
> 
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming
> 
> (Full section reproduced below.)
> I can see how, in the example from the wiki reproduced below, both
> the stemmed and the original term get indexed, but I don't see how the
> original term gets more weight than the stemmed term.  Wouldn't this
> require a filter that gives terms with the keyword attribute more
> weight?
> 
> What am I missing?
> 
> Tom
> 
> 
> 
> -
> "A repeated question is "how can I have the original term contribute
> more to the score than the stemmed version"? In Solr 4.3, the
> KeywordRepeatFilterFactory has been added to assist this
> functionality. This filter emits two tokens for each input token, one
> of them is marked with the Keyword attribute. Stemmers that respect
> keyword attributes will pass through the token so marked without
> change. So the effect of this filter would be to index both the
> original word and the stemmed version. The 4 stemmers listed above all
> respect the keyword attribute.
> 
> For terms that are not changed by stemming, this will result in
> duplicate, identical tokens in the document. This can be alleviated by
> adding the RemoveDuplicatesTokenFilterFactory.
> 
> <fieldtype name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.KeywordRepeatFilterFactory"/>
>     <filter class="solr.PorterStemFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldtype>
> "
> 

-- 
Diego Fernandez - 爱国
Software Engineer
GSS - Diagnostics



Re: [Announce] Apache Solr 4.10 with RankingAlgorithm 1.5.4 available now with complex-lsa algorithm (simulates human language acquisition and recognition)

2014-09-09 Thread Diego Fernandez
Interesting.  Does anyone know how that compares to this 
http://www.searchbox.com/products/searchbox-plugins/solr-sense/?

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics


- Original Message -
> Hi!
> 
> I am very excited to announce the availability of Apache Solr 4.10 with
> RankingAlgorithm 1.5.4.
> 
> Solr 4.10.0 with RankingAlgorithm 1.5.4 includes support for complex-lsa.
> complex-lsa simulates human language acquisition and recognition (see demo
> <http://solr-ra.tgels.org/rankingsearchlsa.jsp> ) and can retrieve
> semantically related/hidden relationships between terms, sentences,
> paragraphs, chapters, books, images, etc. Three new similarities,
> TERM_SIMILARITY, DOCUMENT_SIMILARITY, TERM_DOCUMENT_SIMILARITY enable these
> with improved precision.  A query for “holy AND ghost” returns jesus/christ
> as the top results for the bible corpus with no effort to introduce this
> relationship (see demo <http://solr-ra.tgels.org/rankingsearchlsa.jsp> ).
> 
>  
> 
> This version adds support for multiple linear algebra libraries.
> complex-lsa does a large amount of these calculations, so speeding them up
> should speed up retrieval. EJML is the fastest if you are using complex-lsa
> for a smaller set of documents, while MTJ is faster as your document
> collection becomes bigger. MTJ can also use BLAS/LAPACK, etc., installed on
> your system to further improve performance with native execution; the
> performance is similar to a C/C++ application. It can also make use of GPUs
> or Intel's MKL library if you have access to them.
> 
> RankingAlgorithm 1.5.4 with complex-lsa supports the entire Lucene Query
> Syntax, ± and/or boolean/dismax/glob/regular
> expression/wildcard/fuzzy/prefix/suffix queries with boosting, etc. This
> version increases performance, with improved accuracy and relevance for
> document similarity, and fixes problems with phrase queries, Boolean queries,
> etc.
> 
> 
> You can get more information about complex-lsa and realtime-search
> performance from here:
> http://solr-ra.tgels.org/wiki/en/Complex-lsa-demo
> 
> You can download Solr 4.10 with RankingAlgorithm 1.5.4 from here:
> http://solr-ra.tgels.org
> 
> Please download and give the new version a try.
> 
> Regards,
> 
> Nagendra Nagarajayya
> http://solr-ra.tgels.org
> http://elasticsearch-ra.tgels.org
> http://rankingalgorithm.tgels.org
> 
> Note:
> 1. Apache Solr 4.10 with RankingAlgorithm 1.5.4 is an external project.
> 
> 
> 
> 


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Diego Fernandez
Although not a solution, this may help in trying to find the problem.
In http://solr.pl/en/2010/08/16/what-is-schema-xml/ it says:

"It is worth noting that there is an additional attribute for the text field 
type:

autoGeneratePhraseQueries

This attribute is responsible for telling filters how to behave when dividing 
tokens. Some filters (such as WordDelimiterFilter) can divide a token into a set 
of tokens. Setting the attribute to true (the default value) will automatically 
generate phrase queries. This means that WordDelimiterFilter will divide the 
word “wi-fi” into two tokens, “wi” and “fi”. With autoGeneratePhraseQueries set 
to true, the query sent to Lucene will look like field:"wi fi", while with it set to 
false the Lucene query will look like field:wi OR field:fi. However, please note 
that this attribute only behaves well with tokenizers based on white space."

Since phrase matching looks at token positions, it is possible that the 
positions set for the other generated tokens have something to do with it.  Have 
you tried setting autoGeneratePhraseQueries="false" to see if it'll match both? 
(I know that might have other unintended behaviors, but it might give some 
insight into the problem.)
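For reference, a minimal sketch of where that attribute lives (the field type
name and filter settings here are made up for illustration):

<fieldType name="text_test" class="solr.TextField"
    positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" catenateWords="1"/>
  </analyzer>
</fieldType>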

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics



- Original Message -
> On 9/2/14 1:51 PM, Erick Erickson wrote:
> > bq: In my actual index, query "MacBook" is matching ONLY "mac book", and
> > not "macbook"
> >
> > I suspect your query parameters for WordDelimiterFilterFactory doesn't have
> > catenate words set.
> >
> > What do you see when you enter these in both the index and query portions
> > of the admin/analysis page?
> 
> Thanks Erick!
> 
> Our WordDelimiterFilterFactory does have catenate words set, in both
> index and query phases (is that right?):
> 
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
> 
> It's hard to cut and paste the results of the analysis page into email
> (or anywhere!), so I'll give you screenshots, sorry -- and I'll give them
> for our whole real-world app's complex field definition. I'll also paste
> in our entire field definition below. But I realize my next step is
> probably creating a simpler isolation/reproduction case (unless you have
> a magic answer from this!).
> 
> Again, the problem is that "MacBook" seems to be only matching on
> indexed "macbook" and not indexed "mac book".
> 
> 
> "MacBook" query analysis:
> https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png
> 
> "MacBook" index analysis:
> https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png
> 
> "mac book" index analysis:
> https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png
> 
> 
> Our entire actual field definition:
> 
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
>
> 
>  rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
> 
>  synonyms="punctuation-whitelist.txt" ignoreCase="true"/>
> 
>   generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> 
> 
>  
>  
> 
> 
>  
> 
>   language="English" protected="protwords.txt"/>
>  
>
>  
> 
> 
> 
> 
> 


Re: questions on Solr WordBreakSolrSpellChecker and WordDelimiterFilterFactory

2014-07-16 Thread Diego Fernandez
Which tokenizer are you using?  StandardTokenizer will split "x-box" into "x" 
and "box", same as "x box".

If there aren't too many of these, you could also use the 
PatternReplaceCharFilterFactory to map "x box" and "x-box" to "xbox" before the 
tokenizer.
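A rough sketch of that approach (the regex is only illustrative and would need
tuning for your data; the charFilter goes inside the analyzer, before the
tokenizer):

<charFilter class="solr.PatternReplaceCharFilterFactory"
    pattern="(?i)\bx[\s\-]box\b" replacement="xbox"/>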

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics


- Original Message -
> Jia,
> 
> I agree that for the spellcheckers to work, you need <arr name="last-components"> instead of <arr name="components">.
> 
> But the "x-box" => "xbox" example ought to be solved by analyzing using
> WordDelimiterFilterFactory and "catenateWords=1" at query-time.  Did you
> re-index after changing your analysis chain (you need to)?  Perhaps you can
> show your full analyzer configuration, and someone here can help you find
> the problem. Also, the Analysis page on the solr Admin UI is invaluable for
> debugging text-field analyzer problems.
> 
> Getting "x box" to analyze to "xbox" is trickier (but possible).  The
> WordBreakSpellChecker is probably your best option if you have cases like
> this in your data & users' queries.
> 
> Of course, if you have a finite number of products that have spelling
> variants like this, SynonymFilterFactory might be all you need.  I would
> recommend using index-time synonyms for your case rather than query-time
> synonyms.
> 
> James Dyer
> Ingram Content Group
> (615) 213-4311
> 
> 
> -Original Message-
> From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
> Sent: Wednesday, July 16, 2014 7:42 AM
> To: solr-user@lucene.apache.org; j...@ece.ubc.ca
> Subject: Re: questions on Solr WordBreakSolrSpellChecker and
> WordDelimiterFilterFactory
> 
> Hi Jia,
> 
> What happens when you use
> 
> <arr name="last-components">
> 
> instead of
> 
> <arr name="components">
> 
> Ahmet
> 
> 
> On Wednesday, July 16, 2014 3:07 AM, "j...@ece.ubc.ca" 
> wrote:
> 
> 
> 
> Hello everyone :)
> 
> I have a product called "xbox" indexed, and when the user searches for
> either "x-box" or "x box" I want the "xbox" product to be
> returned.  I'm new to Solr, and from reading online, I thought I needed
> to use WordDelimiterFilterFactory for the "x-box" case, and
> WordBreakSolrSpellChecker for the "x box" case. Is this correct?
> 
> (1) In my schema file, this is what I changed:
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="1" splitOnCaseChange="0" preserveOriginal="1"/>
> 
> But I don't see the xbox product returned when the search term is
> "x-box", so I must have missed something
> 
> (2) I tried to use  WordBreakSolrSpellChecker together with
> DirectSolrSpellChecker as shown below, but the WordBreakSolrSpellChecker
> never got used:
> 
> <searchComponent name="wc_spellcheck" class="solr.SpellCheckComponent">
>   <str name="queryAnalyzerFieldType">wc_textSpell</str>
> 
>   <lst name="spellchecker">
>     <str name="name">default</str>
>     <str name="field">spellCheck</str>
>     <str name="classname">solr.DirectSolrSpellChecker</str>
>     <str name="distanceMeasure">internal</str>
>     <float name="accuracy">0.3</float>
>     <int name="maxEdits">2</int>
>     <int name="minPrefix">1</int>
>     <int name="maxInspections">5</int>
>     <int name="minQueryLength">3</int>
>     <float name="maxQueryFrequency">0.01</float>
>     <float name="thresholdTokenFrequency">0.004</float>
>   </lst>
> 
>   <lst name="spellchecker">
>     <str name="name">wordbreak</str>
>     <str name="classname">solr.WordBreakSolrSpellChecker</str>
>     <str name="field">spellCheck</str>
>     <str name="combineWords">true</str>
>     <str name="breakWords">true</str>
>     <int name="maxChanges">10</int>
>   </lst>
> </searchComponent>
> 
> <requestHandler name="/spellcheck"
>     class="org.apache.solr.handler.component.SearchHandler">
>   <lst name="defaults">
>     <str name="df">SpellCheck</str>
>     <str name="spellcheck">true</str>
>     <str name="spellcheck.dictionary">default</str>
>     <str name="spellcheck.dictionary">wordbreak</str>
>     <str name="spellcheck.extendedResults">true</str>
>     <str name="spellcheck.onlyMorePopular">false</str>
>     <str name="spellcheck.count">10</str>
>     <str name="spellcheck.collate">true</str>
>     <str name="spellcheck.collateExtendedResults">false</str>
>   </lst>
>   <arr name="components">
>     <str>wc_spellcheck</str>
>   </arr>
> </requestHandler>
> 
> I tried to build the dictionary this way:
> http://localhost/solr/coreName/select?spellcheck=true&spellcheck.build=true,
> but the response returned is this:
> 
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">0</int>
>     <lst name="params">
>       <str name="spellcheck">true</str>
>       <str name="spellcheck.build">true</str>
>     </lst>
>   </lst>
>   ...
>   build
> </response>
> 
> 
> 
> What's the correct way to build the dictionary?
> Even though my requestHandler's name is "/spellcheck", I wasn't able to
> use
> http://localhost/solr/coreName/spellcheck?spellcheck=true&spellcheck.build=true
> ... is there something wrong with my definition above?
> 
> (3) I also tried to use WordBreakSolrSpellChecker without the
> DirectSolrSpellChecker as shown below:
> <searchComponent name="wc_spellcheck" class="solr.SpellCheckComponent">
> 
>   <str name="queryAnalyzerFieldType">wc_textSpell</str>
>   <lst name="spellchecker">
>     <str name="name">default</str>
>     <str name="classname">solr.WordBreakSolrSpellChecker</str>
>     <str name="field">spellCheck</str>
>     <str name="combineWords">true</str>
>     <str name="breakWords">true</str>
>     <int name="maxChanges">10</int>
>   </lst>
> </searchComponent>
> 
> <requestHandler name="/spellcheck"
>     class="org.apache.solr.handler.component.SearchHandler">
>   <lst name="defaults">
>     <str name="df">SpellCheck</str>
>     <str name="spellcheck">true</str>
>     <str name="spellcheck.dictionary">default</str>
>     <str name="spellcheck.extendedResults">true</str>
>     <str name="spellcheck.onlyMorePopular">false</str>
>     <str name="spellcheck.count">10</str>
>     <str name="spellcheck.collate">true</str>
>     <str name="spellcheck.collateExtendedResults">false</str>
>   </lst>
>   <arr name="components">
>     <str>wc_spellcheck</str>
>   </arr>
> </requestHandler>
> 
> And I'm still unable to see WordBreakSolrSpellChecker being called anywhere.
> 
> Would someone kindly help me?
> 
> Many thanks,
> Jia
> 
> 
> 


Re: Synonyms - 20th and 20

2014-06-18 Thread Diego Fernandez
What tokenizer and filters are you using?
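(One common source of a stray "20": filters that split on letter/number
transitions. Purely as a hypothetical example -- we haven't seen your chain --
a WordDelimiterFilterFactory configured like this would split "20th" into
"20" and "th":

<filter class="solr.WordDelimiterFilterFactory"
    generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"/>

so the answer really depends on what your analyzer looks like.)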

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics


- Original Message -
> I have a synonyms.txt file which has
> 20th,twentieth
> 
> Once I apply the synonym, I see "20th", "twentieth" and "20" for "20th".
> Does anyone know where "20" comes from? How can I have only "20th" and
> "twentieth"?
> 
> Thanks,
> 
> Jae
> 


Sum of OR'd nested query scores

2014-06-02 Thread Diego Fernandez
Hi! I have a question which I posted on 
http://stackoverflow.com/questions/23959727/sum-of-nested-queries-in-solr about 
taking the sum of OR'd nested queries.  I'll repeat it here, but if you want 
some SO points and have an answer, feel free to answer there.

[quote]

We have a search that takes text from two different fields and then searches 
against 3 different fields.

The two input fields are a summary and description, and the fields we're 
searching in are issue, title, and body. We'd like to essentially perform the two 
queries and then add up their scores. We'd also like to boost the summary query 
and the issue field. Here's what I have right now (slightly simplified):

_query_:"{!edismax qf='issue^2 body title' boost=1.5}" OR 
_query_:"{!edismax qf='issue^2 body title'}"

The problem with this is that if either the summary or the description doesn't match 
anything in a result, the result gets penalized with coord(1/2). I would like 
to simply add the two together and ignore the coord. Is there a way to do this?

[/quote]
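One workaround sketch (hedged -- the $qsummary and $qdesc parameter names, and
the v=$summary/$description references, are made up for illustration): sum the
two subquery scores explicitly with Solr's function queries. Function queries
don't apply the coord factor, so a non-matching subquery simply contributes 0
to the sum:

q={!func}sum(query($qsummary), query($qdesc))
&qsummary={!edismax qf='issue^2 body title' boost=1.5 v=$summary}
&qdesc={!edismax qf='issue^2 body title' v=$description}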

Thanks!

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics




Sum of nested queries in Solr

2014-05-30 Thread Diego Fernandez
Hi! I have a question which I posted on 
http://stackoverflow.com/questions/23959727/sum-of-nested-queries-in-solr about 
taking the sum of OR'd nested queries.  I'll repeat it here, but if you want 
some SO points and have an answer, feel free to answer there.

[quote]

We have a search that takes text from two different fields and then searches 
against 3 different fields.

The two input fields are a summary and description, and the fields we're 
searching in are issue, title, and body. We'd like to essentially perform the two 
queries and then add up their scores. We'd also like to boost the summary query 
and the issue field. Here's what I have right now (slightly simplified):

_query_:"{!edismax qf='issue^2 body title' boost=1.5}" OR 
_query_:"{!edismax qf='issue^2 body title'}"

The problem with this is that if either the summary or the description doesn't match 
anything in a result, the result gets penalized with coord(1/2). I would like 
to simply add the two together and ignore the coord. Is there a way to do this?

[/quote]

Thanks!

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics




Re: WordDelimiterFilterFactory and StandardTokenizer

2014-05-20 Thread Diego Fernandez
Hey Ahmet, 

Yeah, I had missed Shawn's response; I'll have to give that a try as well.  As 
for the version, we're using 4.4.  StandardTokenizer sets the type for HANGUL, 
HIRAGANA, IDEOGRAPHIC, KATAKANA, and SOUTHEAST_ASIAN, and you're right, we're 
using the TypeTokenFilter to remove those.
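A minimal sketch of that setup with the stock TypeTokenFilterFactory instead of
a custom filter (the types file name is made up; the file lists one token type
per line, e.g. <HIRAGANA>):

<filter class="solr.TypeTokenFilterFactory" types="cjk-types.txt"
    useWhitelist="false"/>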

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics


- Original Message -
> Hi Diego,
> 
> Did you miss Shawn's response? His ICUTokenizerFactory solution is better
> than mine.
> 
> By the way, what solr version are you using? Does StandardTokenizer set type
> attribute for CJK words?
> 
> To filter out given types, you do not need a custom filter. The Type Token filter
> serves exactly that purpose.
> https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-TypeTokenFilter
> 
> 
> 
> On Tuesday, May 20, 2014 5:50 PM, Diego Fernandez 
> wrote:
> Great, thanks for the information!  Right now we're using the
> StandardTokenizer types to filter out CJK characters with a custom filter.
>   I'll test using MappingCharFilters, although I'm a little concerned with
> possible adverse scenarios.
> 
> Diego Fernandez - 爱国
> Software Engineer
> US GSS Supportability - Diagnostics
> 
> 
> 
> - Original Message -
> > Hi Aiguofer,
> > 
> > You mean ClassicTokenizer? Because StandardTokenizer does not set token
> > types
> > (e-mail, url, etc).
> > 
> > 
> > I wouldn't go with the JFlex edit, mainly because of maintenance costs. It
> > will be a burden to maintain a custom tokenizer.
> > 
> > MappingCharFilters could be used to manipulate tokenizer behavior.
> > 
> > Just an example: if you don't want your tokenizer to break on hyphens,
> > replace the hyphen with something that your tokenizer does not break on, for
> > example an underscore.
> > 
> > "-" => "_"
> > 
> > 
> > 
> > Plus, WDF can be customized too. Please see the types attribute:
> > 
> > http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/collection1/conf/wdftypes.txt
> > 
> >  
> > Ahmet
> > 
> > 
> > On Friday, May 16, 2014 6:24 PM, aiguofer  wrote:
> > Jack Krupansky-2 wrote
> > 
> > > Typically the white space tokenizer is the best choice when the word
> > > delimiter filter will be used.
> > > 
> > > -- Jack Krupansky
> > 
> > If we wanted to keep the StandardTokenizer (because we make use of the
> > token
> > types) but wanted to use the WDFF to get combinations of words that are
> > split with certain characters (mainly - and /, but possibly others as
> > well),
> > what is the suggested way of accomplishing this? Would we just have to
> > extend the JFlex file for the tokenizer and re-compile it?
> > 
> > 
> > 
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/WordDelimiterFilterFactory-and-StandardTokenizer-tp4131628p4136146.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> > 
> >
> 


Re: WordDelimiterFilterFactory and StandardTokenizer

2014-05-20 Thread Diego Fernandez
Great, thanks for the information!  Right now we're using the StandardTokenizer 
types to filter out CJK characters with a custom filter.  I'll test using 
MappingCharFilters, although I'm a little concerned with possible adverse 
scenarios.  
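A minimal sketch of the kind of mapping I'd be testing, assuming a mapping file
whose name is made up here:

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-chars.txt"/>

where mapping-chars.txt contains lines like:

"-" => "_"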

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics


- Original Message -
> Hi Aiguofer,
> 
> You mean ClassicTokenizer? Because StandardTokenizer does not set token types
> (e-mail, url, etc).
> 
> 
> I wouldn't go with the JFlex edit, mainly because of maintenance costs. It will
> be a burden to maintain a custom tokenizer.
> 
> MappingCharFilters could be used to manipulate tokenizer behavior.
> 
> Just an example: if you don't want your tokenizer to break on hyphens,
> replace the hyphen with something that your tokenizer does not break on, for
> example an underscore.
> 
> "-" => "_"
> 
> 
> 
> Plus, WDF can be customized too. Please see the types attribute:
> 
> http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/collection1/conf/wdftypes.txt
> 
>  
> Ahmet
> 
> 
> On Friday, May 16, 2014 6:24 PM, aiguofer  wrote:
> Jack Krupansky-2 wrote
> 
> > Typically the white space tokenizer is the best choice when the word
> > delimiter filter will be used.
> > 
> > -- Jack Krupansky
> 
> If we wanted to keep the StandardTokenizer (because we make use of the token
> types) but wanted to use the WDFF to get combinations of words that are
> split with certain characters (mainly - and /, but possibly others as well),
> what is the suggested way of accomplishing this? Would we just have to
> extend the JFlex file for the tokenizer and re-compile it?
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/WordDelimiterFilterFactory-and-StandardTokenizer-tp4131628p4136146.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 
>