Re: Luke and SOLR search giving different results

Erol Akarsu Mon, 03 Dec 2012 12:44:53 -0800

Jack,

I see something interesting now.


Instead of "baş", I entered "features:baş" as the search query in the "q"
field of the SOLR GUI, and I got a result!

My one document has some fields of type text_eng and text_general, and one
field, features, of type text_tr. If I don't specify a field name, SOLR
seems to use the EnglishAnalyzer. If I do, it uses the analyzer of the field
named in the query string.

Is this true?
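
As far as I understand it, a query with no field prefix goes to the default
search field, and that field's type decides which analyzer is applied. Here
is a rough sketch, assuming the setup follows the stock Solr 4.0 example
config. The "df" value and the type of the "text" field are assumptions, not
copied from my files:

```xml
<!-- solrconfig.xml (assumed): default field for queries with no field prefix -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="df">text</str>
  </lst>
</requestHandler>

<!-- schema.xml (assumed): the catch-all field. Its type, not text_tr,
     would analyze q=baş; text_general here is only a guess. -->
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="features" dest="text"/>
```

If this is right, q=features:baş is analyzed with text_tr, while q=baş is
analyzed with whatever type the "text" field has.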

Erol Akarsu

On Mon, Dec 3, 2012 at 1:30 PM, Erol Akarsu <eaka...@gmail.com> wrote:

> Jack,
>
> I have the following in schema.xml, which defines "features" as type text_tr.
>
> But unfortunately, the search still fails.
>
>
>  <field name="features" type="text_tr" indexed="true" stored="true"
> multiValued="true"/>
> <copyField source="features" dest="text"/>
>
>
> <fieldType name="text_tr" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer type="index">
>          <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.TurkishLowerCaseFilterFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="lang/stopwords_tr.txt" enablePositionIncrements="true"/>
>          <filter class="solr.SnowballPorterFilterFactory"
> language="Turkish"/>
>       </analyzer>
>       <analyzer type="query">
>
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.TurkishLowerCaseFilterFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="lang/stopwords_tr.txt" enablePositionIncrements="true"/>
>          <filter class="solr.SnowballPorterFilterFactory"
> language="Turkish"/>
>       </analyzer>
>     </fieldType>
>
>
>
>
> On Mon, Dec 3, 2012 at 1:15 PM, Jack Krupansky <j...@basetechnology.com> wrote:
>
>> Ah! See where it says "<str name="parsedquery_toString">text:baş</str>"?
>> Your query is against the "text" field, which probably doesn't have the
>> Turkish analysis.
>>
>> There is probably a copyField from "features" to "text". You use the
>> "text_tr" field type for "features", but probably not for the "text" field.
>>
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Erol Akarsu
>> Sent: Monday, December 03, 2012 1:06 PM
>>
>> To: solr-user@lucene.apache.org
>> Subject: Re: Luke and SOLR search giving different results
>>
>> Jack,
>>
>> I had already set the Tomcat server to UTF-8 encoding. I added
>> URIEncoding="UTF-8" to all <Connector ..> elements in server.xml in Tomcat
>> 7.
>>
>> As you see below, when I search for the word "baş" in debug mode I get an
>> empty response. But when I search for the word "baştan", I get the correct
>> response.
>>
>> It seems to me that TurkishAnalyzer is not being used in the SOLR search,
>> because only the full word "baştan" matches and not the root word "baş".
>> Probably EnglishAnalyzer is being used and cannot find the root word. For
>> example, in Luke, if I change "Analyzer to use for query parsing" to
>> EnglishAnalyzer, it cannot find the word "baş"; it can only with
>> TurkishAnalyzer. I guess SOLR is not using TurkishAnalyzer.
>>
>> Is this assumption true? I could not find any other reason.
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <response>
>>    <lst name="responseHeader">
>>        <int name="status">0</int>
>>        <int name="QTime">58</int>
>>        <lst name="params">
>>            <str name="debugQuery">true</str>
>>            <str name="q">baş</str>
>>            <str name="wt">xml</str>
>>        </lst>
>>    </lst>
>>    <result name="response" numFound="0" start="0" />
>>    <lst name="debug">
>>        <str name="rawquerystring">baş</str>
>>        <str name="querystring">baş</str>
>>        <str name="parsedquery">text:baş</str>
>>        <str name="parsedquery_toString">text:baş</str>
>>        <lst name="explain" />
>>        <str name="QParser">LuceneQParser</str>
>>        <lst name="timing">
>>            <double name="time">38.0</double>
>>            <lst name="prepare">
>>                <double name="time">16.0</double>
>>                <lst name="org.apache.solr.handler.component.QueryComponent">
>>                    <double name="time">3.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.FacetComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.HighlightComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.StatsComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.DebugComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>            </lst>
>>            <lst name="process">
>>                <double name="time">10.0</double>
>>                <lst name="org.apache.solr.handler.component.QueryComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.FacetComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.HighlightComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.StatsComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.DebugComponent">
>>                    <double name="time">10.0</double>
>>                </lst>
>>            </lst>
>>        </lst>
>>    </lst>
>> </response>
>>
>> <response>
>>    <lst name="responseHeader">
>>        <int name="status">0</int>
>>        <int name="QTime">2</int>
>>        <lst name="params">
>>            <str name="debugQuery">true</str>
>>            <str name="q">baştan</str>
>>            <str name="wt">xml</str>
>>        </lst>
>>    </lst>
>>    <result name="response" numFound="1" start="0">
>>        <doc>
>>            <str name="url">htt://111.a.b1</str>
>>            <str name="id">6H500F0XXXX</str>
>>            <str name="lang">tr</str>
>>            <str name="name">Maxtor DiamondMax 11 - hard drive - 500 GB -
>> SATA-300
>>            </str>
>>            <str name="manu">Maxtor Corp.</str>
>>            <str name="manu_id_s">maxtor</str>
>>            <arr name="cat">
>>                <str>electronics</str>
>>                <str>hard drive</str>
>>            </arr>
>>            <arr name="features">
>>                <str>SATA 3.0Gb/s, NCQ</str>
>>                <str>8.5ms seek</str>
>>                <str>16MB cache</str>
>>                <str>
>>                    Firmalarsa "Nasılsa buldum oynatacak ünlüyü, neyleyim
>> senaryoyu!" diyerek
>>                    baştan savma reklamlarla kotarmaya bakıyor işi.
>> Futbolcu Arda Turan
>>                    ve büyük umutlarla Türkiye'ye getirilen Paris Hilton'un
>> oynatıldığı
>>                    giyim firması reklamı da tam bir fiyasko. Birbirinden
>> ünlü bu iki
>>                    ismin oynadığı reklam Arda'nın kabinde papağan gibi
>> tekrarladığı
>>                    "My darling!" repliği, sonunda Paris'i görünce anlam
>> veremediğimiz
>>                    uyduruk bayılma sahnesi, bir de Paris'in ancak 5 kez
>> izledikten
>>                    sonra anlaşılan "Paris seçti, firma yaptı, Arda
>> bayıldı."
>>                    sözleriyle kazındı hafızalara, "Keşke unutabilsek!"
>> dedirterek.
>>                </str>
>>            </arr>
>>            <float name="price">350.0</float>
>>            <str name="price_c">350,USD</str>
>>            <int name="popularity">6</int>
>>            <bool name="inStock">true</bool>
>>            <date name="manufacturedate_dt">2006-02-13T15:26:37Z</date>
>>            <long name="_version_">1420300467908378624</long>
>>        </doc>
>>    </result>
>>    <lst name="debug">
>>        <str name="rawquerystring">baştan</str>
>>        <str name="querystring">baştan</str>
>>        <str name="parsedquery">text:baştan</str>
>>        <str name="parsedquery_toString">text:baştan</str>
>>        <lst name="explain">
>>            <str name="6H500F0XXXX">
>>                0.028767452 = (MATCH) weight(text:baştan in 0)
>> [DefaultSimilarity], result of:
>>                0.028767452 = fieldWeight in 0, product of:
>>                1.0 = tf(freq=1.0), with freq of:
>>                1.0 = termFreq=1.0
>>                0.30685282 = idf(docFreq=1, maxDocs=1)
>>                0.09375 = fieldNorm(doc=0)
>>            </str>
>>        </lst>
>>        <str name="QParser">LuceneQParser</str>
>>        <lst name="timing">
>>            <double name="time">2.0</double>
>>            <lst name="prepare">
>>                <double name="time">1.0</double>
>>                <lst name="org.apache.solr.handler.component.QueryComponent">
>>                    <double name="time">1.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.FacetComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.HighlightComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.StatsComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.DebugComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>            </lst>
>>            <lst name="process">
>>                <double name="time">1.0</double>
>>                <lst name="org.apache.solr.handler.component.QueryComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.FacetComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.HighlightComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.StatsComponent">
>>                    <double name="time">0.0</double>
>>                </lst>
>>                <lst name="org.apache.solr.handler.component.DebugComponent">
>>                    <double name="time">1.0</double>
>>                </lst>
>>            </lst>
>>        </lst>
>>    </lst>
>> </response>
>>
>> On Mon, Dec 3, 2012 at 12:30 PM, Jack Krupansky <j...@basetechnology.com>
>> wrote:
>>
>>  Two points:
>>>
>>> 1. Possibly an encoding problem with your container? Is UTF-8 encoding
>>> enabled?
>>> 2. Add &debugQuery=true to your query (from the browser) and see if the
>>> parsedquery has the expected term that matches what Luke reports for the
>>> index and what Solr Admin Analysis also reports for index analysis.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: Erol Akarsu
>>> Sent: Monday, December 03, 2012 11:35 AM
>>>
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Luke and SOLR search giving different results
>>>
>>> Jack,
>>>
>>> Yes.
>>>
>>> I expect SOLR to give the same search results as Luke does.
>>>
>>> The term analysis in SOLR gives the correct answer, as expected. But SOLR
>>> does not return the correct search results.
>>>
>>> I don't know why.
>>>
>>> Erol Akarsu
>>>
>>> On Mon, Dec 3, 2012 at 11:21 AM, Jack Krupansky <j...@basetechnology.com>
>>> wrote:
>>>
>>>
>>>> So, does that highlight the problem for you or not? Is the term analyzed
>>>> as you expected?
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> From: Erol Akarsu
>>>> Sent: Monday, December 03, 2012 8:44 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Luke and SOLR search giving different results
>>>>
>>>> Jack,
>>>>
>>>> Thanks for the help.
>>>>
>>>> I removed the SOLR data folder and indexed this sample doc from scratch,
>>>> so it is the only document in SOLR.
>>>>
>>>> When I ran the analysis, I could see that stemming is correct: the words
>>>> "bul", "baş", "gör" and "umut" appear in the SF row.
>>>> I attached the analysis screens.
>>>>
>>>> Erol Akarsu
>>>>
>>>>
>>>> On Sun, Dec 2, 2012 at 11:00 PM, Jack Krupansky <
>>>> j...@basetechnology.com>
>>>> wrote:
>>>>
>>>>   Have you tried using the Solr Admin Analysis page, using the word and a
>>>> few words of context for index analysis and the word alone for query
>>>> analysis?
>>>>
>>>>   And be sure to fully reindex if you change ANYTHING in the schema fields
>>>> or field types.
>>>>
>>>>   -- Jack Krupansky
>>>>
>>>>   From: Erol Akarsu
>>>>   Sent: Sunday, December 02, 2012 10:38 PM
>>>>   To: solr-user@lucene.apache.org
>>>>   Subject: Luke and SOLR search giving different results
>>>>
>>>>
>>>>   Hi,
>>>>
>>>>   I am trying to apply SOLR to the Turkish language for my research.
>>>>
>>>>   Instead of using language identification, I manually assigned the Turkish
>>>> language to a sample test document. I configured the SOLR schema.xml and
>>>> activated the part below. I added the attached document
>>>> testTurkishDoc.xml, which is inserted into the SOLR database.
>>>>
>>>>   But searching the raw Lucene index through Luke and searching through the
>>>> SOLR 4.0 GUI give different results. In picture Selection_006.png, the
>>>> word "baş" is listed as the top term. I searched for the word "baş" in Luke
>>>> and got the only document as the result, shown in Selection_004.png.
>>>>
>>>>   But in the SOLR GUI, I get an empty result for the word "baş", as shown in
>>>> picture Selection_002.png.
>>>>
>>>>   In the text we have the features field, which contains the word "baştan",
>>>> derived from the root word "baş" in Turkish grammar. Somehow, the SOLR GUI
>>>> searches differently than Luke. I could not figure out why SOLR cannot
>>>> find it while Luke can. The same thing happens for the words "umut",
>>>> "bul" and "gör".
>>>>
>>>>   I would appreciate it if you could help me get the same results from the
>>>> SOLR UI.
>>>>
>>>>
>>>>   <field name="features">
>>>>          Firmalarsa "Nasılsa buldum oynatacak ünlüyü, neyleyim
>>>> senaryoyu!"
>>>> diyerek baştan savma reklamlarla kotarmaya bakıyor işi. Futbolcu Arda
>>>> Turan
>>>> ve büyük umutlarla Türkiye'ye getirilen Paris Hilton'un oynatıldığı
>>>> giyim
>>>> firması reklamı da tam bir fiyasko. Birbirinden ünlü bu iki ismin
>>>> oynadığı
>>>> reklam Arda'nın kabinde papağan gibi tekrarladığı "My darling!" repliği,
>>>> sonunda Paris'i görünce anlam veremediğimiz uyduruk bayılma sahnesi, bir
>>>> de
>>>> Paris'in ancak 5 kez izledikten sonra anlaşılan "Paris seçti, firma
>>>> yaptı,
>>>> Arda bayıldı." sözleriyle kazındı hafızalara, "Keşke unutabilsek!"
>>>> dedirterek.
>>>>     </field>
>>>>
>>>>
>>>>
>>>>   Added to schema.xml for SOLR:
>>>>
>>>>   <field name="features" type="text_tr" indexed="true" stored="true"
>>>> multiValued="true"/>
>>>>   <fieldType name="text_tr" class="solr.TextField"
>>>> positionIncrementGap="100">
>>>>         <analyzer type="index">
>>>>           <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>           <filter class="solr.TurkishLowerCaseFilterFactory"/>
>>>>           <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> words="lang/stopwords_tr.txt" enablePositionIncrements="true"/>
>>>>           <filter class="solr.SnowballPorterFilterFactory"
>>>> language="Turkish"/>
>>>>         </analyzer>
>>>>         <analyzer type="query">
>>>>           <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>           <filter class="solr.TurkishLowerCaseFilterFactory"/>
>>>>           <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> words="lang/stopwords_tr.txt" enablePositionIncrements="true"/>
>>>>           <filter class="solr.SnowballPorterFilterFactory"
>>>> language="Turkish"/>
>>>>         </analyzer>
>>>>       </fieldType>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
