Erick, Koji, Ahmet:

Thank you all for your answers! I think I found the problem and I am on the 
right track to fix it.

1- As you suggested, the problem was in the Java code populating the index: the 
analyzer in the Java code has to be consistent with the one defined in Solr. I 
was able to achieve my goal by creating a slightly customized analyzer (a rough 
sketch of the idea is below).
2- Being able to see the tokens in the index was key to debugging the problem. I 
downloaded Luke (well, a version of it tweaked for Lucene 4.4) to inspect the 
tokens. I did not know Solr had that terms component; that is a good tip too.
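
In case it helps anyone else, the idea was roughly this (an illustrative sketch 
against the Lucene 4.4 API, not my exact code; the class name is made up and the 
stopword/protword filters from the schema are omitted):

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
    import org.apache.lucene.util.Version;

    // Mirrors the index-side chain of the "text_general" type in schema.xml:
    // whitespace tokenizer -> word delimiter -> lowercase.
    public class SchemaMatchingAnalyzer extends Analyzer {
        private static final int WDF_FLAGS =
                  WordDelimiterFilter.PRESERVE_ORIGINAL
                | WordDelimiterFilter.GENERATE_WORD_PARTS
                | WordDelimiterFilter.GENERATE_NUMBER_PARTS
                | WordDelimiterFilter.CATENATE_WORDS
                | WordDelimiterFilter.CATENATE_NUMBERS
                | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE;

        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_44, reader);
            TokenStream result = new WordDelimiterFilter(source, WDF_FLAGS, null); // null = no protected words
            result = new LowerCaseFilter(Version.LUCENE_44, result);
            return new TokenStreamComponents(source, result);
        }
    }

Using an analyzer like this at index time makes the terms written by the Java 
program line up with what the admin/analysis page predicts.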

Have a good weekend.

Thanks,
Yetkin

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, May 02, 2014 11:57 AM
To: solr-user@lucene.apache.org
Subject: Re: Searching for tokens does not return any results

bq:  but this index was created using a Java program using Lucene interface

Elaborating a bit on Koji's comment...

The fact that you used Lucene to index the doc means that the analysis page is 
almost, but not quite entirely, useless on the indexing side.
It's looking at your field definition in schema.xml and running your input 
stream through the indexing portion of your analysis chain constructed from the 
schema. What's actually in your index, though, was put there by raw Lucene. So
your Lucene program _must_ create an analysis chain that is absolutely 
identical to what's in your schema for the admin/analysis page to be accurate.
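
A quick way to see the difference is to dump what your indexing analyzer 
actually emits and compare it with the analysis page. Something like this 
(Lucene 4.x API; the field name and sample text are just examples, and swap in 
whatever analyzer your indexing program really uses):

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class DumpTokens {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44); // your indexing analyzer here
            TokenStream ts = analyzer.tokenStream("DBASE_LOCAT_NM_TEXT",
                                                  new StringReader("CRD_PROD"));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // each token as it would be indexed
            }
            ts.end();
            ts.close();
        }
    }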

Quick test: go to your "admin/schema browser" page or use the TermsComponent
(https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
or Luke to examine the actual tokens in your field. My bet is that you'll see 
that the actual terms are not what you expect and almost certainly not what the 
admin/analysis page shows on the index side.
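
For instance, assuming the /terms handler from the example solrconfig.xml is 
enabled (the core name here is just a placeholder), a request like this lists 
the terms actually indexed for the field:

http://localhost:8983/solr/collection1/terms?terms.fl=DBASE_LOCAT_NM_TEXT&terms.limit=50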

Keeping an independent program that writes to your index with raw Lucene aligned 
with your schema is, as you can see, something of a problem. If at all possible, 
consider letting Solr do the indexing and sending it documents with SolrJ; 
here's a reference:
https://cwiki.apache.org/confluence/display/solr/Using+SolrJ
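
A minimal sketch of that approach (SolrJ 4.x; the URL, core name, and id field 
are just examples for illustration):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexWithSolrJ {
        public static void main(String[] args) throws Exception {
            SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("DBASE_LOCAT_NM_TEXT", "CRD_PROD");

            server.add(doc);    // Solr applies the schema's index-time analysis chain
            server.commit();
            server.shutdown();
        }
    }

This way the index-time analysis always comes from schema.xml, so there is 
nothing to keep in sync.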

By the way, I want to compliment you on your post. You did all the right things:
> defined your problem clearly
> added the critical bit (index created with Lucene); this is especially relevant, I think
> illustrated the input and output
> told us what the problem was
> gave us the field definitions
> showed the results of some of your investigation

Best
Erick

On Thu, May 1, 2014 at 7:31 AM, Koji Sekiguchi <k...@r.email.ne.jp> wrote:
> Hi Yetkin, welcome!
>
> I think StandardAnalyzer of Lucene is the problem you are facing.
>
> Why don't you have another field using StandardAnalyzer and see how it 
> tokenizes CRD_PROD on Solr admin GUI?
>
> I forget the details, but we can use Lucene's Analyzer in schema.xml with
> something like this:
>
> <fieldType ...>
>    <analyzer class="solr.StandardAnalyzer"/>
> </fieldType>
>
> Koji
> --
> http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html
>
>
> (2014/05/01 23:04), Yetkin Ozkucur wrote:
>>
>> Hello everyone,
>>
>> I am new to Solr, and this is my first post on this list.
>> I have been working on this problem for a couple of days. I tried 
>> everything I found on Google, but it looks like I am missing something.
>>
>> Here is my problem:
>> I have a field called DBASE_LOCAT_NM_TEXT. It contains values like 
>> CRD_PROD. The goal is to be able to search this field either by the 
>> exact string "CRD_PROD" or by part of it (tokenized by "_"), like 
>> "CRD" or "PROD".
>>
>> Currently:
>> This query returns results: q=DBASE_LOCAT_NM_TEXT:CRD_PROD
>> But this one does not: q=DBASE_LOCAT_NM_TEXT:CRD
>> I want to understand why the second query does not return any results.
>>
>> Here is how I configured the field:
>> <field name="DBASE_LOCAT_NM_TEXT" type="text_general" indexed="true"
>> stored="true" required="false" multiValued="false"/>
>>
>> And Here is how I configured the field type :
>>     <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>>       <analyzer type="index">
>>         <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
>>                 generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>                 catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>         <filter class="solr.LowerCaseFilterFactory"/>
>>         <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>       </analyzer>
>>       <analyzer type="query">
>>         <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
>>                 generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>                 catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>         <filter class="solr.LowerCaseFilterFactory"/>
>>         <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>       </analyzer>
>>     </fieldType>
>>
>> I am also using the analysis panel in the SOLR admin console. It 
>> shows
>> this:
>> WT      CRD_PROD
>>
>> WDF     CRD_PROD
>>         CRD
>>         PROD
>>         CRDPROD
>>
>> SF      CRD_PROD
>>         CRD
>>         PROD
>>         CRDPROD
>>
>> LCF     crd_prod
>>         crd
>>         prod
>>         crdprod
>>
>> SKMF    crd_prod
>>         crd
>>         prod
>>         crdprod
>>
>> RDTF    crd_prod
>>         crd
>>         prod
>>         crdprod
>>
>>
>> I am not sure if it is related, but this index was created by a Java 
>> program using the Lucene interface. It used StandardAnalyzer for 
>> writing, and the field was configured as tokenized, indexed, and 
>> stored. Does this affect the Solr configuration?
>>
>> Can you please help me understand what I am missing and how I can 
>> debug it?
>>
>> Thanks,
>> Yetkin
>>
>
>
>
