Glad to hear it!

You shouldn't really have to customize the analyzer to get it to
behave as it would if you just used Solr to ingest documents; just
chain the same tokenizer and filters together in the same order.
That's what Solr does, after all. Of course, you may have special
needs that are better served by more customization.
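
For anyone finding this thread later, here's a rough Python sketch of why
the two chains disagree on CRD_PROD. It is NOT Lucene code: the function
names are made up, and the regexes are only crude stand-ins for
StandardAnalyzer (which, as far as I know, does not break on underscores)
and for the whitespace + WordDelimiterFilter + lowercase chain in the
schema. Stop words, splitOnCaseChange and the other options are ignored.

```python
import re


def standard_like(text):
    # Crude stand-in for StandardAnalyzer: runs of letters, digits and "_"
    # stay together as one token, then everything is lowercased.
    return [t.lower() for t in re.findall(r"\w+", text)]


def word_delimiter_like(text):
    # Crude stand-in for WhitespaceTokenizer + WordDelimiterFilter
    # (preserveOriginal, generateWordParts, catenateWords) + LowerCaseFilter
    # + RemoveDuplicatesTokenFilter.
    out = []
    for tok in text.split():
        parts = [p for p in re.split(r"[^A-Za-z0-9]+", tok) if p]
        variants = [tok] + parts              # original token plus its parts
        if len(parts) > 1:
            variants.append("".join(parts))   # catenated form, e.g. CRDPROD
        for v in variants:
            v = v.lower()
            if v not in out:                  # drop duplicates, keep order
                out.append(v)
    return out


print(standard_like("CRD_PROD"))        # ['crd_prod']
print(word_delimiter_like("CRD_PROD"))  # ['crd_prod', 'crd', 'prod', 'crdprod']
```

With only crd_prod in the index, a query for CRD has nothing to match,
while CRD_PROD still matches because the query side keeps the original
token.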

TermsComponent is a useful tool. Note that you also get raw terms if
you use the admin/schema-browser page, identify your field, and then
click the "show term info" button. That technique is somewhat limited
though. The schema-browser page is especially useful for very small
indexes and/or test cases, I'll admit. I do vaguely remember something
not being right with the schema-browser at one point though, so it might
not work as I expect for 4.4.
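
If it helps, a TermsComponent request is just a URL with a few terms.*
parameters. Here's a small Python sketch that builds one. The host, port,
core name ("collection1") and the "/terms" handler path are assumptions
based on the stock example config, and terms_url is a made-up helper
name; adjust them to your setup.

```python
from urllib.parse import urlencode


def terms_url(base_url, field, prefix=None, limit=20):
    # Build a TermsComponent request for the raw indexed terms of one field.
    params = {
        "terms": "true",
        "terms.fl": field,       # field whose indexed terms we want to list
        "terms.limit": limit,    # maximum number of terms to return
        "wt": "json",
    }
    if prefix:
        params["terms.prefix"] = prefix   # only terms starting with this
    return base_url.rstrip("/") + "/terms?" + urlencode(params)


# Assumed base URL; replace with your own Solr core.
print(terms_url("http://localhost:8983/solr/collection1",
                "DBASE_LOCAT_NM_TEXT", prefix="crd"))
```

Paste the printed URL into a browser (or fetch it with curl) and compare
the terms that come back with what the admin/analysis page shows.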

Best,
Erick

On Fri, May 2, 2014 at 1:56 PM, Yetkin Ozkucur <yetkin.ozku...@asg.com> wrote:
> Erick, Koji, Ahmet:
>
> Thank you all for your answers! I think I found the problem and I am on the 
> right track to fix it.
>
> 1- As you suggested the problem was in the Java code populating the index. 
> The analyzer in the Java code had to be consistent with the one defined in 
> SOLR. I was able to achieve my goal by creating a slightly customized 
> analyzer.
> 2- Being able to see the tokens in the index was key to debugging the problem. I 
> downloaded Luke (well, a tweaked version of it for Lucene 4.4) to be able to 
> see the tokens. I did not know SOLR had the TermsComponent. That is a good tip 
> too.
>
> Have a good weekend.
>
> Thanks,
> Yetkin
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Friday, May 02, 2014 11:57 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Searching for tokens does not return any results
>
> bq:  but this index was created using a Java program using Lucene interface
>
> Elaborating a bit on Koji's comment...
>
> The fact that you used Lucene to index the doc means that the analysis page 
> is almost, but not quite entirely, useless on the indexing side.
> It's looking at your field definition in schema.xml and running your input 
> stream through the indexing portion of your analysis chain constructed from 
> the schema. What's actually in your index though was put there by raw Lucene. 
> So your Lucene program _must_ create an analysis chain that is absolutely 
> identical to what's in your schema for the admin/analysis page to be accurate.
>
> Quick test: go to your "admin/schema browser" page or use the TermsComponent 
> (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
> or Luke to examine the actual tokens in your field. My bet is that you'll see 
> that the actual terms are not what you expect and almost certainly not what 
> the admin/analysis page shows on the index side.
>
> Keeping an independent Lucene program that puts data into your index with raw 
> Lucene aligned with your schema is, as you can see, something of a problem. 
> If at all possible, consider letting Solr do the indexing and sending it 
> documents with SolrJ, here's a reference:
> https://cwiki.apache.org/confluence/display/solr/Using+SolrJ
>
> By the way, I want to compliment you on your post. You did all the right 
> things:
>> defined your problem clearly
>> added the critical bit (index created with Lucene). This is especially
>> relevant, I think
>> illustrated the input and output
>> told us what the problem was
>> gave us the field definitions
>> showed the results of some of your investigation
>
> Best
> Erick
>
> On Thu, May 1, 2014 at 7:31 AM, Koji Sekiguchi <k...@r.email.ne.jp> wrote:
>> Hi Yetkin, welcome!
>>
>> I think StandardAnalyzer of Lucene is the problem you are facing.
>>
>> Why don't you have another field using StandardAnalyzer and see how it
>> tokenizes CRD_PROD on Solr admin GUI?
>>
>> I forget the details, but we can use a Lucene Analyzer in schema.xml,
>> something like this:
>>
>> <fieldType ...>
>>    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
>> </fieldType>
>>
>> Koji
>> --
>> http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html
>>
>>
>> (2014/05/01 23:04), Yetkin Ozkucur wrote:
>>>
>>> Hello everyone,
>>>
>>> I am new to SOLR and this is my first post in this list.
>>> I have been working on this problem for a couple of days. I tried
>>> everything which I found in google but it looks like I am missing something.
>>>
>>> Here is my problem:
>>> I have a field called: DBASE_LOCAT_NM_TEXT
>>> It contains values like: CRD_PROD
>>> The goal is to be able to search this field either by putting the
>>> exact string "CRD_PROD" or part of it (tokenized by "_") like "CRD"
>>> or "PROD"
>>>
>>> Currently:
>>> This query returns results: q=DBASE_LOCAT_NM_TEXT:CRD_PROD
>>> But this does not: q=DBASE_LOCAT_NM_TEXT:CRD
>>> I want to understand why the second query does not return any results
>>>
>>> Here is how I configured the field:
>>> <field name="DBASE_LOCAT_NM_TEXT" type="text_general" indexed="true"
>>> stored="true" required="false" multiValued="false"/>
>>>
>>> And Here is how I configured the field type :
>>>      <fieldType name="text_general" class="solr.TextField"
>>> positionIncrementGap="100">
>>>        <analyzer type="index">
>>>        <filter class="solr.WordDelimiterFilterFactory"
>>> preserveOriginal="1" generateWordParts="1" generateNumberParts="1"
>>> catenateWords="1" catenateNumbers="1" catenateAll="0"
>>> splitOnCaseChange="1"/>
>>>          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>          <filter class="solr.StopFilterFactory"  ignoreCase="true"
>>> words="stopwords.txt"/>
>>>           <filter class="solr.LowerCaseFilterFactory"/>
>>>          <filter class="solr.KeywordMarkerFilterFactory"
>>> protected="protwords.txt"/>
>>>          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>        </analyzer>
>>>        <analyzer type="query">
>>>          <filter class="solr.WordDelimiterFilterFactory"
>>> preserveOriginal="1" generateWordParts="1" generateNumberParts="1"
>>> catenateWords="0" catenateNumbers="0" catenateAll="0"
>>> splitOnCaseChange="1"/>
>>>          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>          <filter class="solr.StopFilterFactory" ignoreCase="true"
>>> words="stopwords.txt"/>
>>>
>>>          <filter class="solr.LowerCaseFilterFactory"/>
>>>          <filter class="solr.KeywordMarkerFilterFactory"
>>> protected="protwords.txt"/>
>>>          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>
>>>        </analyzer>
>>>      </fieldType>
>>>
>>> I am also using the analysis panel in the SOLR admin console. It
>>> shows this:
>>> WT      CRD_PROD
>>>
>>> WDF     CRD_PROD
>>>         CRD
>>>         PROD
>>>         CRDPROD
>>>
>>> SF      CRD_PROD
>>>         CRD
>>>         PROD
>>>         CRDPROD
>>>
>>> LCF     crd_prod
>>>         crd
>>>         prod
>>>         crdprod
>>>
>>> SKMF    crd_prod
>>>         crd
>>>         prod
>>>         crdprod
>>>
>>> RDTF    crd_prod
>>>         crd
>>>         prod
>>>         crdprod
>>>
>>>
>>> I am not sure if it is related or not but this index was created
>>> using a Java program using Lucene interface. It used StandardAnalyzer
>>> for writing and the field was configured as tokenized, indexed and
>>> stored.  Does this affect the SOLR configuration?
>>>
>>> Can you please help me understand what I am missing and how I can
>>> debug it?
>>>
>>> Thanks,
>>> Yetkin
>>>
>>
>>
>>
