Re: Field Analyzers: which values are indexed?

2011-04-13 Thread Ben Davies
Thanks both for your replies.

Erick,
Yep, I use the Analysis page extensively, but what I was really looking
for was whether all or only the last line of the values shown by the analysis
page were eventually indexed.
I think we've concluded it's only the last line.

Cheers,
Ben

On Wed, Apr 13, 2011 at 2:41 PM, Erick Erickson wrote:

> CharFilterFactories are applied to the raw input before tokenization.
> Each token output from the tokenization is then sent through
> the rest of the chain.
>
> The Analysis page available from the Solr admin page is
> invaluable in answering in great detail what each part of
> an analysis chain does.
>
> TokenFilterFactories are applied to each token emitted from
> the tokenizer, and this includes the similar
> PatternReplaceFilterFactory. The difference is that the
> PatternReplaceCharFilterFactory is applied before tokenization
> to the entire input stream and PatternReplaceFilterFactory
> is applied to each token emitted by the tokenizer.
>
> And to make it even more fun, you can do both!
>
> Best
> Erick
>
> On Wed, Apr 13, 2011 at 8:14 AM, Ben Davies  wrote:
>
> > Hi there,
> >
> > Just a quick question that the wiki page (
> > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters) didn't seem
> > to answer very well.
> >
> > Given an analyzer that has zero or more Char Filter Factories, one
> > Tokenizer Factory, and zero or more Token Filter Factories, which
> > value(s) are indexed?
> >
> > Is every value that is produced by each char filter, tokenizer, and
> > filter indexed?
> > Or is only the final value after completing the whole chain indexed?
> >
> > Cheers,
> > Ben
> >
>
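
As a concrete illustration of the chain discussed in this thread, here is a
minimal sketch of a schema.xml field type that uses both a
PatternReplaceCharFilterFactory and a PatternReplaceFilterFactory. The field
type name text_example and both regex patterns are made up for illustration;
only the factory class names are real.

```xml
<!-- Hypothetical field type: the name and the patterns are illustrative only. -->
<fieldType name="text_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filter: applied once to the entire raw input, before tokenization.
         Here it turns underscores into spaces so the tokenizer splits on them. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="_" replacement=" "/>
    <!-- Exactly one tokenizer: splits the char-filtered input into tokens. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Token filters: applied to each token, in the order listed.
         Here the first one strips any non-alphanumeric characters from a token. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="[^a-zA-Z0-9]" replacement="" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, only the tokens that come out of the last filter
(LowerCaseFilterFactory here) are indexed. The intermediate rows on the
Analysis page show how each stage transformed the input, not extra indexed
values.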


Field Analyzers: which values are indexed?

2011-04-13 Thread Ben Davies
Hi there,

Just a quick question that the wiki page (
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters) didn't seem to
answer very well.

Given an analyzer that has zero or more Char Filter Factories, one
Tokenizer Factory, and zero or more Token Filter Factories, which value(s)
are indexed?

Is every value that is produced by each char filter, tokenizer, and filter
indexed?
Or is only the final value after completing the whole chain indexed?

Cheers,
Ben


Re: Indexing data with Trade Mark Symbol

2011-04-05 Thread Ben Davies
Use admin/analysis.jsp to see which filter is removing it.
Configure a field type appropriate to what you want to index.

On Mon, Apr 4, 2011 at 9:55 AM, mechravi25  wrote:

> Hi,
>  Has anyone indexed data containing the Trade Mark symbol? When I tried to
> index it, the data appears as below.
>
> Data:
>  79797 - Siebel Research– AI Fund,
>  79797 - Siebel Research– AI Fund,l
>
>
> Original Data:
> 79797 - Siebel Research™ AI Fund,
>
>
> Please help me to resolve this
>
> Regards,
> Ravi
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-data-with-Trade-Mark-Symbol-tp2774421p2774421.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: question on solr.ASCIIFoldingFilterFactory

2011-04-05 Thread Ben Davies
I can't remember where I read it, but I think MappingCharFilterFactory is
preferred.
There is an example in the example schema.



From this, I get:
org.apache.solr.analysis.MappingCharFilterFactory
{mapping=mapping-ISOLatin1Accent.txt}
|text|despues|
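
A field type along the following lines would produce output like the above.
This is a sketch: the name text_charnorm is a placeholder and the choice of
tokenizer is an assumption. It also assumes the mapping-ISOLatin1Accent.txt
file named in the output is present in the core's conf directory.

```xml
<!-- Hypothetical field type name; the mapping file is the one named in the output above. -->
<fieldType name="text_charnorm" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Maps accented characters to unaccented forms before tokenization,
         so "después" reaches the tokenizer as "despues". -->
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <!-- Assumed tokenizer; any tokenizer can follow the char filter. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```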



On Tue, Apr 5, 2011 at 5:06 PM, Nemani, Raj  wrote:

> All,
>
> I am using solr.ASCIIFoldingFilterFactory to perform accent-insensitive
> search.  One of the words that got indexed as part of my indexing process is
> "después".  Having used the ASCIIFoldingFilterFactory, I expected that if I
> searched for the word "despues", the document containing the word
> "después" would show up in the results, but that was not the case.  Then I
> used Analysis.jsp to analyze "después" and noticed that the
> ASCIIFoldingFilterFactory folded "después" as "despue".
>
>
>
> If I repeat the above exercise for the word "Imágenes", then Analysis.jsp
> tells me that the ASCIIFoldingFilterFactory folded "Imágenes" as "imagen".
>  But I can search for "Imagenes" and get the correct results.
>
>
>
> I am not familiar with Spanish but I found the above behavior confusing.
>  Can anybody please explain the behavior described above?
>
>
>
> Thanks a million in advance
>
> Raj
>
>
>
>