Re: [solrmarc-tech] apostrophe / ayn / alif

2012-05-25 Thread Charles Riley
 "the encoding of the character used for alif (02BE) carries with it an
assigned property in the Unicode database of (Lm), putting it into the
category of 'Modifier_Letter'..."

Correction to what I wrote there: the code point is 02BC, not 02BE.  The
rest still holds up; the property data I'm looking at can be found here:

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
http://www.unicode.org/reports/tr44/#Property_Values
ftp://ftp.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt

Charles


Re: [solrmarc-tech] apostrophe / ayn / alif

2012-05-24 Thread Charles Riley
True, no argument there as to usage.

I should have clarified that the encoding of the character used for alif
(02BE) carries with it an assigned property in the Unicode database of
(Lm), putting it into the category of 'Modifier_Letter', which contrasts
with the property (Sk), 'Modifier_Symbol', a property assigned to
characters that are more commonly used as diacritics.
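
As an illustration of the Lm / Sk distinction described above, here is a minimal sketch that looks up the General_Category of a few code points. Python and its standard unicodedata module are an assumption of this sketch, not something used in the thread, and treating 02BB as the ayn is also an assumption, since the thread only gives a code point for alif.

```python
import unicodedata

# Code points relevant to the thread, plus a few for contrast.
# The labels are descriptive only; "ayn" for U+02BB is an assumption.
SAMPLES = {
    "apostrophe": "\u0027",
    "modifier letter turned comma (assumed ayn)": "\u02BB",
    "modifier letter apostrophe (alif)": "\u02BC",
    "modifier letter right half ring": "\u02BE",
    "acute accent (spacing)": "\u00B4",
    "combining acute accent": "\u0301",
}


def general_category(ch):
    """Return the two-letter Unicode General_Category of one character."""
    return unicodedata.category(ch)


for label, ch in SAMPLES.items():
    print(f"U+{ord(ch):04X}  {general_category(ch)}  {label}")
```

In this sketch the modifier letters report Lm, the spacing acute accent reports Sk, the plain apostrophe reports Po (punctuation), and the combining accent reports Mn, which is the category that diacritic folding typically targets.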

I think which characters the filter factories cover was determined by
these assigned properties, though yes, each character often has a broader
range of uses in practice.

Charles


-- 
Charles L. Riley
Catalog Librarian for Africana
Sterling Memorial Library, Yale University
<zenodo...@gmail.com>
203-432-7566


Re: [solrmarc-tech] apostrophe / ayn / alif

2012-05-24 Thread Naomi Dushay
The alif and ayn can also be used as diacritic-like characters in Korean;  this 
is a known practice.   But thanks anyway.


Re: [solrmarc-tech] apostrophe / ayn / alif

2012-05-24 Thread Charles Riley
Hi Naomi,

I don't have a conclusive answer for you on this yet, but let me pick up on
a few points.

First, the apostrophe is probably being handled through ignoring
punctuation in the ICUCollationKeyFilterFactory.

Alif isn't a diacritic but a letter, and it would be handled according to
its character properties as such. That apparently also puts it outside the
scope of what the folding filter factory does, unless the filter is tailored.
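
To show why a letter-class character can slip past diacritic folding, here is a minimal Python sketch. It is a simplified stand-in, not the ICU folding filter itself: it only decomposes text and drops nonspacing marks (General_Category Mn). The function name and the sample strings are hypothetical.

```python
import unicodedata


def strip_combining_marks(text):
    """Decompose to NFD, then drop nonspacing marks (General_Category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Hypothetical sample strings, not taken from any real record.
accented = "Pyo\u0306ngyang"     # o followed by a combining breve
with_alif = "P\u02BCyongyang"    # modifier letter apostrophe after P

print(strip_combining_marks(accented))                # the breve is removed
print(strip_combining_marks(with_alif) == with_alif)  # True: U+02BC survives
```

Because 02BC is a modifier letter (Lm) with no decomposition, a mark-stripping pass leaves it in place, which matches the behaviour reported in the original question.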

From the solrwiki, this looks like a helpful rule of thumb:

"When To use a CharFilter vs a TokenFilter

There are several pairs of CharFilters and TokenFilters that have related
(ie: MappingCharFilter and ASCIIFoldingFilter) or nearly identical
functionality (ie: PatternReplaceCharFilterFactory and
PatternReplaceFilterFactory) and it may not always be obvious which is the
best choice.

The ultimate decision depends largely on what Tokenizer you are using, and
whether you need to "out smart" it by preprocessing the stream of
characters.

For example, maybe you have a tokenizer such as StandardTokenizer and you
are pretty happy with how it works overall, but you want to customize how
some specific characters behave.
In such a situation you could modify the rules and re-build your own
tokenizer with javacc, but perhaps its easier to simply map some of the
characters before tokenization with a CharFilter."
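
The idea of mapping characters before tokenization can be sketched outside Solr as follows. This is a Python illustration of the principle only, not Solr configuration: the set of characters to drop, the function names, and the simple regex tokenizer are all assumptions made for the example.

```python
import re

# Hypothetical mapping: which characters to drop is a local policy decision.
APOSTROPHE_LIKE = {
    "\u0027": None,  # apostrophe
    "\u02BB": None,  # modifier letter turned comma (assumed here for ayn)
    "\u02BC": None,  # modifier letter apostrophe (alif, per the thread)
}
CHAR_MAP = str.maketrans(APOSTROPHE_LIKE)


def map_chars(text):
    """Character-level step: runs on the raw string before tokenization."""
    return text.translate(CHAR_MAP)


def tokenize(text):
    """Toy tokenizer: lowercase, then split on runs of non-word characters."""
    return re.findall(r"\w+", text.lower())


def analyze(text):
    """Map characters first, then tokenize."""
    return tokenize(map_chars(text))


print(tokenize("ch'a"))        # the apostrophe splits the token
print(tokenize("ch\u02BCa"))   # the modifier letter stays inside the token
print(analyze("ch'a"), analyze("ch\u02BCa"), analyze("cha"))
```

Without the mapping step the toy tokenizer treats the two spellings differently; with it, all three variants produce the same single token, so searches with and without the character would match.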


Charles

On Tue, May 15, 2012 at 2:47 PM, Naomi Dushay  wrote:

> We are using the ICUFoldingFilterFactory with great success to fold
> diacritics so searches with and without the diacritics get the same results.
>
> We recently discovered we have some Korean records that use an alif
> diacritic instead of an apostrophe, and this diacritic is NOT getting
> folded.   Has anyone experienced this for alif or ayn characters?   Do you
> have a solution?
>
>
> - Naomi
>
> --
> You received this message because you are subscribed to the Google Groups
> "solrmarc-tech" group.
> To post to this group, send email to solrmarc-t...@googlegroups.com.
> To unsubscribe from this group, send email to
> solrmarc-tech+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/solrmarc-tech?hl=en.
>
>


*203-432-7566*