Re: Question regarding searching Chinese characters

2018-08-14 Thread Christopher Beer
Hi all,

Thanks for this enlightening thread. As it happens, at Stanford Libraries we’re 
currently working on upgrading from Solr 4 to 7 and we’re looking forward to 
using the new dictionary-based word splitting in the ICUTokenizer.

We have many of the same challenges Amanda mentioned. Thanks to the advice on 
this thread, we've taken a stab at a CharFilter that does the traditional -> 
simplified transformation [1]. It looks promising, and we've sent it to our 
subject-matter experts for evaluation.

Thanks,
Chris

[1] 
https://github.com/sul-dlss/CJKFilterUtils/blob/master/src/main/java/edu/stanford/lucene/analysis/ICUTransformCharFilter.java
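
To show where it slots in: the transform runs as a CharFilter ahead of the 
tokenizer, so the dictionary-based word splitting sees simplified text. A rough 
sketch of the analyzer (the factory class name here is illustrative only, not a 
published class):

<analyzer>
  <!-- hypothetical factory wrapping the ICUTransformCharFilter linked above;
       maps Traditional to Simplified before tokenization -->
  <charFilter class="edu.stanford.lucene.analysis.ICUTransformCharFilterFactory"
   id="Traditional-Simplified"/>
  <tokenizer class="solr.ICUTokenizerFactory"/>
</analyzer>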

On 2018/07/24 12:54:35, Tomoko Uchida  wrote:
> Hi Amanda,
> 
> > do I just need to change the settings from smartChinese to the ones
> > you posted here
> 
> Yes, the settings I posted should work for you, at least partially.
> If you are happy with the results, it's OK!
> But please take this as a starting point because it's not perfect.
> 
> > Or do I still need to do something with the SmartChineseAnalyzer?
> 
> Try the settings; then, if you notice something strange and want to know why
> and how to solve it, that may be the time to dive into the details. ;)
> 
> I cannot explain how analyzers work here... but you should start off with
> the Solr documentation.
> https://lucene.apache.org/solr/guide/7_0/understanding-analyzers-tokenizers-and-filters.html
> 
> Regards,
> Tomoko
> 
> 
> 
> Jul 24, 2018 (Tue) 21:08 Amanda Shuman :
> 
> > Hi Tomoko,
> >
> > Thanks so much for this explanation - I did not even know this was
> > possible! I will try it out but I have one question: do I just need to
> > change the settings from smartChinese to the ones you posted here:
> >
> > <analyzer>
> >   <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
> >   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
> >   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> > </analyzer>
> >
> > Or do I still need to do something with the SmartChineseAnalyzer? I did not
> > quite understand this in your first message:
> >
> > "I think you need two steps if you want to use HMMChineseTokenizer
> > correctly.
> >
> > 1. transform all traditional characters to simplified ones and save them to
> > temporary files.
> > I do not have a clear idea of how to do this, but you can create a Java
> > program that calls Lucene's ICUTransformFilter
> > 2. then, index into Solr using SmartChineseAnalyzer."
> >
> > My understanding is that with the new settings you posted, I don't need to
> > do these steps. Is that correct? Otherwise, I don't really know how to do
> > step 1 with the Java program
> >
> > Thanks!
> > Amanda
> >
> > --
> > Dr. Amanda Shuman
> > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > 
> > PhD, University of California, Santa Cruz
> > http://www.amandashuman.net/
> > http://www.prchistoryresources.org/
> > Office: +49 (0) 761 203 4925



Re: Question regarding searching Chinese characters

2018-07-24 Thread Tomoko Uchida
Hi Amanda,

> do I just need to change the settings from smartChinese to the ones
> you posted here

Yes, the settings I posted should work for you, at least partially.
If you are happy with the results, it's OK!
But please take this as a starting point because it's not perfect.

> Or do I still need to do something with the SmartChineseAnalyzer?

Try the settings; then, if you notice something strange and want to know why
and how to solve it, that may be the time to dive into the details. ;)

I cannot explain how analyzers work here... but you should start off with
the Solr documentation.
https://lucene.apache.org/solr/guide/7_0/understanding-analyzers-tokenizers-and-filters.html

Regards,
Tomoko



Jul 24, 2018 (Tue) 21:08 Amanda Shuman :

> Hi Tomoko,
>
> Thanks so much for this explanation - I did not even know this was
> possible! I will try it out but I have one question: do I just need to
> change the settings from smartChinese to the ones you posted here:
>
> <analyzer>
>   <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> </analyzer>
>
> Or do I still need to do something with the SmartChineseAnalyzer? I did not
> quite understand this in your first message:
>
> " I think you need two steps if you want to use HMMChineseTokenizer
> correctly.
>
> 1. transform all traditional characters to simplified ones and save them to
> temporary files.
> I do not have a clear idea of how to do this, but you can create a Java
> program that calls Lucene's ICUTransformFilter
> 2. then, index into Solr using SmartChineseAnalyzer."
>
> My understanding is that with the new settings you posted, I don't need to
> do these steps. Is that correct? Otherwise, I don't really know how to do
> step 1 with the Java program
>
> Thanks!
> Amanda
>
>
> --
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> 
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
>
> On Fri, Jul 20, 2018 at 8:03 PM, Tomoko Uchida <
> tomoko.uchida.1...@gmail.com
> > wrote:
>
> > Yes, while the traditional-to-simplified transformation would be out of the
> > scope of Unicode normalization,
> > you may want to add ICUNormalizer2CharFilterFactory anyway :)
> >
> > Let me refine my example settings:
> >
> > <analyzer>
> >   <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
> >   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
> >   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> > </analyzer>
> >
> > Regards,
> > Tomoko
> >
> >
> > Jul 21, 2018 (Sat) 2:54 Alexandre Rafalovitch :
> >
> > > Would  ICUNormalizer2CharFilterFactory do? Or at least serve as a
> > > template of what needs to be done.
> > >
> > > Regards,
> > >Alex.
> > >
> > > On 20 July 2018 at 12:40, Walter Underwood 
> > wrote:
> > > > Looks like we need a charfilter version of the ICU transforms. That
> > > could run before the tokenizer.
> > > >
> > > > I’ve never built a charfilter, but it seems like this would be a good
> > > first project for someone who wants to contribute.
> > > >
> > > > wunder
> > > > Walter Underwood
> > > > wun...@wunderwood.org
> > > > http://observer.wunderwood.org/  (my blog)
> > > >
> > > >> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <
> > > tomoko.uchida.1...@gmail.com> wrote:
> > > >>
> > > >> Exactly. More concretely, the starting point is: replacing your
> > > >> analyzer
> > > >>
> > > >> <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
> > > >>
> > > >> with
> > > >>
> > > >> <analyzer>
> > > >>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
> > > >>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> > > >> </analyzer>
> > > >>
> > > >> and see if the results are as expected. Then look into other filters if
> > > >> your requirements are not met.
> > > >>
> > > >> Just a reminder: HMMChineseTokenizerFactory does not handle traditional
> > > >> characters, as I noted in a previous post, so ICUTransformFilterFactory
> > > >> is an incomplete workaround.
> > > >>
> > > >> Jul 21, 2018 (Sat) 0:05 Walter Underwood :
> > > >>
> > > >>> I expect that this is the line that does the transformation:
> > > >>>
> > > >>>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> > > >>>
> > > >>> This mapping is a standard feature of ICU. More info on ICU
> > transforms
> > > is
> > > >>> in this doc, though not much detail on this particular transform.
> > > >>>
> > > >>> http://userguide.icu-project.org/transforms/general
> > > >>>
> > > >>> wunder
> > > >>> Walter Underwood
> > > >>> wun...@wunderwood.org
> > > >>> http://observer.wunderwood.org/  (my blog)
> > > >>>
> > >  On Jul 20, 2018, at 7:43 AM, Susheel Kumar  >
> > > >>> wrote:
> > > 
> > >  I think so. I used the exact settings as on GitHub
> > > 
> > >  <fieldType name="text_cjk" class="solr.TextField"
> > >   positionIncrementGap="1" autoGeneratePhraseQueries="false">
> > >  <analyzer>
> > >    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
> > >    <tokenizer class="solr.ICUTokenizerFactory"/>
> > >    <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
> > >    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> > >    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
> > >    <filter class="solr.ICUFoldingFilterFactory"/>
> > >    <filter class="solr.CJKBigramFilterFactory" han="true"
> > >     hiragana="true" katakana="true" hangul="true"
> 

Re: Question regarding searching Chinese characters

2018-07-24 Thread Amanda Shuman
Hi Tomoko,

Thanks so much for this explanation - I did not even know this was
possible! I will try it out but I have one question: do I just need to
change the settings from smartChinese to the ones you posted here:

<analyzer>
  <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>

Or do I still need to do something with the SmartChineseAnalyzer? I did not
quite understand this in your first message:

" I think you need two steps if you want to use HMMChineseTokenizer
correctly.

1. transform all traditional characters to simplified ones and save them to
temporary files.
I do not have a clear idea of how to do this, but you can create a Java
program that calls Lucene's ICUTransformFilter
2. then, index into Solr using SmartChineseAnalyzer."

My understanding is that with the new settings you posted, I don't need to
do these steps. Is that correct? Otherwise, I don't really know how to do
step 1 with the Java program

Thanks!
Amanda


--
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project

PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925


On Fri, Jul 20, 2018 at 8:03 PM, Tomoko Uchida  wrote:

> Yes, while the traditional-to-simplified transformation would be out of the
> scope of Unicode normalization,
> you may want to add ICUNormalizer2CharFilterFactory anyway :)
>
> Let me refine my example settings:
>
> <analyzer>
>   <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> </analyzer>
>
> Regards,
> Tomoko
>
>
> Jul 21, 2018 (Sat) 2:54 Alexandre Rafalovitch :
>
> > Would  ICUNormalizer2CharFilterFactory do? Or at least serve as a
> > template of what needs to be done.
> >
> > Regards,
> >Alex.
> >
> > On 20 July 2018 at 12:40, Walter Underwood 
> wrote:
> > > Looks like we need a charfilter version of the ICU transforms. That
> > could run before the tokenizer.
> > >
> > > I’ve never built a charfilter, but it seems like this would be a good
> > first project for someone who wants to contribute.
> > >
> > > wunder
> > > Walter Underwood
> > > wun...@wunderwood.org
> > > http://observer.wunderwood.org/  (my blog)
> > >
> > >> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <
> > tomoko.uchida.1...@gmail.com> wrote:
> > >>
> > >> Exactly. More concretely, the starting point is: replacing your
> > >> analyzer
> > >>
> > >> <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
> > >>
> > >> with
> > >>
> > >> <analyzer>
> > >>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
> > >>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> > >> </analyzer>
> > >>
> > >> and see if the results are as expected. Then look into other filters if
> > >> your requirements are not met.
> > >>
> > >> Just a reminder: HMMChineseTokenizerFactory does not handle traditional
> > >> characters, as I noted in a previous post, so ICUTransformFilterFactory
> > >> is an incomplete workaround.
> > >>
> > >> Jul 21, 2018 (Sat) 0:05 Walter Underwood :
> > >>
> > >>> I expect that this is the line that does the transformation:
> > >>>
> > >>>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> > >>>
> > >>> This mapping is a standard feature of ICU. More info on ICU
> transforms
> > is
> > >>> in this doc, though not much detail on this particular transform.
> > >>>
> > >>> http://userguide.icu-project.org/transforms/general
> > >>>
> > >>> wunder
> > >>> Walter Underwood
> > >>> wun...@wunderwood.org
> > >>> http://observer.wunderwood.org/  (my blog)
> > >>>
> >  On Jul 20, 2018, at 7:43 AM, Susheel Kumar 
> > >>> wrote:
> > 
> >  I think so. I used the exact settings as on GitHub
> > 
> >  <fieldType name="text_cjk" class="solr.TextField"
> >   positionIncrementGap="1" autoGeneratePhraseQueries="false">
> >  <analyzer>
> >    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
> >    <tokenizer class="solr.ICUTokenizerFactory"/>
> >    <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
> >    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> >    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
> >    <filter class="solr.ICUFoldingFilterFactory"/>
> >    <filter class="solr.CJKBigramFilterFactory" han="true"
> >     hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
> >  </analyzer>
> >  </fieldType>
> > 
> > 
> > 
> >  On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <
> > amanda.shu...@gmail.com
> > 
> >  wrote:
> > 
> > > Thanks! That does indeed look promising... This can be added on top
> > of
> > > Smart Chinese, right? Or is it an alternative?
> > >
> > >
> > > --
> > > Dr. Amanda Shuman
> > > Post-doc researcher, University of Freiburg, The Maoist Legacy
> > Project
> > > 
> > > PhD, University of California, Santa Cruz
> > > http://www.amandashuman.net/
> > > http://www.prchistoryresources.org/
> > > Office: +49 (0) 761 203 4925
> > >
> > >
> > > On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <
> > susheel2...@gmail.com>
> > > wrote:
> > >
> > >> I think CJKFoldingFilter will work for you. I put 舊小說 in the index and
> > >> then each of A, B, C, or D in the query, and they seem to match; CJKFF
> > >> is transforming the 舊 to 旧
> > >>
> > >> On Fri, Jul 20, 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Tomoko Uchida
Yes, while the traditional-to-simplified transformation would be out of the
scope of Unicode normalization,
you may want to add ICUNormalizer2CharFilterFactory anyway :)

Let me refine my example settings:

<analyzer>
  <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>

Regards,
Tomoko


Jul 21, 2018 (Sat) 2:54 Alexandre Rafalovitch :

> Would  ICUNormalizer2CharFilterFactory do? Or at least serve as a
> template of what needs to be done.
>
> Regards,
>Alex.
>
> On 20 July 2018 at 12:40, Walter Underwood  wrote:
> > Looks like we need a charfilter version of the ICU transforms. That
> could run before the tokenizer.
> >
> > I’ve never built a charfilter, but it seems like this would be a good
> first project for someone who wants to contribute.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <
> tomoko.uchida.1...@gmail.com> wrote:
> >>
> >> Exactly. More concretely, the starting point is: replacing your analyzer
> >>
> >> <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
> >>
> >> with
> >>
> >> <analyzer>
> >>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
> >>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> >> </analyzer>
> >>
> >> and see if the results are as expected. Then look into other filters if
> >> your requirements are not met.
> >>
> >> Just a reminder: HMMChineseTokenizerFactory does not handle traditional
> >> characters, as I noted in a previous post, so ICUTransformFilterFactory is
> >> an incomplete workaround.
> >>
> >> Jul 21, 2018 (Sat) 0:05 Walter Underwood :
> >>
> >>> I expect that this is the line that does the transformation:
> >>>
> >>>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> >>>
> >>> This mapping is a standard feature of ICU. More info on ICU transforms
> is
> >>> in this doc, though not much detail on this particular transform.
> >>>
> >>> http://userguide.icu-project.org/transforms/general
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wun...@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
>  On Jul 20, 2018, at 7:43 AM, Susheel Kumar 
> >>> wrote:
> 
>  I think so. I used the exact settings as on GitHub
> 
>  <fieldType name="text_cjk" class="solr.TextField"
>   positionIncrementGap="1" autoGeneratePhraseQueries="false">
>  <analyzer>
>    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>    <tokenizer class="solr.ICUTokenizerFactory"/>
>    <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>    <filter class="solr.ICUFoldingFilterFactory"/>
>    <filter class="solr.CJKBigramFilterFactory" han="true"
>     hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>  </analyzer>
>  </fieldType>
> 
> 
> 
>  On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <
> amanda.shu...@gmail.com
> 
>  wrote:
> 
> > Thanks! That does indeed look promising... This can be added on top
> of
> > Smart Chinese, right? Or is it an alternative?
> >
> >
> > --
> > Dr. Amanda Shuman
> > Post-doc researcher, University of Freiburg, The Maoist Legacy
> Project
> > 
> > PhD, University of California, Santa Cruz
> > http://www.amandashuman.net/
> > http://www.prchistoryresources.org/
> > Office: +49 (0) 761 203 4925
> >
> >
> > On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <
> susheel2...@gmail.com>
> > wrote:
> >
> >> I think CJKFoldingFilter will work for you. I put 舊小說 in the index and
> >> then each of A, B, C, or D in the query, and they seem to match; CJKFF
> >> is transforming the 舊 to 旧
> >>
> >> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <
> susheel2...@gmail.com>
> >> wrote:
> >>
> >>> I lack Chinese language knowledge, but if you want I can do a quick
> >>> test for you in the Analysis tab if you give me what to put in the
> >>> index and query windows...
> >>>
> >>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <
> susheel2...@gmail.com
> 
> >>> wrote:
> >>>
>  Have you tried to use CJKFoldingFilter
>  https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
> > cover
>  your use case but I am using this filter and so far no issues.
> 
>  Thnx
> 
>  On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
> > amanda.shu...@gmail.com
> >>>
>  wrote:
> 
> > Thanks, Alex - I have seen a few of those links but never
> considered
> > transliteration! We use lucene's Smart Chinese analyzer. The
> issue
> >>> is
> > basically what is laid out in the old blogspot post, namely this
> > point:
> >
> >
> > "Why approach CJK resource discovery differently?
> >
> > 2.  Search results must be as script agnostic as possible.
> >
> > There is more than one way to write each word. "Simplified"
> > characters
> > were
> > emphasized for printed materials in mainland China starting in
> the
> >> 1950s;
> > "Traditional" characters were used in printed materials prior 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Alexandre Rafalovitch
Would  ICUNormalizer2CharFilterFactory do? Or at least serve as a
template of what needs to be done.

Regards,
   Alex.

On 20 July 2018 at 12:40, Walter Underwood  wrote:
> Looks like we need a charfilter version of the ICU transforms. That could run 
> before the tokenizer.
>
> I’ve never built a charfilter, but it seems like this would be a good first 
> project for someone who wants to contribute.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida  
>> wrote:
>>
>> Exactly. More concretely, the starting point is: replacing your analyzer
>>
>> <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
>>
>> with
>>
>> <analyzer>
>>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>> </analyzer>
>>
>> and see if the results are as expected. Then look into other filters if
>> your requirements are not met.
>>
>> Just a reminder: HMMChineseTokenizerFactory does not handle traditional
>> characters, as I noted in a previous post, so ICUTransformFilterFactory is
>> an incomplete workaround.
>>
>> Jul 21, 2018 (Sat) 0:05 Walter Underwood :
>>
>>> I expect that this is the line that does the transformation:
>>>
>>>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>>
>>> This mapping is a standard feature of ICU. More info on ICU transforms is
>>> in this doc, though not much detail on this particular transform.
>>>
>>> http://userguide.icu-project.org/transforms/general
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>>
 On Jul 20, 2018, at 7:43 AM, Susheel Kumar 
>>> wrote:

 I think so. I used the exact settings as on GitHub

 <fieldType name="text_cjk" class="solr.TextField"
  positionIncrementGap="1" autoGeneratePhraseQueries="false">
 <analyzer>
   <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
   <tokenizer class="solr.ICUTokenizerFactory"/>
   <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
   <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
   <filter class="solr.ICUFoldingFilterFactory"/>
   <filter class="solr.CJKBigramFilterFactory" han="true"
    hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
 </analyzer>
 </fieldType>



 On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman >>>
 wrote:

> Thanks! That does indeed look promising... This can be added on top of
> Smart Chinese, right? Or is it an alternative?
>
>
> --
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> 
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
>
> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar 
> wrote:
>
>> I think CJKFoldingFilter will work for you. I put 舊小說 in the index and
>> then each of A, B, C, or D in the query, and they seem to match; CJKFF
>> is transforming the 舊 to 旧
>>
>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
>> wrote:
>>
>>> I lack Chinese language knowledge, but if you want I can do a quick
>>> test for you in the Analysis tab if you give me what to put in the
>>> index and query windows...
>>>
>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar >>>
>>> wrote:
>>>
 Have you tried to use CJKFoldingFilter
 https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
> cover
 your use case but I am using this filter and so far no issues.

 Thnx

 On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
> amanda.shu...@gmail.com
>>>
 wrote:

> Thanks, Alex - I have seen a few of those links but never considered
> transliteration! We use lucene's Smart Chinese analyzer. The issue
>>> is
> basically what is laid out in the old blogspot post, namely this
> point:
>
>
> "Why approach CJK resource discovery differently?
>
> 2.  Search results must be as script agnostic as possible.
>
> There is more than one way to write each word. "Simplified"
> characters
> were
> emphasized for printed materials in mainland China starting in the
>> 1950s;
> "Traditional" characters were used in printed materials prior to the
> 1950s,
> and are still used in Taiwan, Hong Kong and Macau today.
> Since the characters are distinct, it's as if Chinese materials are
> written
> in two scripts.
> Another way to think about it:  every written Chinese word has at
> least
> two
> completely different spellings.  And it can be mix-n-match:  a word
> can
> be
> written with one traditional  and one simplified character.
> Example:   Given a user query 舊小說  (traditional for old fiction),
>>> the
> results should include matches for 舊小說 (traditional) and 旧小说
>> (simplified
> characters for old fiction)"
>
> So, using the example provided above, we are dealing with materials
> produced in the 1950s-1970s that do even weirder 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Walter Underwood
Looks like we need a charfilter version of the ICU transforms. That could run 
before the tokenizer.

I’ve never built a charfilter, but it seems like this would be a good first 
project for someone who wants to contribute.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida  
> wrote:
> 
> Exactly. More concretely, the starting point is: replacing your analyzer
> 
> <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
> 
> with
> 
> <analyzer>
>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> </analyzer>
> 
> and see if the results are as expected. Then look into other filters if
> your requirements are not met.
> 
> Just a reminder: HMMChineseTokenizerFactory does not handle traditional
> characters, as I noted in a previous post, so ICUTransformFilterFactory is
> an incomplete workaround.
> 
> Jul 21, 2018 (Sat) 0:05 Walter Underwood :
> 
>> I expect that this is the line that does the transformation:
>> 
>>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>> 
>> This mapping is a standard feature of ICU. More info on ICU transforms is
>> in this doc, though not much detail on this particular transform.
>> 
>> http://userguide.icu-project.org/transforms/general
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Jul 20, 2018, at 7:43 AM, Susheel Kumar 
>> wrote:
>>> 
>>> I think so. I used the exact settings as on GitHub
>>> 
>>> <fieldType name="text_cjk" class="solr.TextField"
>>>  positionIncrementGap="1" autoGeneratePhraseQueries="false">
>>> <analyzer>
>>>   <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>>>   <tokenizer class="solr.ICUTokenizerFactory"/>
>>>   <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>>>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>>   <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>>>   <filter class="solr.ICUFoldingFilterFactory"/>
>>>   <filter class="solr.CJKBigramFilterFactory" han="true"
>>>    hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>>> </analyzer>
>>> </fieldType>
>>> 
>>> 
>>> 
>>> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman >> 
>>> wrote:
>>> 
 Thanks! That does indeed look promising... This can be added on top of
 Smart Chinese, right? Or is it an alternative?
 
 
 --
 Dr. Amanda Shuman
 Post-doc researcher, University of Freiburg, The Maoist Legacy Project
 
 PhD, University of California, Santa Cruz
 http://www.amandashuman.net/
 http://www.prchistoryresources.org/
 Office: +49 (0) 761 203 4925
 
 
 On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar 
 wrote:
 
> I think CJKFoldingFilter will work for you. I put 舊小說 in the index and
> then each of A, B, C, or D in the query, and they seem to match; CJKFF
> is transforming the 舊 to 旧
> 
> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
> wrote:
> 
>> I lack Chinese language knowledge, but if you want I can do a quick
>> test for you in the Analysis tab if you give me what to put in the
>> index and query windows...
>> 
>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar >> 
>> wrote:
>> 
>>> Have you tried to use CJKFoldingFilter
>>> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
 cover
>>> your use case but I am using this filter and so far no issues.
>>> 
>>> Thnx
>>> 
>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
 amanda.shu...@gmail.com
>> 
>>> wrote:
>>> 
 Thanks, Alex - I have seen a few of those links but never considered
 transliteration! We use lucene's Smart Chinese analyzer. The issue
>> is
 basically what is laid out in the old blogspot post, namely this
 point:
 
 
 "Why approach CJK resource discovery differently?
 
 2.  Search results must be as script agnostic as possible.
 
 There is more than one way to write each word. "Simplified"
 characters
 were
 emphasized for printed materials in mainland China starting in the
> 1950s;
 "Traditional" characters were used in printed materials prior to the
 1950s,
 and are still used in Taiwan, Hong Kong and Macau today.
 Since the characters are distinct, it's as if Chinese materials are
 written
 in two scripts.
 Another way to think about it:  every written Chinese word has at
 least
 two
 completely different spellings.  And it can be mix-n-match:  a word
 can
 be
 written with one traditional  and one simplified character.
 Example:   Given a user query 舊小說  (traditional for old fiction),
>> the
 results should include matches for 舊小說 (traditional) and 旧小说
> (simplified
 characters for old fiction)"
 
 So, using the example provided above, we are dealing with materials
 produced in the 1950s-1970s that do even weirder things like:
 
 A. 舊小說
 
 can also be
 
 B. 旧小说 (all simplified)
 or
 C. 旧小說 (first character simplified, last character traditional)
 or
 D. 舊小 说 (first character traditional, last character simplified)
 
 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Tomoko Uchida
Exactly. More concretely, the starting point is: replacing your analyzer

<analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>

with

<analyzer>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>

and see if the results are as expected. Then look into other filters if
your requirements are not met.

Just a reminder: HMMChineseTokenizerFactory does not handle traditional
characters, as I noted in a previous post, so ICUTransformFilterFactory is
an incomplete workaround.

Jul 21, 2018 (Sat) 0:05 Walter Underwood :

> I expect that this is the line that does the transformation:
>
>    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>
> This mapping is a standard feature of ICU. More info on ICU transforms is
> in this doc, though not much detail on this particular transform.
>
> http://userguide.icu-project.org/transforms/general
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jul 20, 2018, at 7:43 AM, Susheel Kumar 
> wrote:
> >
> > I think so. I used the exact settings as on GitHub
> >
> > <fieldType name="text_cjk" class="solr.TextField"
> >  positionIncrementGap="1" autoGeneratePhraseQueries="false">
> >  <analyzer>
> >    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
> >    <tokenizer class="solr.ICUTokenizerFactory"/>
> >    <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
> >    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> >    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
> >    <filter class="solr.ICUFoldingFilterFactory"/>
> >    <filter class="solr.CJKBigramFilterFactory" han="true"
> >     hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
> >  </analyzer>
> > </fieldType>
> >
> >
> >
> > On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman  >
> > wrote:
> >
> >> Thanks! That does indeed look promising... This can be added on top of
> >> Smart Chinese, right? Or is it an alternative?
> >>
> >>
> >> --
> >> Dr. Amanda Shuman
> >> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> >> 
> >> PhD, University of California, Santa Cruz
> >> http://www.amandashuman.net/
> >> http://www.prchistoryresources.org/
> >> Office: +49 (0) 761 203 4925
> >>
> >>
> >> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar 
> >> wrote:
> >>
> >>> I think CJKFoldingFilter will work for you. I put 舊小說 in the index and
> >>> then each of A, B, C, or D in the query, and they seem to match; CJKFF
> >>> is transforming the 舊 to 旧
> >>>
> >>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
> >>> wrote:
> >>>
>  I lack Chinese language knowledge, but if you want I can do a quick
>  test for you in the Analysis tab if you give me what to put in the
>  index and query windows...
> 
>  On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar  >
>  wrote:
> 
> > Have you tried to use CJKFoldingFilter
> > https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
> >> cover
> > your use case but I am using this filter and so far no issues.
> >
> > Thnx
> >
> > On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
> >> amanda.shu...@gmail.com
> 
> > wrote:
> >
> >> Thanks, Alex - I have seen a few of those links but never considered
> >> transliteration! We use lucene's Smart Chinese analyzer. The issue
> is
> >> basically what is laid out in the old blogspot post, namely this
> >> point:
> >>
> >>
> >> "Why approach CJK resource discovery differently?
> >>
> >> 2.  Search results must be as script agnostic as possible.
> >>
> >> There is more than one way to write each word. "Simplified"
> >> characters
> >> were
> >> emphasized for printed materials in mainland China starting in the
> >>> 1950s;
> >> "Traditional" characters were used in printed materials prior to the
> >> 1950s,
> >> and are still used in Taiwan, Hong Kong and Macau today.
> >> Since the characters are distinct, it's as if Chinese materials are
> >> written
> >> in two scripts.
> >> Another way to think about it:  every written Chinese word has at
> >> least
> >> two
> >> completely different spellings.  And it can be mix-n-match:  a word
> >> can
> >> be
> >> written with one traditional  and one simplified character.
> >> Example:   Given a user query 舊小說  (traditional for old fiction),
> the
> >> results should include matches for 舊小說 (traditional) and 旧小说
> >>> (simplified
> >> characters for old fiction)"
> >>
> >> So, using the example provided above, we are dealing with materials
> >> produced in the 1950s-1970s that do even weirder things like:
> >>
> >> A. 舊小說
> >>
> >> can also be
> >>
> >> B. 旧小说 (all simplified)
> >> or
> >> C. 旧小說 (first character simplified, last character traditional)
> >> or
> >> D. 舊小 说 (first character traditional, last character simplified)
> >>
> >> Thankfully the middle character was never simplified in recent
> times.
> >>
> >> From a historical standpoint, the mixed nature of the characters in
> >> the
> >> same word/phrase is because not all simplified characters were
> >> adopted
> >>> at
> >> the same time by everyone uniformly (good times...).
> >>
> >> The problem seems to be that Solr can easily handle A or B above,
> but
> >> NOT C
> >> or D using the Smart Chinese analyzer. I'm not really 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Walter Underwood
I expect that this is the line that does the transformation:

   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>

This mapping is a standard feature of ICU. More info on ICU transforms is in 
this doc, though not much detail on this particular transform. 

http://userguide.icu-project.org/transforms/general
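
For instance, with ICU4J on the classpath you can exercise the same transform
directly; a minimal sketch (the demo class here is mine, not part of ICU):

import com.ibm.icu.text.Transliterator;

public class TradSimpDemo {
    public static void main(String[] args) {
        // "Traditional-Simplified" is the same transform id used in the filter
        Transliterator t = Transliterator.getInstance("Traditional-Simplified");
        System.out.println(t.transliterate("舊小說")); // prints 旧小说
    }
}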

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 20, 2018, at 7:43 AM, Susheel Kumar  wrote:
> 
> I think so. I used the exact settings as on GitHub
> 
> <fieldType name="text_cjk" class="solr.TextField"
>  positionIncrementGap="1" autoGeneratePhraseQueries="false">
>  <analyzer>
>    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>    <tokenizer class="solr.ICUTokenizerFactory"/>
>    <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>    <filter class="solr.ICUFoldingFilterFactory"/>
>    <filter class="solr.CJKBigramFilterFactory" han="true"
>     hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>  </analyzer>
> </fieldType>
> 
> 
> 
> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman 
> wrote:
> 
>> Thanks! That does indeed look promising... This can be added on top of
>> Smart Chinese, right? Or is it an alternative?
>> 
>> 
>> --
>> Dr. Amanda Shuman
>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>> 
>> PhD, University of California, Santa Cruz
>> http://www.amandashuman.net/
>> http://www.prchistoryresources.org/
>> Office: +49 (0) 761 203 4925
>> 
>> 
>> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar 
>> wrote:
>> 
>>> I think CJKFoldingFilter will work for you. I put 舊小說 in the index and then
>>> each of A, B, C, or D in the query, and they seem to match; CJKFF
>>> is transforming the 舊 to 旧
>>> 
>>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
>>> wrote:
>>> 
 I lack Chinese language knowledge, but if you want I can do a quick
 test for you in the Analysis tab if you give me what to put in the
 index and query windows...
 
 On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar 
 wrote:
 
> Have you tried to use CJKFoldingFilter
> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
>> cover
> your use case but I am using this filter and so far no issues.
> 
> Thnx
> 
> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
>> amanda.shu...@gmail.com
 
> wrote:
> 
>> Thanks, Alex - I have seen a few of those links but never considered
>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
>> basically what is laid out in the old blogspot post, namely this
>> point:
>> 
>> 
>> "Why approach CJK resource discovery differently?
>> 
>> 2.  Search results must be as script agnostic as possible.
>> 
>> There is more than one way to write each word. "Simplified"
>> characters
>> were
>> emphasized for printed materials in mainland China starting in the
>>> 1950s;
>> "Traditional" characters were used in printed materials prior to the
>> 1950s,
>> and are still used in Taiwan, Hong Kong and Macau today.
>> Since the characters are distinct, it's as if Chinese materials are
>> written
>> in two scripts.
>> Another way to think about it:  every written Chinese word has at
>> least
>> two
>> completely different spellings.  And it can be mix-n-match:  a word
>> can
>> be
>> written with one traditional  and one simplified character.
>> Example:   Given a user query 舊小說  (traditional for old fiction), the
>> results should include matches for 舊小說 (traditional) and 旧小说
>>> (simplified
>> characters for old fiction)"
>> 
>> So, using the example provided above, we are dealing with materials
>> produced in the 1950s-1970s that do even weirder things like:
>> 
>> A. 舊小說
>> 
>> can also be
>> 
>> B. 旧小说 (all simplified)
>> or
>> C. 旧小說 (first character simplified, last character traditional)
>> or
>> D. 舊小 说 (first character traditional, last character simplified)
>> 
>> Thankfully the middle character was never simplified in recent times.
>> 
>> From a historical standpoint, the mixed nature of the characters in
>> the
>> same word/phrase is because not all simplified characters were
>> adopted
>>> at
>> the same time by everyone uniformly (good times...).
>> 
>> The problem seems to be that Solr can easily handle A or B above, but
>> NOT C
>> or D using the Smart Chinese analyzer. I'm not really sure how to
>>> change
>> that at this point... maybe I should figure out how to contact the
>> creators
>> of the analyzer and ask them?
>> 
>> Amanda
>> 
>> --
>> Dr. Amanda Shuman
>> Post-doc researcher, University of Freiburg, The Maoist Legacy
>> Project
>> 
>> PhD, University of California, Santa Cruz
>> http://www.amandashuman.net/
>> http://www.prchistoryresources.org/
>> Office: +49 (0) 761 203 4925
>> 
>> 
>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
>> arafa...@gmail.com>
>> wrote:
>> 
>>> This is probably your start, if not read already:
>>> 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Susheel Kumar
I think so. I used the exact settings as on GitHub

<fieldType name="text_cjk" class="solr.TextField"
 positionIncrementGap="1" autoGeneratePhraseQueries="false">
  <analyzer>
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory" han="true"
     hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
  </analyzer>
</fieldType>



On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman 
wrote:

> Thanks! That does indeed look promising... This can be added on top of
> Smart Chinese, right? Or is it an alternative?
>
>
> --
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> 
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
>
> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar 
> wrote:
>
> > I think CJKFoldingFilter will work for you. I put 舊小說 in the index and then
> > each of A, B, C, or D in the query, and they seem to match; CJKFF
> > is transforming the 舊 to 旧
> >
> > On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
> > wrote:
> >
> > > I lack Chinese language knowledge, but if you want I can do a quick
> > > test for you in the Analysis tab if you give me what to put in the
> > > index and query windows...
> > >
> > > On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar 
> > > wrote:
> > >
> > >> Have you tried to use CJKFoldingFilter
> > >> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
> cover
> > >> your use case but I am using this filter and so far no issues.
> > >>
> > >> Thnx
> > >>
> > >> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
> amanda.shu...@gmail.com
> > >
> > >> wrote:
> > >>
> > >>> Thanks, Alex - I have seen a few of those links but never considered
> > >>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
> > >>> basically what is laid out in the old blogspot post, namely this
> point:
> > >>>
> > >>>
> > >>> "Why approach CJK resource discovery differently?
> > >>>
> > >>> 2.  Search results must be as script agnostic as possible.
> > >>>
> > >>> There is more than one way to write each word. "Simplified"
> characters
> > >>> were
> > >>> emphasized for printed materials in mainland China starting in the
> > 1950s;
> > >>> "Traditional" characters were used in printed materials prior to the
> > >>> 1950s,
> > >>> and are still used in Taiwan, Hong Kong and Macau today.
> > >>> Since the characters are distinct, it's as if Chinese materials are
> > >>> written
> > >>> in two scripts.
> > >>> Another way to think about it:  every written Chinese word has at
> least
> > >>> two
> > >>> completely different spellings.  And it can be mix-n-match:  a word
> can
> > >>> be
> > >>> written with one traditional  and one simplified character.
> > >>> Example:   Given a user query 舊小說  (traditional for old fiction), the
> > >>> results should include matches for 舊小說 (traditional) and 旧小说
> > (simplified
> > >>> characters for old fiction)"
> > >>>
> > >>> So, using the example provided above, we are dealing with materials
> > >>> produced in the 1950s-1970s that do even weirder things like:
> > >>>
> > >>> A. 舊小說
> > >>>
> > >>> can also be
> > >>>
> > >>> B. 旧小说 (all simplified)
> > >>> or
> > >>> C. 旧小說 (first character simplified, last character traditional)
> > >>> or
> > >>> D. 舊小 说 (first character traditional, last character simplified)
> > >>>
> > >>> Thankfully the middle character was never simplified in recent times.
> > >>>
> > >>> From a historical standpoint, the mixed nature of the characters in
> the
> > >>> same word/phrase is because not all simplified characters were
> adopted
> > at
> > >>> the same time by everyone uniformly (good times...).
> > >>>
> > >>> The problem seems to be that Solr can easily handle A or B above, but
> > >>> NOT C
> > >>> or D using the Smart Chinese analyzer. I'm not really sure how to
> > change
> > >>> that at this point... maybe I should figure out how to contact the
> > >>> creators
> > >>> of the analyzer and ask them?
> > >>>
> > >>> Amanda
> > >>>
> > >>> --
> > >>> Dr. Amanda Shuman
> > >>> Post-doc researcher, University of Freiburg, The Maoist Legacy
> Project
> > >>> 
> > >>> PhD, University of California, Santa Cruz
> > >>> http://www.amandashuman.net/
> > >>> http://www.prchistoryresources.org/
> > >>> Office: +49 (0) 761 203 4925
> > >>>
> > >>>
> > >>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
> > >>> arafa...@gmail.com>
> > >>> wrote:
> > >>>
> > >>> > This is probably your start, if not read already:
> > >>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> > >>> >
> > >>> > Otherwise, I think your answer would be somewhere around using
> ICU4J,
> > >>> > IBM's library for dealing with Unicode:
> http://site.icu-project.org/
> > >>> > (mentioned on the same page above)
> > >>> > Specifically, transformations:
> > >>> > http://userguide.icu-project.org/transforms/general
> > >>> >
> > >>> > With that, maybe you map both alphabets into latin. I did that once
> > >>> > for Thai for a demo:
> > >>> > https://github.com/arafalov/solr-thai-test/blob/master/
> > >>> > 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Tomoko Uchida
Hi,

There is ICUTransformFilter (included in the Solr distribution), which should
also work for you.
See the example settings:
https://lucene.apache.org/solr/guide/7_4/filter-descriptions.html#icu-transform-filter

Combine it with HMMChineseTokenizer.
https://lucene.apache.org/solr/guide/7_4/language-analysis.html#hmm-chinese-tokenizer

In other words, replace your SmartChineseAnalyzer settings with an
HMMChineseTokenizer & ICUTransformFilter pipeline.
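
For example, a minimal analyzer along those lines (just a sketch to show the
shape of the pipeline):

<analyzer>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>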


What follows is a somewhat complicated explanation, so you can skip it if
you do not want to go into analyzer details.

I do not understand Chinese, but it seems there are no easy, one-stop
solutions in my view. (We have similar problems with Japanese.)

HMMChineseTokenizer expects Simplified Chinese text.
See:
https://lucene.apache.org/core/7_4_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizer.html

So you should transform all traditional Chinese characters **before**
applying HMMChineseTokenizer, using CharFilters; otherwise the Tokenizer
does not work correctly.

Unfortunately, there is no such CharFilter as far as I know.
ICUNormalizer2CharFilter does not handle this transformation, so it is no
help. CJKFoldingFilter and ICUTransformFilter do the traditional-simplified
transformation; however, they are TokenFilters, which run after the
Tokenizer.

I think you need two steps if you want to use HMMChineseTokenizer correctly.

1. transform all traditional characters to simplified ones and save them to
temporary files.
I do not have a clear idea of how to do this, but you can create a Java
program that calls Lucene's ICUTransformFilter (see the sketch below).
2. then, index into Solr using SmartChineseAnalyzer.
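
As a rough illustration of step 1, a minimal sketch that drives the same
"Traditional-Simplified" transform through ICU4J's Transliterator (the class
that Lucene's ICUTransformFilter builds on); it assumes ICU4J on the
classpath and UTF-8 input, and the class name is mine:

import com.ibm.icu.text.Transliterator;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ToSimplified {
    public static void main(String[] args) throws Exception {
        // args[0]: traditional-Chinese input file, args[1]: simplified output
        Transliterator t = Transliterator.getInstance("Traditional-Simplified");
        String text = new String(Files.readAllBytes(Paths.get(args[0])),
                StandardCharsets.UTF_8);
        Files.write(Paths.get(args[1]),
                t.transliterate(text).getBytes(StandardCharsets.UTF_8));
    }
}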

Regards,
Tomoko

Jul 20, 2018 (Fri) 22:12 Susheel Kumar :

> I think CJKFoldingFilter will work for you. I put 舊小說 in the index and then
> each of A, B, C, or D in the query, and they seem to match; CJKFF is
> transforming the 舊 to 旧
>
> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
> wrote:
>
> > I lack Chinese language knowledge, but if you want I can do a quick
> > test for you in the Analysis tab if you give me what to put in the
> > index and query windows...
> >
> > On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar 
> > wrote:
> >
> >> Have you tried to use CJKFoldingFilter
> >> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would cover
> >> your use case but I am using this filter and so far no issues.
> >>
> >> Thnx
> >>
> >> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman  >
> >> wrote:
> >>
> >>> Thanks, Alex - I have seen a few of those links but never considered
> >>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
> >>> basically what is laid out in the old blogspot post, namely this point:
> >>>
> >>>
> >>> "Why approach CJK resource discovery differently?
> >>>
> >>> 2.  Search results must be as script agnostic as possible.
> >>>
> >>> There is more than one way to write each word. "Simplified" characters
> >>> were
> >>> emphasized for printed materials in mainland China starting in the
> 1950s;
> >>> "Traditional" characters were used in printed materials prior to the
> >>> 1950s,
> >>> and are still used in Taiwan, Hong Kong and Macau today.
> >>> Since the characters are distinct, it's as if Chinese materials are
> >>> written
> >>> in two scripts.
> >>> Another way to think about it:  every written Chinese word has at least
> >>> two
> >>> completely different spellings.  And it can be mix-n-match:  a word can
> >>> be
> >>> written with one traditional  and one simplified character.
> >>> Example:   Given a user query 舊小說  (traditional for old fiction), the
> >>> results should include matches for 舊小說 (traditional) and 旧小说
> (simplified
> >>> characters for old fiction)"
> >>>
> >>> So, using the example provided above, we are dealing with materials
> >>> produced in the 1950s-1970s that do even weirder things like:
> >>>
> >>> A. 舊小說
> >>>
> >>> can also be
> >>>
> >>> B. 旧小说 (all simplified)
> >>> or
> >>> C. 旧小說 (first character simplified, last character traditional)
> >>> or
> >>> D. 舊小 说 (first character traditional, last character simplified)
> >>>
> >>> Thankfully the middle character was never simplified in recent times.
> >>>
> >>> From a historical standpoint, the mixed nature of the characters in the
> >>> same word/phrase is because not all simplified characters were adopted
> at
> >>> the same time by everyone uniformly (good times...).
> >>>
> >>> The problem seems to be that Solr can easily handle A or B above, but
> >>> NOT C
> >>> or D using the Smart Chinese analyzer. I'm not really sure how to
> change
> >>> that at this point... maybe I should figure out how to contact the
> >>> creators
> >>> of the analyzer and ask them?
> >>>
> >>> Amanda
> >>>
> >>> --
> >>> Dr. Amanda Shuman
> >>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> >>> 
> >>> PhD, University of California, Santa Cruz
> >>> 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Amanda Shuman
Thanks! That does indeed look promising... This can be added on top of
Smart Chinese, right? Or is it an alternative?


--
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project

PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925


On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar 
wrote:

> I think CJKFoldingFilter will work for you. I put 舊小說 in the index and then
> each of A, B, C, or D in the query, and they seem to match; CJKFF is
> transforming the 舊 to 旧
>
> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
> wrote:
>
> > I lack Chinese language knowledge, but if you want I can do a quick
> > test for you in the Analysis tab if you give me what to put in the
> > index and query windows...
> >
> > On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar 
> > wrote:
> >
> >> Have you tried to use CJKFoldingFilter
> >> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would cover
> >> your use case but I am using this filter and so far no issues.
> >>
> >> Thnx
> >>
> >> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman  >
> >> wrote:
> >>
> >>> Thanks, Alex - I have seen a few of those links but never considered
> >>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
> >>> basically what is laid out in the old blogspot post, namely this point:
> >>>
> >>>
> >>> "Why approach CJK resource discovery differently?
> >>>
> >>> 2.  Search results must be as script agnostic as possible.
> >>>
> >>> There is more than one way to write each word. "Simplified" characters
> >>> were
> >>> emphasized for printed materials in mainland China starting in the
> 1950s;
> >>> "Traditional" characters were used in printed materials prior to the
> >>> 1950s,
> >>> and are still used in Taiwan, Hong Kong and Macau today.
> >>> Since the characters are distinct, it's as if Chinese materials are
> >>> written
> >>> in two scripts.
> >>> Another way to think about it:  every written Chinese word has at least
> >>> two
> >>> completely different spellings.  And it can be mix-n-match:  a word can
> >>> be
> >>> written with one traditional  and one simplified character.
> >>> Example:   Given a user query 舊小說  (traditional for old fiction), the
> >>> results should include matches for 舊小說 (traditional) and 旧小说
> (simplified
> >>> characters for old fiction)"
> >>>
> >>> So, using the example provided above, we are dealing with materials
> >>> produced in the 1950s-1970s that do even weirder things like:
> >>>
> >>> A. 舊小說
> >>>
> >>> can also be
> >>>
> >>> B. 旧小说 (all simplified)
> >>> or
> >>> C. 旧小說 (first character simplified, last character traditional)
> >>> or
> >>> D. 舊小 说 (first character traditional, last character simplified)
> >>>
> >>> Thankfully the middle character was never simplified in recent times.
> >>>
> >>> From a historical standpoint, the mixed nature of the characters in the
> >>> same word/phrase is because not all simplified characters were adopted
> at
> >>> the same time by everyone uniformly (good times...).
> >>>
> >>> The problem seems to be that Solr can easily handle A or B above, but
> >>> NOT C
> >>> or D using the Smart Chinese analyzer. I'm not really sure how to
> change
> >>> that at this point... maybe I should figure out how to contact the
> >>> creators
> >>> of the analyzer and ask them?
> >>>
> >>> Amanda
> >>>
> >>> --
> >>> Dr. Amanda Shuman
> >>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> >>> 
> >>> PhD, University of California, Santa Cruz
> >>> http://www.amandashuman.net/
> >>> http://www.prchistoryresources.org/
> >>> Office: +49 (0) 761 203 4925
> >>>
> >>>
> >>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
> >>> arafa...@gmail.com>
> >>> wrote:
> >>>
> >>> > This is probably your start, if not read already:
> >>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> >>> >
> >>> > Otherwise, I think your answer would be somewhere around using ICU4J,
> >>> > IBM's library for dealing with Unicode: http://site.icu-project.org/
> >>> > (mentioned on the same page above)
> >>> > Specifically, transformations:
> >>> > http://userguide.icu-project.org/transforms/general
> >>> >
> >>> > With that, maybe you map both alphabets into latin. I did that once
> >>> > for Thai for a demo:
> >>> > https://github.com/arafalov/solr-thai-test/blob/master/
> >>> > collection1/conf/schema.xml#L34
> >>> >
> >>> > The challenge is to figure out all the magic rules for that. You'd
> >>> > have to dig through the ICU documentation and other web pages. I
> found
> >>> > this one for example:
> >>> > http://avajava.com/tutorials/lessons/what-are-the-system-
> >>> > transliterators-available-with-icu4j.html;jsessionid=
> >>> > BEAB0AF05A588B97B8A2393054D908C0
> >>> >
> >>> > There is also a 12-part series on Solr and 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Susheel Kumar
I think CJKFoldingFilter will work for you. I put 舊小說 in the index and then
each of A, B, C, or D in the query, and they seem to match; CJKFF is
transforming the 舊 to 旧

On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
wrote:

> I lack Chinese language knowledge, but if you want I can do a quick test
> for you in the Analysis tab if you give me what to put in the index and
> query windows...
>
> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar 
> wrote:
>
>> Have you tried to use CJKFoldingFilter
>> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would cover
>> your use case but I am using this filter and so far no issues.
>>
>> Thnx
>>
>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman 
>> wrote:
>>
>>> Thanks, Alex - I have seen a few of those links but never considered
>>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
>>> basically what is laid out in the old blogspot post, namely this point:
>>>
>>>
>>> "Why approach CJK resource discovery differently?
>>>
>>> 2.  Search results must be as script agnostic as possible.
>>>
>>> There is more than one way to write each word. "Simplified" characters
>>> were
>>> emphasized for printed materials in mainland China starting in the 1950s;
>>> "Traditional" characters were used in printed materials prior to the
>>> 1950s,
>>> and are still used in Taiwan, Hong Kong and Macau today.
>>> Since the characters are distinct, it's as if Chinese materials are
>>> written
>>> in two scripts.
>>> Another way to think about it:  every written Chinese word has at least
>>> two
>>> completely different spellings.  And it can be mix-n-match:  a word can
>>> be
>>> written with one traditional  and one simplified character.
>>> Example:   Given a user query 舊小說  (traditional for old fiction), the
>>> results should include matches for 舊小說 (traditional) and 旧小说 (simplified
>>> characters for old fiction)"
>>>
>>> So, using the example provided above, we are dealing with materials
>>> produced in the 1950s-1970s that do even weirder things like:
>>>
>>> A. 舊小說
>>>
>>> can also be
>>>
>>> B. 旧小说 (all simplified)
>>> or
>>> C. 旧小說 (first character simplified, last character traditional)
>>> or
>>> D. 舊小 说 (first character traditional, last character simplified)
>>>
>>> Thankfully the middle character was never simplified in recent times.
>>>
>>> From a historical standpoint, the mixed nature of the characters in the
>>> same word/phrase is because not all simplified characters were adopted at
>>> the same time by everyone uniformly (good times...).
>>>
>>> The problem seems to be that Solr can easily handle A or B above, but
>>> NOT C
>>> or D using the Smart Chinese analyzer. I'm not really sure how to change
>>> that at this point... maybe I should figure out how to contact the
>>> creators
>>> of the analyzer and ask them?
>>>
>>> Amanda
>>>
>>> --
>>> Dr. Amanda Shuman
>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>> 
>>> PhD, University of California, Santa Cruz
>>> http://www.amandashuman.net/
>>> http://www.prchistoryresources.org/
>>> Office: +49 (0) 761 203 4925
>>>
>>>
>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
>>> arafa...@gmail.com>
>>> wrote:
>>>
>>> > This is probably your start, if not read already:
>>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>>> >
>>> > Otherwise, I think your answer would be somewhere around using ICU4J,
>>> > IBM's library for dealing with Unicode: http://site.icu-project.org/
>>> > (mentioned on the same page above)
>>> > Specifically, transformations:
>>> > http://userguide.icu-project.org/transforms/general
>>> >
>>> > With that, maybe you map both alphabets into latin. I did that once
>>> > for Thai for a demo:
>>> > https://github.com/arafalov/solr-thai-test/blob/master/
>>> > collection1/conf/schema.xml#L34
>>> >
>>> > The challenge is to figure out all the magic rules for that. You'd
>>> > have to dig through the ICU documentation and other web pages. I found
>>> > this one for example:
>>> > http://avajava.com/tutorials/lessons/what-are-the-system-
>>> > transliterators-available-with-icu4j.html;jsessionid=
>>> > BEAB0AF05A588B97B8A2393054D908C0
>>> >
>>> > There is also a 12-part series on Solr and Asian text processing, though
>>> > it is a bit old now: http://discovery-grindstone.blogspot.com/
>>> >
>>> > Hope one of these things helps.
>>> >
>>> > Regards,
>>> >Alex.
>>> >
>>> >
>>> > On 20 July 2018 at 03:54, Amanda Shuman 
>>> wrote:
>>> > > Hi all,
>>> > >
>>> > > We have a problem. Some of our historical documents have mixed together
>>> > > simplified and traditional Chinese characters. There seems to be no problem when
>>> > > searching either traditional or simplified separately - that is, if a
>>> > > particular string/phrase is all in traditional or simplified, it
>>> finds
>>> > it -
>>> > > but it does not find the string/phrase if the two different
>>> characters
>>> 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Susheel Kumar
I lack Chinese language knowledge, but if you want I can do a quick test
for you in the Analysis tab if you give me what to put in the index and
query windows...

On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar 
wrote:

> Have you tried to use CJKFoldingFilter
> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
> cover your use case but I am using this filter and so far no issues.
>
> Thnx
>
> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman 
> wrote:
>
>> Thanks, Alex - I have seen a few of those links but never considered
>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
>> basically what is laid out in the old blogspot post, namely this point:
>>
>>
>> "Why approach CJK resource discovery differently?
>>
>> 2.  Search results must be as script agnostic as possible.
>>
>> There is more than one way to write each word. "Simplified" characters
>> were
>> emphasized for printed materials in mainland China starting in the 1950s;
>> "Traditional" characters were used in printed materials prior to the
>> 1950s,
>> and are still used in Taiwan, Hong Kong and Macau today.
>> Since the characters are distinct, it's as if Chinese materials are
>> written
>> in two scripts.
>> Another way to think about it:  every written Chinese word has at least
>> two
>> completely different spellings.  And it can be mix-n-match:  a word can be
>> written with one traditional  and one simplified character.
>> Example:   Given a user query 舊小說  (traditional for old fiction), the
>> results should include matches for 舊小說 (traditional) and 旧小说 (simplified
>> characters for old fiction)"
>>
>> So, using the example provided above, we are dealing with materials
>> produced in the 1950s-1970s that do even weirder things like:
>>
>> A. 舊小說
>>
>> can also be
>>
>> B. 旧小说 (all simplified)
>> or
>> C. 旧小說 (first character simplified, last character traditional)
>> or
>> D. 舊小 说 (first character traditional, last character simplified)
>>
>> Thankfully the middle character was never simplified in recent times.
>>
>> From a historical standpoint, the mixed nature of the characters in the
>> same word/phrase is because not all simplified characters were adopted at
>> the same time by everyone uniformly (good times...).
>>
>> The problem seems to be that Solr can easily handle A or B above, but NOT
>> C
>> or D using the Smart Chinese analyzer. I'm not really sure how to change
>> that at this point... maybe I should figure out how to contact the
>> creators
>> of the analyzer and ask them?
>>
>> Amanda
>>
>> --
>> Dr. Amanda Shuman
>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>> 
>> PhD, University of California, Santa Cruz
>> http://www.amandashuman.net/
>> http://www.prchistoryresources.org/
>> Office: +49 (0) 761 203 4925
>>
>>
>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
>> arafa...@gmail.com>
>> wrote:
>>
>> > This is probably your start, if not read already:
>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>> >
>> > Otherwise, I think your answer would be somewhere around using ICU4J,
>> > IBM's library for dealing with Unicode: http://site.icu-project.org/
>> > (mentioned on the same page above)
>> > Specifically, transformations:
>> > http://userguide.icu-project.org/transforms/general
>> >
>> > With that, maybe you map both alphabets into latin. I did that once
>> > for Thai for a demo:
>> > https://github.com/arafalov/solr-thai-test/blob/master/
>> > collection1/conf/schema.xml#L34
>> >
>> > The challenge is to figure out all the magic rules for that. You'd
>> > have to dig through the ICU documentation and other web pages. I found
>> > this one for example:
>> > http://avajava.com/tutorials/lessons/what-are-the-system-
>> > transliterators-available-with-icu4j.html;jsessionid=
>> > BEAB0AF05A588B97B8A2393054D908C0
>> >
>> > There is also 12 part series on Solr and Asian text processing, though
>> > it is a bit old now: http://discovery-grindstone.blogspot.com/
>> >
>> > Hope one of these things help.
>> >
>> > Regards,
>> >Alex.
>> >
>> >
>> > On 20 July 2018 at 03:54, Amanda Shuman 
>> wrote:
>> > > Hi all,
>> > >
>> > > We have a problem. Some of our historical documents have mixed
>> together
>> > > simplified and Chinese characters. There seems to be no problem when
>> > > searching either traditional or simplified separately - that is, if a
>> > > particular string/phrase is all in traditional or simplified, it finds
>> > it -
>> > > but it does not find the string/phrase if the two different characters
>> > (one
>> > > traditional, one simplified) are mixed together in the SAME
>> > string/phrase.
>> > >
>> > > Has anyone ever handled this problem before? I know some libraries
>> seem
>> > to
>> > > have implemented something that seems to be able to handle this, but
>> I'm
>> > > not sure how they did so!
>> > >
>> > > Amanda
>> > > --
>> > > Dr. Amanda Shuman
>> > 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Susheel Kumar
Have you tried CJKFoldingFilter
(https://github.com/sul-dlss/CJKFoldingFilter)? I am not sure if it would
cover your use case, but I am using this filter and have had no issues so far.
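
For reference, a minimal fieldType wiring it in could look like the sketch
below. This is untested; the factory class name is my assumption based on
that project's package naming, and the surrounding filter chain is just one
common CJK setup, not a recommendation:

  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- the ICU tokenizer requires Solr's analysis-extras contrib -->
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <!-- fold traditional/variant CJK forms before further analysis -->
      <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory"/>
    </analyzer>
  </fieldType>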

Thnx


Re: Question regarding searching Chinese characters

2018-07-20 Thread Amanda Shuman
Thanks, Alex - I have seen a few of those links but never considered
transliteration! We use Lucene's Smart Chinese analyzer. The issue is
basically what is laid out in the old blogspot post, namely this point:


"Why approach CJK resource discovery differently?

2.  Search results must be as script agnostic as possible.

There is more than one way to write each word. "Simplified" characters were
emphasized for printed materials in mainland China starting in the 1950s;
"Traditional" characters were used in printed materials prior to the 1950s,
and are still used in Taiwan, Hong Kong and Macau today.
Since the characters are distinct, it's as if Chinese materials are written
in two scripts.
Another way to think about it:  every written Chinese word has at least two
completely different spellings.  And it can be mix-n-match:  a word can be
written with one traditional  and one simplified character.
Example:   Given a user query 舊小說  (traditional for old fiction), the
results should include matches for 舊小說 (traditional) and 旧小说 (simplified
characters for old fiction)"

So, using the example provided above, we are dealing with materials
produced in the 1950s-1970s that do even weirder things like:

A. 舊小說

can also be

B. 旧小说 (all simplified)
or
C. 旧小說 (first character simplified, last character traditional)
or
D. 舊小说 (first character traditional, last character simplified)

Thankfully the middle character was never simplified in recent times.

From a historical standpoint, the mixed nature of the characters in the
same word/phrase is because not all simplified characters were adopted at
the same time by everyone uniformly (good times...).

The problem seems to be that Solr can easily handle A or B above, but NOT C
or D using the Smart Chinese analyzer. I'm not really sure how to change
that at this point... maybe I should figure out how to contact the creators
of the analyzer and ask them?

Amanda

--
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project

PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925



Re: Question regarding searching Chinese characters

2018-07-20 Thread Alexandre Rafalovitch
This is probably your starting point, if you have not read it already:
https://lucene.apache.org/solr/guide/7_4/language-analysis.html

Otherwise, I think your answer would be somewhere around using ICU4J,
IBM's library for dealing with Unicode: http://site.icu-project.org/
(mentioned on the same page above)
Specifically, transformations:
http://userguide.icu-project.org/transforms/general
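
For instance (an untested sketch, assuming the ICU factories from Solr's
analysis-extras contrib are on the classpath), the built-in
Traditional-Simplified transform can be applied as a token filter:

  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>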

With that, maybe you map both alphabets into Latin. I did that once
for Thai for a demo:
https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
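
The same factory would express the to-Latin idea, e.g. with ICU's compound
transform id below (again just a sketch; whether transliterating Chinese to
Latin actually helps retrieval is a separate question):

  <filter class="solr.ICUTransformFilterFactory" id="Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC"/>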

The challenge is to figure out all the magic rules for that. You'd
have to dig through the ICU documentation and other web pages. I found
this one for example:
http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html;jsessionid=BEAB0AF05A588B97B8A2393054D908C0

There is also a 12-part series on Solr and Asian text processing, though
it is a bit old now: http://discovery-grindstone.blogspot.com/

Hope one of these things help.

Regards,
   Alex.


On 20 July 2018 at 03:54, Amanda Shuman  wrote:
> Hi all,
>
> We have a problem. Some of our historical documents have mixed together
> simplified and traditional Chinese characters. There seems to be no problem when
> searching either traditional or simplified separately - that is, if a
> particular string/phrase is all in traditional or simplified, it finds it -
> but it does not find the string/phrase if the two different characters (one
> traditional, one simplified) are mixed together in the SAME string/phrase.
>
> Has anyone ever handled this problem before? I know some libraries seem to
> have implemented something that seems to be able to handle this, but I'm
> not sure how they did so!
>
> Amanda
> --
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> 
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925