[ 
https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419326#comment-13419326
 ] 

Lance Norskog commented on SOLR-3653:
-------------------------------------

bq. Actually there are factories in contrib/analysis-extras.
You're right, I was thinking of a previous project.
bq. I am not sure on this: if someone wants to mix an n-gram technique with a 
word model, they can just use two fields? If they want to limit the n-gram 
field to only longer terms, they should use LengthFilter.

Is this the design?
{code}
Word-based field: 
    SmartChineseWordTokenFilter -> LengthFilter accept 1-3 letters
Bigram-based field:
    SmartChineseWordTokenFilter -> LengthFilter accept 4 or longer -> Chinese-only bigrams
{code}
This works if the user searches for simple words, as on a consumer site. On a 
legal document site, people block-copy 60-word document titles and expect to 
find the matching title first on the list. This requires a phrase search where 
zero variation in position gives the exact title. If the two classes of terms 
are in two different fields, will that work? I did not think parsers did 
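
To make the position concern concrete, here is a minimal sketch in plain Python (not Lucene). It assumes a made-up segmentation of a title and a hypothetical split_by_length helper that routes tokens by length the way the two-field design above would. It only shows which token positions each field ends up holding; it does not model how a real query parser or PhraseQuery behaves.

```python
# Hypothetical segmentation of a title; a real run would come from the smartcn tokenizer.
TITLE_TOKENS = ["关于", "中华人民共和国", "的", "个人所得税", "法"]

def split_by_length(tokens, max_word_len=3):
    """Route (position, token) pairs into a word field and a long-term field."""
    word_field, long_field = [], []
    for position, token in enumerate(tokens):
        target = word_field if len(token) <= max_word_len else long_field
        target.append((position, token))
    return word_field, long_field

def positions(field):
    """Original token positions that survived into this field."""
    return [position for position, _ in field]

def is_contiguous(field):
    """True when the field's positions form an unbroken run."""
    pos = positions(field)
    return all(b - a == 1 for a, b in zip(pos, pos[1:]))

word_field, long_field = split_by_length(TITLE_TOKENS)
print("word field positions:", positions(word_field))
print("long-term field positions:", positions(long_field))
print("contiguous:", is_contiguous(word_field), is_contiguous(long_field))
```

Under these assumptions neither field holds the full run of positions, so an exact-title phrase would have to be checked against two partial sequences rather than one.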

Also, this design needs to allow for mixed language text: year numbers, English 
words. Are the existing Lucene filters flexible enough to do this?

bq. The word you are upset about (中华人民共和国) is in the smartcn dictionary. As I 
understand, this word basically means "PRC". This is a single concept and makes 
sense as an indexing unit. Why do we care how long it is in characters?

Because parts of it are also words, which should be searchable. Here are two 
more failed words: "个人所得税" (personal/individual "income tax") and "社会保险" 
(National Congress, political body). I can imagine Congress would be in the 
dictionary, but "personal income tax"? If you search for income tax, "所得税", 
you will not find personal income tax. This points to a flaw: the bigram trick 
will not find this trigram.
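
The miss can be sketched in a few lines of plain Python (not Lucene). The sketch assumes the two-field design from the {code} block above, assumes that smartcn emits "个人所得税" as a single token, and uses hypothetical helpers bigrams and index_terms. The 3-character cutoff mirrors the "1-3 letters" / "4 or longer" split.

```python
def bigrams(term):
    """Overlapping character bigrams of a term."""
    return [term[i:i + 2] for i in range(len(term) - 1)]

def index_terms(token, max_word_len=3):
    """Return (word_field_terms, bigram_field_terms) for one segmented token."""
    if len(token) <= max_word_len:
        return [token], []
    return [], bigrams(token)

# Assumption: the segmenter keeps each of these as one token.
doc_words, doc_bigrams = index_terms("个人所得税")
query_words, query_bigrams = index_terms("所得税")

# The 3-character query lands only in the word field; the 5-character
# document term lands only in the bigram field, so nothing overlaps.
word_hit = bool(set(query_words) & set(doc_words))
bigram_hit = bool(query_bigrams) and set(query_bigrams) <= set(doc_bigrams)
print("word field hit:", word_hit)
print("bigram field hit:", bigram_hit)

# For comparison: if the query were bigrammed regardless of its length,
# its bigrams would all be present among the document's bigrams.
forced = bigrams("所得税")
print("forced bigram hit:", set(forced) <= set(doc_bigrams))
```

In this toy model the length cutoff is what hides the term: the query is too short to reach the bigram field, and the document term is too long to reach the word field.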

How do you know what's in the dictionary? The files are in a .mem format. I 
can't find a main program for them.



                
> Support Smart Simplified Chinese in Solr - include clean-up bigramming filter
> -----------------------------------------------------------------------------
>
>                 Key: SOLR-3653
>                 URL: https://issues.apache.org/jira/browse/SOLR-3653
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Lance Norskog
>         Attachments: SOLR-3653.patch, SmartChineseType.pdf
>
>
> The "Smart" Simplified Chinese toolkit in lucene/analysis/smartcn has no Solr 
> factories. Also, since it is a statistical algorithm, it is not perfect.
> This patch supplies factories and a schema.xml type for the existing Lucene 
> Smart Chinese implementation, and includes a "fixup" class to handle the 
> occasional mistake made by the Smart Chinese implementation.

