Greg Pendlebury created SOLR-5722:
-------------------------------------

             Summary: Add catenateShingles option to WordDelimiterFilter
                 Key: SOLR-5722
                 URL: https://issues.apache.org/jira/browse/SOLR-5722
             Project: Solr
          Issue Type: Improvement
            Reporter: Greg Pendlebury
            Priority: Minor


Apologies if I put this in the wrong spot. I'm attaching a patch (against 
current trunk) that adds support for a 'catenateShingles' option to the 
WordDelimiterFilter. 

We (National Library of Australia - NLA) are currently maintaining this as an 
internal modification to the Filter, but I believe it is generic enough to 
contribute upstream.

Description:
=========
{code}
/**
 * NLA Modification to the standard word delimiter to support various
 * hyphenation use cases. Primarily driven by requirements for
 * newspapers where words are often broken across line endings.
 *
 *  e.g. "hyphenated-surname" is printed across a line ending and
 *         turns out like "hyphen-ated-surname" or "hyphenated-sur-name".
 *
 *  In this scenario the stock filter, with 'catenateAll' turned on, will
 *  generate individual tokens plus one combined token, but not
 *  sub-tokens like "hyphenated surname" and "hyphenatedsur name".
 *
 *  So we add a new 'catenateShingles' option to achieve this.
 */
{code}
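
To make the intended output concrete, here is a rough standalone sketch. It is not taken from the patch. It assumes that "shingles" means every contiguous run of two or more sub-word parts joined together, which may differ from what the patch actually emits. The class name ShingleSketch and the method catenateShingles(List<String>) are hypothetical and exist only for this illustration; the real filter works on a token stream and also tracks positions and offsets.

```java
import java.util.ArrayList;
import java.util.List;

final class ShingleSketch {

    // Hypothetical helper, NOT the patch's code. Assumption: a "shingle" is
    // any contiguous run of two or more parts, concatenated without a separator.
    static List<String> catenateShingles(List<String> parts) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < parts.size(); start++) {
            StringBuilder sb = new StringBuilder(parts.get(start));
            for (int end = start + 1; end < parts.size(); end++) {
                // Extend the current run by one part and emit it.
                sb.append(parts.get(end));
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Parts as the delimiter would split "hyphen-ated-surname".
        System.out.println(catenateShingles(List.of("hyphen", "ated", "surname")));
        // Parts as the delimiter would split "hyphenated-sur-name".
        System.out.println(catenateShingles(List.of("hyphenated", "sur", "name")));
    }
}
```

Under that assumption, "hyphen-ated-surname" yields "hyphenated" alongside the existing "surname" part, so a query for "hyphenated surname" can match. Likewise "hyphenated-sur-name" yields "hyphenatedsur" alongside "name". The full-length run is the same token that 'catenateAll' already produces.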

The patch includes unit tests. As noted in one of them, CATENATE_WORDS and 
CATENATE_SHINGLES should be treated as mutually exclusive for sensible 
usage: enabling both can produce duplicate tokens (although the duplicates 
should have the same positions etc).

I'm happy to work on it more if anyone finds problems with it.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
