Welcome Diego,
I think you’re right about MidLetter - adding a char to it should disable
splitting on that char, as long as there is a letter on both sides of it.
(If you’d like that behavior to be extended to numeric digits, you should use
MidNumLet instead.)
I tested this by adding “/” to MidLetter in StandardTokenizerImpl.jflex
(compressed whitespace diff below):
-MidLetter = (\p{WB:MidLetter} | {MidLetterSupp})
+MidLetter = ([/\p{WB:MidLetter}] | {MidLetterSupp})
then running ‘ant jflex’ under lucene/analysis/common/, and the following text
was split as indicated (I tested by adding the method below to
TestStandardAnalyzer.java):
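  public void testMidLetterSlash() throws Exception {
    // Letters on both sides of '/': the slash no longer splits the token.
    BaseTokenStreamTestCase.assertAnalyzesTo(a, "/one/two/three/ four",
        new String[] { "one/two/three", "four" });
    // A digit on one side of '/': MidLetter does not apply, so the text is split.
    BaseTokenStreamTestCase.assertAnalyzesTo(a, "1/two/3",
        new String[] { "1", "two", "3" });
  }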
So it works for me - are you regenerating the scanner (‘ant jflex’)?
FYI, I found a bug when I was testing the above: “http://example.com” is left
intact when “/” is added to MidLetter, but it shouldn’t be; although ‘:’ and
‘/’ are in [/\p{WB:MidLetter}], the letter-on-both-sides requirement should
instead result in “http://example.com” being split into “http” and
“example.com”. Further testing indicates that this is a problem for MidLetter,
MidNumLet and MidNum. I’ve filed an issue:
<https://issues.apache.org/jira/browse/LUCENE-5447>.
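To make the letter-on-both-sides requirement concrete, here is a rough toy model in plain Java. It is not the JFlex-generated scanner and does not implement the full UAX #29 rules; the class MidLetterSketch and its tokenize() helper are hypothetical names made up for illustration, and all "mid" chars are passed in as one set rather than being split into MidLetter, MidNumLet and MidNum. It only shows the outcome I would expect for the inputs discussed above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the "letter on both sides" rule; NOT Lucene's StandardTokenizer. */
final class MidLetterSketch {

    /**
     * Hypothetical helper: splits text into runs of letters and digits, keeping a
     * char from midLetters inside a token only when the chars immediately before
     * and after it are both letters.
     */
    static List<String> tokenize(String text, String midLetters) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // A mid char joins two letters only if it sits directly between them.
            boolean joins = midLetters.indexOf(c) >= 0
                    && i > 0 && Character.isLetter(text.charAt(i - 1))
                    && i + 1 < text.length() && Character.isLetter(text.charAt(i + 1));
            if (Character.isLetterOrDigit(c) || joins) {
                current.append(c);
            } else if (current.length() > 0) {
                // Any other char ends the current token and is dropped.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // ':' and '/' are adjacent here, so neither has letters on both sides.
        System.out.println(tokenize("http://example.com", ":/."));
        System.out.println(tokenize("/one/two/three/ four", "/"));
        System.out.println(tokenize("1/two/3", "/"));
    }
}
```

In this model ‘:’ is followed by ‘/’ rather than by a letter, so the token ends after “http”, while the ‘.’ in “example.com” has letters on both sides and is kept.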
Steve
On Feb 14, 2014, at 1:42 PM, Diego Fernandez <[email protected]> wrote:
> Hi guys, this is my first time posting on the Lucene list, so hello everyone.
>
> I really like the way that the StandardTokenizer works, however I'd like for
> it to not split tokens on / (forward slash). I've been looking at
> http://unicode.org/reports/tr29/#Default_Word_Boundaries to try to understand
> the rules, but I'm either misunderstanding or missing something. If I
> understand correctly, the symbols in MidLetter keep it from splitting a token
> as long as there's alpha chars on either side. I tried adding the forward
> slash to the MidLetter and MidLetterSupp rules (tried different
> combinations), but it still seems like it's splitting on it.
>
> Does anyone have any tips or ideas?
>
> Thanks
>
> Diego Fernandez - 爱国
> Software Engineer
> US GSS Supportability - Diagnostics
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>