Hey Steve, thanks for the quick reply. I didn't have a chance to test again until today. In our Lucene build, we had already customized the JFlex file, and the Java file is regenerated whenever we build our project. Unfortunately, it is still not working for me. I diffed the generated Java file before and after the JFlex change, and here's the result:

*** 71,77 ****
    private static final String ZZ_CMAP_PACKED =
      "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150"+
!     "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\136\1\152"+
      "\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141"+
      "\1\0\1\164\4\154\1\166\2\132\1\157\3\132\1\171\1\160\1\170"+
      "\2\132\1\167\1\132\1\165\3\132\1\153\2\132\57\0\1\132\2\0"+
--- 71,77 ----
     */
    private static final String ZZ_CMAP_PACKED =
      "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150"+
!     "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\0\1\152"+
      "\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141"+
      "\1\0\1\164\4\154\1\166\2\132\1\157\3\132\1\171\1\160\1\170"+
      "\2\132\1\167\1\132\1\165\3\132\1\153\2\132\57\0\1\132\2\0"+

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics

----- Original Message -----
> Welcome Diego,
>
> I think you’re right about MidLetter - adding a char to it should disable
> splitting on that char, as long as there is a letter on one side or the
> other. (If you’d like that behavior to be extended to numeric digits, you
> should use MidNumLet instead.)
>
> I tested this by adding “/” to MidLetter in StandardTokenizerImpl.jflex
> (compressed whitespace diff below):
>
> -MidLetter = (\p{WB:MidLetter} | {MidLetterSupp})
> +MidLetter = ([/\p{WB:MidLetter}] | {MidLetterSupp})
>
> then running ‘ant jflex’ under lucene/analysis/common/, and the following
> text was split as indicated (I tested by adding the method below to
> TestStandardAnalyzer.java):
>
> public void testMidLetterSlash() throws Exception {
>   BaseTokenStreamTestCase.assertAnalyzesTo(a, "/one/two/three/ four",
>       new String[]{ "one/two/three", "four" });
>   BaseTokenStreamTestCase.assertAnalyzesTo(a, "1/two/3",
>       new String[] { "1", "two", "3" });
> }
>
> So it works for me - are you regenerating the scanner (‘ant jflex’)?
>
> FYI, I found a bug when I was testing the above: “http://example.com” is left
> intact when “/” is added to MidLetter, but it shouldn’t be; although ‘:’ and
> ‘/’ are in [/\p{WB:MidLetter}], the letter-on-both-sides requirement should
> instead result in “http://example.com” being split into “http” and
> “example.com”. Further testing indicates that this is a problem for
> MidLetter, MidNumLet and MidNum. I’ve filed an issue:
> <https://issues.apache.org/jira/browse/LUCENE-5447>.
>
> Steve
>
> On Feb 14, 2014, at 1:42 PM, Diego Fernandez <difer...@redhat.com> wrote:
>
> > Hi guys, this is my first time posting on the Lucene list, so hello
> > everyone.
> >
> > I really like the way that the StandardTokenizer works, however I'd like
> > for it to not split tokens on / (forward slash). I've been looking at
> > http://unicode.org/reports/tr29/#Default_Word_Boundaries to try to
> > understand the rules, but I'm either misunderstanding or missing
> > something. If I understand correctly, the symbols in MidLetter keep it
> > from splitting a token as long as there's alpha chars on either side. I
> > tried adding the forward slash to the MidLetter and MidLetterSupp rules
> > (tried different combinations), but it still seems like it's splitting on
> > it.
> >
> > Does anyone have any tips or ideas?
> >
> > Thanks
> >
> > Diego Fernandez - 爱国
> > Software Engineer
> > US GSS Supportability - Diagnostics
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
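
A note on reading the ZZ_CMAP_PACKED diff earlier in this message: the string appears to be run-length encoded as (count, value) pairs, which is the layout the generated zzUnpackCMap method reads. Under that assumption, the rough sketch below walks the pairs to find which character class a given code point lands in on each side of the diff. The class name CMapPeek and the helper classOf are made up for illustration, and only the first two lines of the table plus the start of the third are copied in, so that ':' (a standard MidLetter character) is covered for comparison.

```java
public class CMapPeek {
    // Excerpt of ZZ_CMAP_PACKED from the first side of the diff: lines 1-2
    // verbatim, plus the first three pairs of line 3. The escapes are Java
    // octal escapes; the rest of the table is omitted.
    public static final String FIRST =
        "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150" +
        "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\136\1\152" +
        "\11\155\1\136\1\137";

    // Same excerpt from the second side of the diff; only one pair differs.
    public static final String SECOND =
        "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150" +
        "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\0\1\152" +
        "\11\155\1\136\1\137";

    // Walk the (count, value) pairs until the run covering codePoint is
    // reached. Returns -1 if the excerpt ends before that code point.
    public static int classOf(String packed, int codePoint) {
        int covered = 0; // number of code points described so far
        for (int i = 0; i + 1 < packed.length(); i += 2) {
            int count = packed.charAt(i);
            int value = packed.charAt(i + 1);
            covered += count;
            if (codePoint < covered) {
                return value;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Print the class number (in octal, to match the escapes) for '/' and ':'.
        for (char c : new char[] {'/', ':'}) {
            System.out.printf("'%c': first=%o second=%o%n",
                c, classOf(FIRST, c), classOf(SECOND, c));
        }
    }
}
```

If the pair layout assumed here is right, '/' (code point 47) falls in class 136 on the first side of the diff, the same class as ':', and in class 0 on the second side. That would mean the JFlex edit does reach the generated table, so it may be worth double-checking which of the two generated files is the one actually compiled into the build.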