Hey Steve, thanks for the quick reply. I didn't have a chance to test again
until today. In our Lucene build, we had already made some customizations to
the JFlex file, and the Java file is regenerated whenever we build our project.
Unfortunately, it is still not working for me. I diffed the generated Java
file before and after the JFlex change, and here's the result:
*** 71,77 ****
private static final String ZZ_CMAP_PACKED =
"\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150"+
! "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\136\1\152"+
"\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141"+
"\1\0\1\164\4\154\1\166\2\132\1\157\3\132\1\171\1\160\1\170"+
"\2\132\1\167\1\132\1\165\3\132\1\153\2\132\57\0\1\132\2\0"+
--- 71,77 ----
*/
private static final String ZZ_CMAP_PACKED =
"\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150"+
! "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\0\1\152"+
"\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141"+
"\1\0\1\164\4\154\1\166\2\132\1\157\3\132\1\171\1\160\1\170"+
"\2\132\1\167\1\132\1\165\3\132\1\153\2\132\57\0\1\132\2\0"+
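
To make the diff easier to read: as far as I can tell, ZZ_CMAP_PACKED is a run-length encoding of (count, character class) pairs, starting at code point 0. The sketch below is my own helper, not the generated scanner code; the class and method names (CmapDiff, unpack, changedCodePoints) are hypothetical. It expands only the first two string lines shown in each half of the diff and reports which code points changed class.

```java
import java.util.ArrayList;
import java.util.List;

final class CmapDiff {
    // First two string lines of ZZ_CMAP_PACKED, copied from each half of the diff.
    static final String FIRST =
        "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150" +
        "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\136\1\152";
    static final String SECOND =
        "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150" +
        "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\0\1\152";

    // Expand (count, value) pairs into one character-class entry per code point.
    // Assumes the packed string has an even length, as in the excerpts above.
    static char[] unpack(String packed) {
        int total = 0;
        for (int i = 0; i < packed.length(); i += 2) {
            total += packed.charAt(i);
        }
        char[] map = new char[total];
        int j = 0;
        for (int i = 0; i < packed.length(); i += 2) {
            int count = packed.charAt(i);
            char value = packed.charAt(i + 1);
            for (int k = 0; k < count; k++) {
                map[j++] = value;
            }
        }
        return map;
    }

    // Code points whose character class differs between the two maps.
    static List<Integer> changedCodePoints(char[] a, char[] b) {
        List<Integer> changed = new ArrayList<>();
        for (int c = 0; c < Math.min(a.length, b.length); c++) {
            if (a[c] != b[c]) {
                changed.add(c);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        char[] first = unpack(FIRST);
        char[] second = unpack(SECOND);
        for (int c : changedCodePoints(first, second)) {
            System.out.printf("U+%04X '%c': class %d -> class %d%n",
                c, (char) c, (int) first[c], (int) second[c]);
        }
    }
}
```

On these excerpts the only code point that differs is U+002F ('/'): in the first file it has class 94 (octal 136), the same class as '-' at U+002D, and in the second file it has class 0.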
Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics
----- Original Message -----
> Welcome Diego,
>
> I think you’re right about MidLetter - adding a char to it should disable
> splitting on that char, as long as there is a letter on both sides of it.
> (If you’d like that behavior to be extended to numeric digits, you should
> use MidNumLet instead.)
>
> I tested this by adding “/” to MidLetter in StandardTokenizerImpl.jflex
> (compressed whitespace diff below):
>
> -MidLetter = (\p{WB:MidLetter} | {MidLetterSupp})
> +MidLetter = ([/\p{WB:MidLetter}] | {MidLetterSupp})
>
> then running ‘ant jflex’ under lucene/analysis/common/, and the following
> text was split as indicated (I tested by adding the method below to
> TestStandardAnalyzer.java):
>
> public void testMidLetterSlash() throws Exception {
>   BaseTokenStreamTestCase.assertAnalyzesTo(a, "/one/two/three/ four",
>       new String[] { "one/two/three", "four" });
>   BaseTokenStreamTestCase.assertAnalyzesTo(a, "1/two/3",
>       new String[] { "1", "two", "3" });
> }
>
> So it works for me - are you regenerating the scanner (‘ant jflex’)?
>
> FYI, I found a bug when I was testing the above: “http://example.com” is left
> intact when “/” is added to MidLetter, but it shouldn’t be; although ‘:’ and
> ‘/’ are in [/\p{WB:MidLetter}], the letter-on-both-sides requirement should
> instead result in “http://example.com” being split into “http” and
> “example.com”. Further testing indicates that this is a problem for
> MidLetter, MidNumLet and MidNum. I’ve filed an issue:
> <https://issues.apache.org/jira/browse/LUCENE-5447>.
>
> Steve
>
> On Feb 14, 2014, at 1:42 PM, Diego Fernandez <[email protected]> wrote:
>
> > Hi guys, this is my first time posting on the Lucene list, so hello
> > everyone.
> >
> > I really like the way that the StandardTokenizer works, however I'd like
> > for it to not split tokens on / (forward slash). I've been looking at
> > http://unicode.org/reports/tr29/#Default_Word_Boundaries to try to
> > understand the rules, but I'm either misunderstanding or missing
> > something. If I understand correctly, the symbols in MidLetter keep the
> > tokenizer from splitting a token as long as there are alphabetic chars on
> > either side. I tried adding the forward slash to the MidLetter and
> > MidLetterSupp rules (in different combinations), but it still seems to
> > split on it.
> >
> > Does anyone have any tips or ideas?
> >
> > Thanks
> >
> > Diego Fernandez - 爱国
> > Software Engineer
> > US GSS Supportability - Diagnostics
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >