Re: Extending StandardTokenizer Jflex to not split on '/'

Steve Rowe Mon, 17 Feb 2014 14:56:27 -0800

Sorry, Diego, the generated scanner diff doesn't tell me anything.

Since I was able to successfully make changes to the open source and get
the desired behavior, I'm guessing you're: a) not using the same (versions
of) tools as me; b) not using the same (version of the) source as me; or c)
not testing what you think you're testing.


So:

What version of Lucene?  What version of JFlex?  Are you using the Lucene
build system, or some other mechanism to generate the scanner?  (If the
latter, what is it?)

What other changes have you made?  (If you send me your grammar, I'll test
it locally.)

Can you give an example of an input that should be split but isn't?

Are you sure you're testing the scanner generated from the modified grammar?
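
One way to check that from the generated file itself, rather than from a raw
diff of the packed string: assuming ZZ_CMAP_PACKED uses the usual JFlex
run-length layout of (count, class) character pairs starting at code point 0,
the first lines of the string can be unpacked to see which character class a
given char lands in.  This is only a minimal sketch; the class and method names
(CMapPeek, classOf) are made up for illustration, and the two constants are
copied from the changed lines of the diff quoted below.

```java
public class CMapPeek {
    // First two lines of ZZ_CMAP_PACKED as shown in the "***" hunk of the diff.
    static final String STAR_HUNK =
        "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150" +
        "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\136\1\152";

    // The same two lines as shown in the "---" hunk of the diff.
    static final String DASH_HUNK =
        "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150" +
        "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\0\1\152";

    /**
     * Returns the character class that c maps to, or -1 if c lies beyond the
     * part of the table covered by the given prefix.
     * Assumption: the string is a sequence of (count, class) pairs.
     */
    static int classOf(String packed, char c) {
        int next = 0; // first code point not yet covered by the pairs read so far
        for (int i = 0; i + 1 < packed.length(); i += 2) {
            int count = packed.charAt(i);
            int value = packed.charAt(i + 1);
            next += count;
            if (c < next) {
                return value;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Print the class (in octal, as in the generated source) on each side.
        for (char c : new char[] { ',', '-', '.', '/' }) {
            System.out.printf("'%c': %o -> %o%n",
                              c, classOf(STAR_HUNK, c), classOf(DASH_HUNK, c));
        }
    }
}
```

If that layout assumption holds, the only char whose class differs between the
two hunks is '/': it is in class 136 (octal) in the "***" hunk and in class 0
in the "---" hunk.  So it would be worth double-checking which side of the diff
is the scanner generated from the modified grammar.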



On Mon, Feb 17, 2014 at 5:04 PM, Diego Fernandez <[email protected]> wrote:

> Hey Steve, thanks for the quick reply.  I didn't have a chance to test
> again until today.  In our Lucene build, we had already made some
> customizations to the JFlex file, and it regenerates the Java file whenever
> we build our project.  Unfortunately, it is still not working for me.  I
> diffed the generated Java file before and after the JFlex change and here's
> the result:
>
>
> *** 71,77 ****
>     private static final String ZZ_CMAP_PACKED =
>       "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150"+
> !     "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\136\1\152"+
>       "\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141"+
>       "\1\0\1\164\4\154\1\166\2\132\1\157\3\132\1\171\1\160\1\170"+
>       "\2\132\1\167\1\132\1\165\3\132\1\153\2\132\57\0\1\132\2\0"+
> --- 71,77 ----
>      */
>     private static final String ZZ_CMAP_PACKED =
>       "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150"+
> !     "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\0\1\152"+
>       "\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141"+
>       "\1\0\1\164\4\154\1\166\2\132\1\157\3\132\1\171\1\160\1\170"+
>       "\2\132\1\167\1\132\1\165\3\132\1\153\2\132\57\0\1\132\2\0"+
>
>
> Diego Fernandez - 爱国
> Software Engineer
> US GSS Supportability - Diagnostics
>
>
> ----- Original Message -----
> > Welcome Diego,
> >
> > I think you’re right about MidLetter - adding a char to it should disable
> > splitting on that char, as long as there is a letter on both sides of
> > it.  (If you’d like that behavior to be extended to numeric digits, you
> > should use MidNumLet instead.)
> >
> > I tested this by adding “/” to MidLetter in StandardTokenizerImpl.jflex
> > (compressed whitespace diff below):
> >
> >     -MidLetter = (\p{WB:MidLetter}    | {MidLetterSupp})
> >     +MidLetter = ([/\p{WB:MidLetter}] | {MidLetterSupp})
> >
> > then running ‘ant jflex’ under lucene/analysis/common/, and the following
> > text was split as indicated (I tested by adding the method below to
> > TestStandardAnalyzer.java):
> >
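> >   public void testMidLetterSlash() throws Exception {
> >     // '/' between letters is kept; leading and trailing '/' are dropped
> >     BaseTokenStreamTestCase.assertAnalyzesTo(a, "/one/two/three/ four",
> >                                   new String[] { "one/two/three", "four" });
> >     // MidLetter does not apply next to digits, so these are still split
> >     BaseTokenStreamTestCase.assertAnalyzesTo(a, "1/two/3",
> >                                   new String[] { "1", "two", "3" });
> >   }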
> >
> > So it works for me - are you regenerating the scanner (‘ant jflex’)?
> >
> > FYI, I found a bug when I was testing the above: “http://example.com” is
> > left intact when “/” is added to MidLetter, but it shouldn’t be; although
> > ‘:’ and ‘/’ are in [/\p{WB:MidLetter}], the letter-on-both-sides
> > requirement should instead result in “http://example.com” being split
> > into “http” and “example.com”.  Further testing indicates that this is a
> > problem for MidLetter, MidNumLet and MidNum.  I’ve filed an issue:
> > <https://issues.apache.org/jira/browse/LUCENE-5447>.
> >
> > Steve
> >
> > On Feb 14, 2014, at 1:42 PM, Diego Fernandez <[email protected]> wrote:
> >
> > > Hi guys, this is my first time posting on the Lucene list, so hello
> > > everyone.
> > >
> > > I really like the way that the StandardTokenizer works; however, I'd
> > > like for it to not split tokens on / (forward slash).  I've been looking
> > > at http://unicode.org/reports/tr29/#Default_Word_Boundaries to try to
> > > understand the rules, but I'm either misunderstanding or missing
> > > something.  If I understand correctly, the symbols in MidLetter keep it
> > > from splitting a token as long as there are alpha chars on either side.
> > > I tried adding the forward slash to the MidLetter and MidLetterSupp rules
> > > (tried different combinations), but it still seems like it's splitting
> > > on it.
> > >
> > > Does anyone have any tips or ideas?
> > >
> > > Thanks
> > >
> > > Diego Fernandez - 爱国
> > > Software Engineer
> > > US GSS Supportability - Diagnostics
> > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
