Thanks again for the help.  Upon further investigation I found that we weren't actually using our custom version of the analyzer, which explains why it wasn't doing what I thought it should.  When I have time to get back to it, I'll reconfigure it to use our custom tokenizer.
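For reference, the plan is to plug the regenerated scanner into our analyzer via createComponents(), roughly like the sketch below.  This is only a sketch: "CustomAnalyzer" and "CustomStandardTokenizer" are placeholder names for our own classes (the latter wrapping the scanner generated from our modified .jflex), and Version.LUCENE_46 just stands in for whatever match version we actually build against on Lucene 4.x.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.util.Version;

public final class CustomAnalyzer extends Analyzer {

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Use our patched tokenizer (placeholder name) instead of the stock StandardTokenizer
    Tokenizer source = new CustomStandardTokenizer(reader);
    // Version constant is a placeholder for the Lucene 4.x version we build against
    TokenStream result = new LowerCaseFilter(Version.LUCENE_46, source);
    return new TokenStreamComponents(source, result);
  }
}

The other thing I'll double-check is that this analyzer is the one actually registered at both index and query time, since the root cause here was that the stock analyzer was being used instead.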
Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics

----- Original Message -----
> Sorry, Diego, the generated scanner diff doesn't tell me anything.
>
> Since I was able to successfully make changes to the open source and get
> the desired behavior, I'm guessing you're: a) not using the same (versions
> of) tools as me; b) not using the same (version of the) source as me; or c)
> not testing what you think you're testing.
>
> So:
>
> What version of Lucene?  What version of JFlex?  Are you using the Lucene
> build system, or some other mechanism to generate the scanner?  (If so,
> what is it?)
>
> What other changes have you made?  (If you send me your grammar, I'll test
> it locally.)
>
> Can you give an example of an input that should be split but isn't?
>
> Are you sure you're testing the scanner generated from the modified grammar?
>
>
> On Mon, Feb 17, 2014 at 5:04 PM, Diego Fernandez <difer...@redhat.com> wrote:
>
> > Hey Steve, thanks for the quick reply.  I didn't have a chance to test
> > again until today.  In our Lucene build, we had already made some
> > customizations to the JFlex file, and it regenerates the Java file whenever
> > we build our project.  Unfortunately, it is still not working for me.  I
> > diffed the generated Java file before and after the JFlex change and here's
> > the result:
> >
> > *** 71,77 ****
> >   private static final String ZZ_CMAP_PACKED =
> >     "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150"+
> > !   "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\136\1\152"+
> >     "\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141"+
> >     "\1\0\1\164\4\154\1\166\2\132\1\157\3\132\1\171\1\160\1\170"+
> >     "\2\132\1\167\1\132\1\165\3\132\1\153\2\132\57\0\1\132\2\0"+
> > --- 71,77 ----
> >   */
> >   private static final String ZZ_CMAP_PACKED =
> >     "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150"+
> > !   "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\0\1\152"+
> >     "\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141"+
> >     "\1\0\1\164\4\154\1\166\2\132\1\157\3\132\1\171\1\160\1\170"+
> >     "\2\132\1\167\1\132\1\165\3\132\1\153\2\132\57\0\1\132\2\0"+
> >
> >
> > Diego Fernandez - 爱国
> > Software Engineer
> > US GSS Supportability - Diagnostics
> >
> >
> > ----- Original Message -----
> > > Welcome Diego,
> > >
> > > I think you're right about MidLetter - adding a char to it should disable
> > > splitting on that char, as long as there is a letter on one side or the
> > > other.  (If you'd like that behavior to be extended to numeric digits, you
> > > should use MidNumLet instead.)
> > >
> > > I tested this by adding "/" to MidLetter in StandardTokenizerImpl.jflex
> > > (compressed whitespace diff below):
> > >
> > > -MidLetter = (\p{WB:MidLetter} | {MidLetterSupp})
> > > +MidLetter = ([/\p{WB:MidLetter}] | {MidLetterSupp})
> > >
> > > then running 'ant jflex' under lucene/analysis/common/, and the following
> > > text was split as indicated (I tested by adding the method below to
> > > TestStandardAnalyzer.java):
> > >
> > > public void testMidLetterSlash() throws Exception {
> > >   BaseTokenStreamTestCase.assertAnalyzesTo(a, "/one/two/three/ four",
> > >       new String[]{ "one/two/three", "four" });
> > >   BaseTokenStreamTestCase.assertAnalyzesTo(a, "1/two/3",
> > >       new String[] { "1", "two", "3" });
> > > }
> > >
> > > So it works for me - are you regenerating the scanner ('ant jflex')?
> > >
> > > FYI, I found a bug when I was testing the above: "http://example.com" is left
> > > intact when "/" is added to MidLetter, but it shouldn't be; although ':' and
> > > '/' are in [/\p{WB:MidLetter}], the letter-on-both-sides requirement should
> > > instead result in "http://example.com" being split into "http" and
> > > "example.com".  Further testing indicates that this is a problem for
> > > MidLetter, MidNumLet and MidNum.  I've filed an issue:
> > > <https://issues.apache.org/jira/browse/LUCENE-5447>.
> > >
> > > Steve
> > >
> > > On Feb 14, 2014, at 1:42 PM, Diego Fernandez <difer...@redhat.com> wrote:
> > >
> > > > Hi guys, this is my first time posting on the Lucene list, so hello
> > > > everyone.
> > > >
> > > > I really like the way that the StandardTokenizer works; however, I'd like
> > > > for it to not split tokens on / (forward slash).  I've been looking at
> > > > http://unicode.org/reports/tr29/#Default_Word_Boundaries to try to
> > > > understand the rules, but I'm either misunderstanding or missing
> > > > something.  If I understand correctly, the symbols in MidLetter keep it
> > > > from splitting a token as long as there are alpha chars on either side.  I
> > > > tried adding the forward slash to the MidLetter and MidLetterSupp rules
> > > > (tried different combinations), but it still seems like it's splitting on
> > > > it.
> > > >
> > > > Does anyone have any tips or ideas?
> > > >
> > > > Thanks
> > > >
> > > > Diego Fernandez - 爱国
> > > > Software Engineer
> > > > US GSS Supportability - Diagnostics

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org