Re: Extending StandardTokenizer Jflex to not split on '/'

Diego Fernandez Mon, 17 Feb 2014 14:05:37 -0800

Hey Steve, thanks for the quick reply.  I didn't have a chance to test again 
until today.  In our Lucene build, we had already made some customizations to 
the JFlex file, and the build regenerates the Java file from it every time we 
build our project.  Unfortunately, it is still not working for me.  I diffed 
the generated Java file before and after the JFlex change, and here's the result:



*** 71,77 ****
    private static final String ZZ_CMAP_PACKED = 
      "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150"+
!     "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\136\1\152"+
      "\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141"+
      "\1\0\1\164\4\154\1\166\2\132\1\157\3\132\1\171\1\160\1\170"+
      "\2\132\1\167\1\132\1\165\3\132\1\153\2\132\57\0\1\132\2\0"+
--- 71,77 ----
     */
    private static final String ZZ_CMAP_PACKED = 
      "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150"+
!     "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\0\1\152"+
      "\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141"+
      "\1\0\1\164\4\154\1\166\2\132\1\157\3\132\1\171\1\160\1\170"+
      "\2\132\1\167\1\132\1\165\3\132\1\153\2\132\57\0\1\132\2\0"+
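
One way to read the diff above is to expand the packed map and look up the class of '/'. The sketch below assumes ZZ_CMAP_PACKED is a run-length list of (count, class) character pairs, which is how the generated zzUnpackCMap method appears to consume it. The names CmapPeek, STARRED, DASHED, unpack and classOf are made up for illustration, and only the first three lines of each string from the diff are used, which is enough to cover ASCII up to '_'.

```java
final class CmapPeek {
    // First three lines of ZZ_CMAP_PACKED from the "***" side of the diff.
    static final String STARRED =
        "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150" +
        "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\136\1\152" +
        "\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141";

    // Same three lines from the "---" side; only the entry for '/' differs.
    static final String DASHED =
        "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150" +
        "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\0\1\152" +
        "\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141";

    // Expand (count, class) pairs into one class number per code point.
    // Assumption: this mirrors the layout the generated scanner expects.
    static int[] unpack(String packed) {
        int total = 0;
        for (int i = 0; i < packed.length(); i += 2) {
            total += packed.charAt(i);
        }
        int[] map = new int[total];
        int j = 0;
        for (int i = 0; i < packed.length(); i += 2) {
            int count = packed.charAt(i);
            int value = packed.charAt(i + 1);
            for (int k = 0; k < count; k++) {
                map[j++] = value;
            }
        }
        return map;
    }

    // Character class assigned to c by the given packed prefix.
    static int classOf(String packed, char c) {
        return unpack(packed)[c];
    }

    public static void main(String[] args) {
        // Print classes in octal so they can be compared with the escapes in the diff.
        for (char c : new char[] {'/', ':', '.'}) {
            System.out.printf("'%c'  *** side: \\%o   --- side: \\%o%n",
                c, classOf(STARRED, c), classOf(DASHED, c));
        }
    }
}
```

Under that assumption, '/' shares class \136 with ':' on the "***" side and falls into class \0 on the "---" side, so it is worth double-checking which side of the diff corresponds to which build.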


Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics


----- Original Message -----
> Welcome Diego,
> 
> I think you’re right about MidLetter - adding a char to it should disable
> splitting on that char, as long as there is a letter on one side or the
> other.  (If you’d like that behavior to be extended to numeric digits, you
> should use MidNumLet instead.)
> 
> I tested this by adding “/” to MidLetter in StandardTokenizerImpl.jflex
> (compressed whitespace diff below):
> 
>     -MidLetter = (\p{WB:MidLetter}    | {MidLetterSupp})
>     +MidLetter = ([/\p{WB:MidLetter}] | {MidLetterSupp})
> 
> then running ‘ant jflex’ under lucene/analysis/common/, and the following
> text was split as indicated (I tested by adding the method below to
> TestStandardAnalyzer.java):
> 
>   public void testMidLetterSlash() throws Exception {
>     BaseTokenStreamTestCase.assertAnalyzesTo(a, "/one/two/three/ four",
>                                   new String[]{ "one/two/three", "four" });
>     BaseTokenStreamTestCase.assertAnalyzesTo(a, "1/two/3",
>                                  new String[] { "1", "two", "3" });
>   }
> 
> So it works for me - are you regenerating the scanner (‘ant jflex’)?
> 
> FYI, I found a bug when I was testing the above: “http://example.com” is left
> intact when “/” is added to MidLetter, but it shouldn’t be; although ‘:’ and
> ‘/’ are in [/\p{WB:MidLetter}], the letter-on-both-sides requirement should
> instead result in “http://example.com” being split into “http” and
> “example.com”.  Further testing indicates that this is a problem for
> MidLetter, MidNumLet and MidNum.  I’ve filed an issue:
> <https://issues.apache.org/jira/browse/LUCENE-5447>.
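
The letter-on-both-sides requirement described above can be sketched in plain Java, independent of Lucene and JFlex. This is a deliberately simplified illustration, not the StandardTokenizer grammar: it keeps a MidLetter character inside a token only when the characters immediately before and after it are letters, and it ignores the MidNum digit rules, Extend/Format characters and everything else in UAX #29. The names MidLetterSketch, tokenize, DEFAULT_MID and WITH_SLASH are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class MidLetterSketch {
    // A few ASCII characters that may sit inside a word (illustrative subset only).
    static final Set<Character> DEFAULT_MID =
        new HashSet<>(Arrays.asList(':', '.', '\''));

    // The same set with '/' added, mirroring the MidLetter change in the .jflex file.
    static final Set<Character> WITH_SLASH =
        new HashSet<>(Arrays.asList(':', '.', '\'', '/'));

    // Split text into runs of letters and digits; a "mid" character stays in the
    // current token only if it has a letter directly on both sides.
    static List<String> tokenize(String text, Set<Character> mid) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean keep;
            if (Character.isLetterOrDigit(c)) {
                keep = true;
            } else if (mid.contains(c)) {
                keep = i > 0 && i + 1 < text.length()
                    && Character.isLetter(text.charAt(i - 1))
                    && Character.isLetter(text.charAt(i + 1));
            } else {
                keep = false;
            }
            if (keep) {
                current.append(c);
            } else if (current.length() > 0) {
                // Any other character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("/one/two/three/ four", WITH_SLASH));
        System.out.println(tokenize("1/two/3", WITH_SLASH));
        System.out.println(tokenize("http://example.com", WITH_SLASH));
        System.out.println(tokenize("/one/two/three/ four", DEFAULT_MID));
    }
}
```

With '/' in the set, this sketch reproduces the two expectations from the test method above and splits "http://example.com" into "http" and "example.com", which is the behavior the rule calls for rather than what the generated scanner currently does.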
> 
> Steve
> 
> On Feb 14, 2014, at 1:42 PM, Diego Fernandez <difer...@redhat.com> wrote:
> 
> > Hi guys, this is my first time posting on the Lucene list, so hello
> > everyone.
> > 
> > I really like the way that the StandardTokenizer works, however I'd like
> > for it to not split tokens on / (forward slash).  I've been looking at
> > http://unicode.org/reports/tr29/#Default_Word_Boundaries to try to
> > understand the rules, but I'm either misunderstanding or missing
> > something.  If I understand correctly, the symbols in MidLetter keep it
> > from splitting a token as long as there's alpha chars on either side.  I
> > tried adding the forward slash to the MidLetter and MidLetterSupp rules
> > (tried different combinations), but it still seems like it's splitting on
> > it.
> > 
> > Does anyone have any tips or ideas?
> > 
> > Thanks
> > 
> > Diego Fernandez - 爱国
> > Software Engineer
> > US GSS Supportability - Diagnostics
> > 
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > 
> 
> 
> 
> 
