Thanks again for the help.  Upon further investigation I found that we weren't actually using our custom version of the analyzer, which explains why it wasn't doing what I thought it should.  When I have time to get back to it, I'll reconfigure it to use our custom tokenizer.
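For reference, the plan is to plug the regenerated scanner into our analyzer via createComponents(), roughly like the sketch below.  This is only a sketch: "CustomAnalyzer" and "CustomStandardTokenizer" are placeholder names for our own classes (the latter wrapping the scanner generated from our modified .jflex), and Version.LUCENE_46 just stands in for whatever match version we actually build against on Lucene 4.x.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.util.Version;

public final class CustomAnalyzer extends Analyzer {

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Use our patched tokenizer (placeholder name) instead of the stock StandardTokenizer
    Tokenizer source = new CustomStandardTokenizer(reader);
    // Version constant is a placeholder for the Lucene 4.x version we build against
    TokenStream result = new LowerCaseFilter(Version.LUCENE_46, source);
    return new TokenStreamComponents(source, result);
  }
}

The other thing I'll double-check is that this analyzer is the one actually registered at both index and query time, since the root cause here was that the stock analyzer was being used instead.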
Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics

----- Original Message -----
> Sorry, Diego, the generated scanner diff doesn't tell me anything.
>
> Since I was able to successfully make changes to the open source and get
> the desired behavior, I'm guessing you're: a) not using the same (versions
> of) tools as me; b) not using the same (version of the) source as me; or c)
> not testing what you think you're testing.
>
> So:
>
> What version of Lucene?  What version of JFlex?  Are you using the Lucene
> build system, or some other mechanism to generate the scanner?  (If so,
> what is it?)
>
> What other changes have you made?  (If you send me your grammar, I'll test
> it locally.)
>
> Can you give an example of an input that should be split but isn't?
>
> Are you sure you're testing the scanner generated from the modified grammar?
>
>
> On Mon, Feb 17, 2014 at 5:04 PM, Diego Fernandez <difer...@redhat.com> wrote:
>
> > Hey Steve, thanks for the quick reply.  I didn't have a chance to test
> > again until today.  In our Lucene build, we had already made some
> > customizations to the JFlex file, and it regenerates the Java file whenever
> > we build our project.  Unfortunately, it is still not working for me.  I
> > diffed the generated Java file before and after the JFlex change and here's
> > the result:
> >
> > *** 71,77 ****
> >   private static final String ZZ_CMAP_PACKED =
> >     "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150"+
> > !   "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\136\1\152"+
> >     "\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141"+
> >     "\1\0\1\164\4\154\1\166\2\132\1\157\3\132\1\171\1\160\1\170"+
> >     "\2\132\1\167\1\132\1\165\3\132\1\153\2\132\57\0\1\132\2\0"+
> > --- 71,77 ----
> >   */
> >   private static final String ZZ_CMAP_PACKED =
> >     "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150"+
> > !   "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\0\1\152"+
> >     "\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141"+
> >     "\1\0\1\164\4\154\1\166\2\132\1\157\3\132\1\171\1\160\1\170"+
> >     "\2\132\1\167\1\132\1\165\3\132\1\153\2\132\57\0\1\132\2\0"+
> >
> >
> > Diego Fernandez - 爱国
> > Software Engineer
> > US GSS Supportability - Diagnostics
> >
> >
> > ----- Original Message -----
> > > Welcome Diego,
> > >
> > > I think you're right about MidLetter - adding a char to it should disable
> > > splitting on that char, as long as there is a letter on one side or the
> > > other.  (If you'd like that behavior to be extended to numeric digits, you
> > > should use MidNumLet instead.)
> > >
> > > I tested this by adding "/" to MidLetter in StandardTokenizerImpl.jflex
> > > (compressed whitespace diff below):
> > >
> > > -MidLetter = (\p{WB:MidLetter} | {MidLetterSupp})
> > > +MidLetter = ([/\p{WB:MidLetter}] | {MidLetterSupp})
> > >
> > > then running 'ant jflex' under lucene/analysis/common/, and the following
> > > text was split as indicated (I tested by adding the method below to
> > > TestStandardAnalyzer.java):
> > >
> > > public void testMidLetterSlash() throws Exception {
> > >   BaseTokenStreamTestCase.assertAnalyzesTo(a, "/one/two/three/ four",
> > >       new String[]{ "one/two/three", "four" });
> > >   BaseTokenStreamTestCase.assertAnalyzesTo(a, "1/two/3",
> > >       new String[] { "1", "two", "3" });
> > > }
> > >
> > > So it works for me - are you regenerating the scanner ('ant jflex')?
> > >
> > > FYI, I found a bug when I was testing the above: "http://example.com" is left
> > > intact when "/" is added to MidLetter, but it shouldn't be; although ':' and
> > > '/' are in [/\p{WB:MidLetter}], the letter-on-both-sides requirement should
> > > instead result in "http://example.com" being split into "http" and
> > > "example.com".  Further testing indicates that this is a problem for
> > > MidLetter, MidNumLet and MidNum.  I've filed an issue:
> > > <https://issues.apache.org/jira/browse/LUCENE-5447>.
> > >
> > > Steve
> > >
> > > On Feb 14, 2014, at 1:42 PM, Diego Fernandez <difer...@redhat.com> wrote:
> > >
> > > > Hi guys, this is my first time posting on the Lucene list, so hello
> > > > everyone.
> > > >
> > > > I really like the way that the StandardTokenizer works; however, I'd like
> > > > for it to not split tokens on / (forward slash).  I've been looking at
> > > > http://unicode.org/reports/tr29/#Default_Word_Boundaries to try to
> > > > understand the rules, but I'm either misunderstanding or missing
> > > > something.  If I understand correctly, the symbols in MidLetter keep it
> > > > from splitting a token as long as there are alpha chars on either side.  I
> > > > tried adding the forward slash to the MidLetter and MidLetterSupp rules
> > > > (tried different combinations), but it still seems like it's splitting on
> > > > it.
> > > >
> > > > Does anyone have any tips or ideas?
> > > >
> > > > Thanks
> > > >
> > > > Diego Fernandez - 爱国
> > > > Software Engineer
> > > > US GSS Supportability - Diagnostics

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org