Hi Daniel,

I am developing a project which requires searching an HTTP response for user-entered strings. Since the response can be very large, I need an efficient pattern matching mechanism. Using AwkStreamInput with AwkMatcher works extremely well for me, except that it does not support Unicode. When do you intend to provide Unicode support for it?

I also tried MatchActionProcessor, but I am unable to get match results for Unicode strings. In a demo program, I try to search for a string read from 'unicoderegex.txt' in the file 'unicode.html'. Both files are stored in UTF-8 encoding. The unicode.html file contains a mix of English and Japanese characters. I changed the contents of unicoderegex.txt from plain English text, to mixed content, to Japanese-only content. I also tried specifying the content encoding while creating the InputStreamReader, but did not succeed in finding matches. I also tried commenting and uncommenting the line

    regex = getStringAsCodes(regex);

However, when I initialise the regex as 'regex = "Unicode";', I am able to find matches for the string 'Unicode'.

I went through the mailing lists but could not find any examples. Could you please tell me if I am doing anything wrong here, or is it that ORO does not support Unicode at all? Or do I have to set some flags to enable Unicode?

Another concern: is multiline matching possible with MatchActionProcessor?

Below is the source code for my demo program. For additional info, I used JDK 1.3 and jakarta-oro-2.0.7-dev-1.jar.

Any help is greatly appreciated.

Thanks a lot,
Aarti H.

===================================== CODE =================================
import java.io.*;
import org.apache.oro.text.*;
import org.apache.oro.text.regex.*;

public final class UnicodeDemo {

    public static final void main(String[] args) throws Exception {
        // init the regex from the first line of the pattern file
        FileInputStream fis = new FileInputStream("C:\\unicoderegex.txt");
        BufferedReader bf =
            new BufferedReader(new InputStreamReader(fis/*, "UTF-8"*/));
        String regex = bf.readLine();
        regex = getStringAsCodes(regex);
        //regex = "Unicode";
        System.out.println("regex = " + regex);
        bf.close();

        MatchActionProcessor processor = new MatchActionProcessor();
        processor.addAction(regex, new MatchAction() {
            // if a match is found, show it on console.
            public void processMatch(MatchActionInfo info) {
                info.output.println("match found = " + info.line);
            }
        });
        processor.processMatches(new FileInputStream("c:\\unicode.html"),
                                 System.out);
    }

    /**
     * takes a string which may contain unicode chars and returns a string
     * with the unicode chars replaced by their unicode codes.
     * Example return value: "\u00ef\u00bb\u00bf\u00e6\u2014"
     */
    private static String getStringAsCodes(String sName) {
        if (sName == null || sName.trim().length() == 0) {
            return sName;
        }
        final char[] chArray = sName.toCharArray();
        StringBuffer sReturnName = new StringBuffer();
        for (int i = 0; i < chArray.length; i++) {
            if (Character.UnicodeBlock.of(chArray[i])
                    != Character.UnicodeBlock.BASIC_LATIN) {
                // anything outside Basic Latin becomes a backslash-u code
                sReturnName.append(getUnicodeRepresentationOfChar(chArray[i]));
            } else {
                sReturnName.append(chArray[i]);
            }
        }
        return sReturnName.toString();
    }

    /** returns the char as a backslash-u code padded to four hex digits. */
    private static String getUnicodeRepresentationOfChar(char ch) {
        String s = Integer.toHexString(ch);
        final int iLen = s.length();
        for (int i = 0; i < 4 - iLen; i++) {
            s = "0" + s;
        }
        return "\\u" + s;
    }
}
===================================== CODE =================================
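
A note on the example return value in the listing above: the codes \u00ef\u00bb\u00bf are what the three bytes of a UTF-8 byte-order mark (EF BB BF) turn into when they are decoded with a single-byte charset, so the pattern file may have been read with the platform default encoding rather than UTF-8. The following is only a minimal sketch of that effect, not a statement about how ORO behaves. It uses nothing but the JDK, it assumes a newer JDK than 1.3 (StandardCharsets and try-with-resources), and the class name Utf8PatternReader, its two helper methods, and the sample bytes are made up for illustration. ISO-8859-1 stands in for the platform default because every JDK is required to support it.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of ORO or of the demo program.
public final class Utf8PatternReader {

    /** Reads the first line of the stream, decoding its bytes with the given charset. */
    public static String readFirstLine(InputStream in, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line = reader.readLine();
            return line == null ? "" : line;
        }
    }

    /** Drops a leading U+FEFF, which some editors write at the start of UTF-8 files. */
    public static String stripBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == '\uFEFF') ? s.substring(1) : s;
    }

    public static void main(String[] args) throws IOException {
        // Made-up file contents: a UTF-8 byte-order mark followed by the
        // two Japanese characters U+65E5 and U+672C.
        byte[] fileBytes = {
            (byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
            (byte) 0xE6, (byte) 0x97, (byte) 0xA5,
            (byte) 0xE6, (byte) 0x9C, (byte) 0xAC
        };

        // Wrong charset: every byte becomes its own char, so the pattern
        // no longer contains the Japanese characters at all.
        String wrong = readFirstLine(new ByteArrayInputStream(fileBytes),
                                     StandardCharsets.ISO_8859_1);

        // Right charset: the bytes decode to U+FEFF plus two characters,
        // and the byte-order mark is removed before the string is used.
        String right = stripBom(readFirstLine(new ByteArrayInputStream(fileBytes),
                                              StandardCharsets.UTF_8));

        System.out.println("ISO-8859-1 decode length: " + wrong.length());
        System.out.println("UTF-8 decode length:      " + right.length());
    }
}
```

If the pattern is decoded this way, the HTML file would need to be decoded with the same charset before matching, otherwise the two sides still hold different characters.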