---------- Forwarded message ---------- From: Deven You <devyo...@gmail.com> Date: 2010/7/8 Subject: Re: A question about android regex implementation To: enh <e...@google.com>
2010/7/8 enh <e...@google.com> On Wed, Jul 7, 2010 at 09:55, Jesse Wilson <jessewil...@google.com> wrote: > > On Mon, Jul 5, 2010 at 8:02 PM, Deven You <devyo...@gmail.com> wrote: > > > I have written some simple benchmarks for harmony regex and find the > > > performance of harmony is poor compared to RI. For example, > Mathcer.find() > > > only reach 60% of that of RI. > > i'd be interested to see your benchmark code. our regular expression > benchmarks actually focus on replacing use of Pattern/Matcher with > simpler fast paths where the full complexity isn't needed. for > example, most String.split calls don't actually use regular > expressions. i'm not sure how much of this work made it into froyo, > and it's certainly still a work in progress. > I first found harmony regex is slower than RI is from an internal benchmark test result. In this benchmark, Mather.find() is relatively hot and slower than RI. Then I write a simple test to verify it: import java.util.regex.Matcher; import java.util.regex.Pattern; public class MatcherFind { public static void main(String[] args) { Pattern pat = Pattern.compile("(\\d{1,3})"); Matcher matcher = pat.matcher("aaaa123456789045"); //warm up for (int i = 0; i <1000000; i++) { while(matcher.find()) { } matcher.reset(); } long time = System.currentTimeMillis(); for (int i = 0; i <10000000; i++) { while(matcher.find()) { } matcher.reset(); } time = System.currentTimeMillis() - time; System.out.println("total time of MatcherFind is " + time + " ms!"); } The results are harmony takes 9841 -10079 ms meanwhile RI takes 5596 - 5877 ms. In volano benchmark, Matcher(Pattern, CharSequence) and Mather.group(int) are also slower than RI, although they are not hot methods. My simple benchmarks for these 2 methods also approve this. 1. Matcher(Pattern, CharSequence) import java.util.regex.Matcher; import java.util.regex.Pattern; public class MatcherConstruct { public static void main(String[] args) { Pattern pat = Pattern.compile("XX"); Matcher matcher = null; //warmup phase for (int i = 0; i < 100000; i++) { matcher = pat.matcher("Today is XX-XX-XX ..."); } long time = System.currentTimeMillis(); for (int i = 0; i < 1000000; i++) { matcher = pat.matcher("Today is XX-XX-XX ..."); } time = System.currentTimeMillis() - time; System.out.println("Total time of MatcherConstruct is " + time + " ms!"); } } 2. Matcher.groun(int) import java.util.regex.Matcher; import java.util.regex.Pattern; public class MatcherGroupInt { public static void main(String[] args) { String testPattern = "(\\d{1,3})"; String testString = "aaaa123456789045"; Pattern pat = Pattern.compile(testPattern); Matcher matcher = pat.matcher(testString); //warm up for (int i = 0; i < 500000; i++) { while (matcher.find()) { String result = matcher.group(1); } matcher.reset(); } long time = System.currentTimeMillis(); for (int i = 0; i < 1000000; i++) { while (matcher.find()) { String result = matcher.group(1); } matcher.reset(); } time = System.currentTimeMillis() - time; System.out.println("total time of MatcherGroupInt is " + time + " ms!"); } } > one thing we want to do is use http://code.google.com/p/caliper/ > benchmarks to track the relative performance of Java implementations > of the stuff that's currently native code (but doesn't inherently need > to be). we don't have anything like that for regular expressions yet. > (our tool for conveniently running benchmarks is > http://code.google.com/p/vogar/ and you can find some of our actual > benchmarks at http://code.google.com/p/dalvik/.) > > > > I heard Android use icu4jni re-implement this > > > module. Since icu4jni use native code I think it may has higher > performance > > > than harmony. I am trying to use icu4jni as the back-end of harmony > regex > > > but find icu4jni has no functions related to regex operations. > > although, as jesse said, regular expressions are a special case where > there was no original icu4jni implementation, icu4jni is a dead > project (http://site.icu-project.org/#TOC-ICU4JNI). Android used the > icu4jni code, but it's been gradually rewritten since then. the > process is incomplete, and the extent of the rewrite varies from class > to class, but you're probably better off taking code from Android > rather than icu4jni. > > > > I know there are some android guys in our community. So can anyone tell > me > > > some detail info for android's regex, like if it re-implement the regex > > > logic using native code by android itself rather than icu4jni and > really get > > > higher performance compared to harmony regex? Thanks a lot! > > > > We actually use icu4c's pattern, which we access via JNI. Documentation > is > > here: > > http://icu-project.org/apiref/icu4c/classRegexPattern.html > > > > Our regex-java integration code is Apache-licensed and available in our > Git > > repository here: > > > > > http://android.git.kernel.org/?p=platform/libcore.git;a=tree;f=regex/src/main/java/java/util/regex;h=f9207d043a24d3dc034bf2cc3a5fdd57115692c3;hb=master > > actually, in shipping versions of Android we use the C API rather than > the C++ API. see > > http://android.git.kernel.org/?p=platform/libcore.git;a=blob;f=icu/src/main/native/NativeRegEx.cpp;h=7b3cafc0779ef7f3ed69a32128af7f96e3498922;hb=master > for the gory details. > > i've recently rewritten this to use the C++ APIs, and it's much > cleaner for it. if you look at the current code you'll see that we > copy the input to the native heap, because icu4c < 4.6 can't cope with > its input moving about between calls (and a VM can't necessarily > guarantee the Java char[] doesn't move around). > > in future, we'll use a new icu4c API to get the best of both worlds: > > > http://sourceforge.net/mailarchive/forum.php?thread_name=aanlktim7cztafnebqm8to3lhddingvs3pshpw79xb...@mail.gmail.com&forum_name=icu-design > > if you're shipping your own copy of icu (rather than using some system > one you have no control over), you might want to backport that change > (which has been submitted). > > one final thing to be aware of is that icu4c's regular expression > implementation isn't 100% compatible with the RI. most obviously, > there's no support for CANON_EQ, but there are a variety of more > subtle incompatibilities too. > > -- > Elliott Hughes - http://who/enh - http://jessies.org/~enh/ >