DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT <http://issues.apache.org/bugzilla/show_bug.cgi?id=37382>. ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=37382

           Summary: stack over flow while using a Regex
           Product: ORO
           Version: 2.0.7
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main
        AssignedTo: [email protected]
        ReportedBy: [EMAIL PROTECTED]
                CC: [EMAIL PROTECTED]


Hi,

I am using the ORO Regex API, version 2.0.7, and my objective is to extract
some tagged data from HTML source. For example, I am interested in getting the
source code of all the forms found in an HTML page. So I wrote my regex like
this:

Regex formReg = new Regex("(?i)(<form(.|\\s)*?>(.|\\s)*?</form>)");

because the following one didn't work:

Regex formReg = new Regex("(?i)(<form.*?>.*?</form>)");

since . matches any character except newline.

My first regex worked well, and I was able to get the complete form data
starting from <form..... to </form>. BUT when the form was big, say around 400
lines and 30K bytes, it failed with a stack overflow. I am pasting the program
output and the stack overflow error below:

Matched <form name="param" action="http://www/parametric/ProductParametric" method="post">
<input name="sterm" type="hidden">
</form>
matcher.getMatch().endOffset(1) 4480
Matched <form name="cross" action="http://www/crossref/search.jsp" method="post">
<input name="partNumber" type="hidden">
</form>
matcher.getMatch().endOffset(1) 127
java.lang.StackOverflowError
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        [the same frame repeats for the rest of the pasted trace]

I am also pasting the method I wrote for the extraction. It can simply be
called from a main method and run:

----------------------------------------------------------------------------
public static void testRegOro() {
    try {
        String html = IoUtils.readFile("file.txt");
        // String html = "all work and no play makes jack a dull boy"; //IoUtils.readFile("file.txt");
        Perl5Compiler compiler = new Perl5Compiler();
        Perl5Pattern pattern = (Perl5Pattern) compiler.compile(
                "(<form(.|\\s)*?>(.|\\s)*?</form>)",
                Perl5Compiler.CASE_INSENSITIVE_MASK | Perl5Compiler.READ_ONLY_MASK);
        PatternMatcher matcher = new Perl5Matcher();
        int i = 0;
        while (matcher.contains(html, pattern) && i++ < 3) {
            System.out.println("Matched " + matcher.getMatch().group(1));
            System.out.println("matcher.getMatch().endOffset(1) "
                    + matcher.getMatch().endOffset(1));
            html = html.substring(matcher.getMatch().endOffset(1));
            //System.out.println("html " + html);
        }
    } catch (Throwable e) {
        e.printStackTrace();
    }
}
------------------------------------------------------------------------------

As my code shows, I am reading the input from file.txt; I am attaching that
file to the bug as well.

I would really appreciate it if you could look into this, shed some light on
it, and tell me whether it can be improved.

Thanks in advance!

Regards,
Pushpesh Kr. Rajwanshi

-- 
Configure bugmail: http://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
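
The long run of identical __match frames suggests that the overflow most likely comes from the quantified alternation (.|\\s)*? in the pattern: a backtracking matcher typically recurses once per character consumed by such a group, so a 30K form can exhaust the thread stack. A common way around this is to let . match newlines directly and to use a negated character class inside the opening tag, so that no alternation is repeated. The sketch below illustrates that idea with the JDK's java.util.regex package rather than ORO, so that it runs without extra jars; the class name FormExtractor and the method extractForms are hypothetical names chosen for this example. In ORO itself, compiling the pattern with Perl5Compiler.SINGLELINE_MASK should give . the same newline-matching behaviour, although that has not been verified against the attached file.txt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of ORO or of the reporter's code.
final class FormExtractor {

    // DOTALL lets '.' match line terminators, so no (.|\s) alternation is needed.
    // [^>]* consumes the attributes of the opening tag without alternation.
    // Assumption: no attribute value inside the <form ...> tag contains '>'.
    private static final Pattern FORM = Pattern.compile(
            "<form\\b[^>]*>.*?</form>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the source text of every <form>...</form> block, in document order.
    static List<String> extractForms(String html) {
        List<String> forms = new ArrayList<>();
        Matcher m = FORM.matcher(html);
        // find() resumes after the previous match, so the input does not
        // have to be re-sliced with substring() on every iteration.
        while (m.find()) {
            forms.add(m.group());
        }
        return forms;
    }

    public static void main(String[] args) {
        String html = "<html>\n<FORM name=\"a\">\n<input name=\"x\">\n</FORM>\n"
                + "<p>text</p><form name=\"b\"></form></html>";
        for (String form : extractForms(html)) {
            System.out.println("Matched " + form);
        }
    }
}
```

Like the original pattern, this one is lazy, so it stops at the first </form> after each opening tag. Regular expressions remain a fragile tool for HTML in general (comments or scripts that contain the text </form> would confuse it), so an HTML parser may be the more robust choice for arbitrary pages.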
