Re: NSRegularExpression segfault
On Mon, Dec 15, 2014 at 8:04 PM, Jens Alfke wrote:
>
> Anyway, please file a bug report with Apple, as this seems to be a bug in
> NSRegularExpression.
>

Done!

> I'm no expert on regular expressions, but my understanding is that all the
> implementations have a couple of pathological conditions where for certain
> expressions the memory usage goes up rapidly with input length, and if you
> run into one of those you'll need to adjust your expression to work around
> it.
>

I'm also no expert on regexps but I thought that the pathological conditions are met when using back-references (which I don't use).

In any case, I tried it in JavaScript and it works up to 10 000 000 characters (I didn't try more).

    var pattern = /(1+)|(2+)|(3+)|(4+)|(5+)|(6+)|(7+)|(8+)|(9+)|(0+)|(a+)/;
    var result = pattern.exec(new Array(1001).join("a"));
    console.log(result.index, result[0].length);

Thanks again for all the replies!
___
Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
https://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com
This email sent to arch...@mail-archive.com
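A minimal sketch of how the JavaScript check in the message above could be repeated at several input lengths. The helper name `matchRunOfA` and the chosen lengths are assumptions for illustration; they are not from the original message, and the result says nothing about how NSRegularExpression or ICU behave.

```javascript
// Same alternation as in the message above: eleven capture groups.
const pattern = /(1+)|(2+)|(3+)|(4+)|(5+)|(6+)|(7+)|(8+)|(9+)|(0+)|(a+)/;

// Hypothetical helper: match the pattern against `length` copies of "a"
// and report where the match starts and how long it is.
// `length` must be at least 1, otherwise exec() returns null.
function matchRunOfA(length) {
  const input = "a".repeat(length);
  const result = pattern.exec(input);
  return { index: result.index, length: result[0].length };
}

// Try a few sizes; the sizes are arbitrary.
for (const length of [1000, 10000, 100000]) {
  const { index, length: matched } = matchRunOfA(length);
  console.log(length, index, matched);
}
```

If the engine handles the input, each line should report a match starting at index 0 that covers the whole run.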
Re: NSRegularExpression segfault
> On Dec 15, 2014, at 8:56 AM, ecir hana wrote:
>
> I don't seem to have "bt" but I got instead from "Problem Report" window:

Sounds like you're not running the app in Xcode. If you're hitting a crash or other bug you should always try to reproduce it while running the app in Xcode, so the debugger can kick in.

Anyway, please file a bug report with Apple, as this seems to be a bug in NSRegularExpression.

> I would have thought 50 000 characters is not that much. But then again, I
> don't really know how ICU works... Reading the ICU docs further, there it
> says:
>
>> Because ICU does not use program recursion to maintain its backtracking
>> state, stack usage during matching operations is minimal, and does not
>> increase with complex patterns or large amounts of backtracking state.

It's not blowing up the stack. But it has to store backtracking state _somewhere_, and apparently it has a fixed-size buffer for this, which overflows while handling your regex because it's run into some pathological condition. Then, presumably, NSRegularExpression isn't noticing the error and ends up crashing by trying to operate on the nonexistent output.

I'm no expert on regular expressions, but my understanding is that all the implementations have a couple of pathological conditions where for certain expressions the memory usage goes up rapidly with input length, and if you run into one of those you'll need to adjust your expression to work around it.

--Jens
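The idea of backtracking state kept in a bounded buffer rather than on the program stack can be sketched with a toy model. This is not ICU's implementation: `matchRun`, the frame layout (two bookkeeping slots plus two slots per capture group) and the limit of 1000 slots are all assumptions made up for illustration. The point is only that the saved state grows with input length, that larger frames reach the limit sooner, and that the caller has to check for the error.

```javascript
// Toy model only: NOT ICU's code. All names and sizes are hypothetical.
// Simulates matching `ch+` with an explicit backtracking stack whose
// total size is capped at `limitSlots`.
function matchRun(input, ch, groupCount, limitSlots) {
  // Assumed frame layout: input index, pattern index, start/end per group.
  const frameSlots = 2 + 2 * groupCount;
  const stack = []; // explicit stack, so no program recursion is needed
  let pos = 0;
  while (pos < input.length && input[pos] === ch) {
    if ((stack.length + 1) * frameSlots > limitSlots) {
      // The engine reports the overflow instead of producing a match.
      return { ok: false, error: "backtracking stack limit exceeded" };
    }
    stack.push(pos); // saved state: "the loop could also stop here"
    pos += 1;
  }
  if (pos === 0) return { ok: false, error: "no match" };
  return { ok: true, length: pos, framesUsed: stack.length };
}

const limit = 1000;
console.log(matchRun("a".repeat(30), "a", 11, limit)); // 30 frames of 24 slots fit
console.log(matchRun("a".repeat(50), "a", 11, limit)); // 50 frames of 24 slots do not
console.log(matchRun("a".repeat(50), "a", 6, limit));  // 50 frames of 14 slots fit
```

A caller that ignored the `ok` flag and used `length` anyway would be operating on a result that does not exist, which is the kind of missed error suspected in the message above.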
Re: NSRegularExpression segfault
On Mon, Dec 15, 2014 at 6:59 PM, ecir hana wrote:
>
> It's not precisely 11, but yes, I need to have many.
>

By "many" I mean more than 2 and less than...maybe 10.
Re: NSRegularExpression segfault
On Mon, Dec 15, 2014 at 6:50 PM, Stephen J. Butler wrote:
>
> It seems to be related to the number of capture groups you have. For
> example, this also succeeds for me:
>
> ((?:1+)|(?:2+)|(?:3+)|(?:4+)|(?:5+)|(?:6+)|(?:7+)|(?:8+)|(?:9+)|(?:0+)|(?:a+))
>
> Do you really need the 11 capture groups?
>

Thanks for giving it a try!

It's not precisely 11, but yes, I need to have many. I also discovered it is related to the number of capture groups and their "complexity". One single group works for much longer inputs but I guess you already know that.

To put it differently, even if I could reduce the number of groups a bit, I think I will still hit that segfault for some non-trivial inputs. For example, when I only use 6 groups it still segfaults at around 80 000 characters.
Re: NSRegularExpression segfault
It seems to be related to the number of capture groups you have. For example, this also succeeds for me:

    ((?:1+)|(?:2+)|(?:3+)|(?:4+)|(?:5+)|(?:6+)|(?:7+)|(?:8+)|(?:9+)|(?:0+)|(?:a+))

Do you really need the 11 capture groups?

On Mon, Dec 15, 2014 at 11:20 AM, ecir hana wrote:
>
> On Mon, Dec 15, 2014 at 6:09 PM, Stephen J. Butler <
> stephen.but...@gmail.com> wrote:
>>
>> If you read the ICU docs on regular expressions you'll see that it sets
>> an 8MB limit on heap size when evaluating. My guess is that you've run into
>> this and NSRegularExpression misses a return code somewhere.
>>
>
> I would have thought 50 000 characters is not that much. But then again, I
> don't really know how ICU works... Reading the ICU docs further, there it
> says:
>
>     Because ICU does not use program recursion to maintain its backtracking
>     state, stack usage during matching operations is minimal, and does not
>     increase with complex patterns or large amounts of backtracking state.
>
>> But your pattern is really suboptimal for what you're trying to
>> accomplish.
>>
>
> It's really a reduced test case.
>
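A sketch of how the single-capture-group variant above might still tell the caller which alternative matched, written in JavaScript for easy experimentation. The `alternatives` table and the `matchedAlternative` helper are assumptions for illustration. The trick relies on every alternative in this reduced test case starting with a distinct character, which may not hold for the real patterns behind it.

```javascript
// The original pattern (eleven capture groups) and the variant with one
// capture group wrapped around non-capturing alternatives.
const manyGroups = /(1+)|(2+)|(3+)|(4+)|(5+)|(6+)|(7+)|(8+)|(9+)|(0+)|(a+)/;
const oneGroup = /((?:1+)|(?:2+)|(?:3+)|(?:4+)|(?:5+)|(?:6+)|(?:7+)|(?:8+)|(?:9+)|(?:0+)|(?:a+))/;

// Hypothetical lookup table: alternatives in the order they appear.
const alternatives = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "0", "a"];

// Hypothetical helper: recover the 1-based number of the alternative that
// matched by looking at the first character of the matched text.
function matchedAlternative(input) {
  const m = oneGroup.exec(input);
  if (m === null) return null;
  return { index: m.index, text: m[1], group: alternatives.indexOf(m[1][0]) + 1 };
}

const sample = "xx777a";
const viaGroups = manyGroups.exec(sample);
const viaLookup = matchedAlternative(sample);
console.log(viaGroups[7], viaLookup.text, viaLookup.group);
```

Both approaches find the same run here; only the bookkeeping for "which alternative" moves out of the regex and into ordinary code.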
Re: NSRegularExpression segfault
I ran into the same problem trying to write a parser for a markup language. I would suggest using some mechanism to break the long input into smaller chunks to match. Regex is only good for lexical analysis, which only deals with smaller chunks; bigger blocks need to be handled using grammatical analysis.

> On Dec 16, 2014, at 01:20, ecir hana wrote:
>
> On Mon, Dec 15, 2014 at 6:09 PM, Stephen J. Butler
> wrote:
>>
>> If you read the ICU docs on regular expressions you'll see that it sets an
>> 8MB limit on heap size when evaluating. My guess is that you've run into
>> this and NSRegularExpression misses a return code somewhere.
>>
>
> I would have thought 50 000 characters is not that much. But then again, I
> don't really know how ICU works... Reading the ICU docs further, there it
> says:
>
>     Because ICU does not use program recursion to maintain its backtracking
>     state, stack usage during matching operations is minimal, and does not
>     increase with complex patterns or large amounts of backtracking state.
>
>> But your pattern is really suboptimal for what you're trying to accomplish.
>>
>
> It's really a reduced test case.
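The chunking suggestion above could look roughly like this. `matchesInChunks` and the chunk size are assumptions for illustration, and the sketch has a known limitation: a run that crosses a chunk boundary is reported as two separate matches, so a real lexer would need to stitch neighbouring matches together or choose boundaries more carefully.

```javascript
// Hypothetical helper: run the pattern over bounded slices of the input so
// that no single match call sees more than `chunkSize` characters.
// Returned indices are relative to the full text.
function matchesInChunks(source, text, chunkSize) {
  const matches = [];
  for (let offset = 0; offset < text.length; offset += chunkSize) {
    const chunk = text.slice(offset, offset + chunkSize);
    // A fresh global regex per chunk avoids carrying lastIndex across chunks.
    for (const m of chunk.matchAll(new RegExp(source, "g"))) {
      matches.push({ index: offset + m.index, text: m[0] });
    }
  }
  return matches;
}

// Five "a"s, a separator, three "1"s, scanned four characters at a time.
const text = "a".repeat(5) + "b" + "1".repeat(3);
console.log(matchesInChunks("(1+)|(a+)", text, 4));
```

With this input the run of "a"s and the run of "1"s are each split at a chunk boundary, which shows the stitching problem directly.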
Re: NSRegularExpression segfault
On Mon, Dec 15, 2014 at 6:09 PM, Stephen J. Butler wrote:
>
> If you read the ICU docs on regular expressions you'll see that it sets an
> 8MB limit on heap size when evaluating. My guess is that you've run into
> this and NSRegularExpression misses a return code somewhere.
>

I would have thought 50 000 characters is not that much. But then again, I don't really know how ICU works... Reading the ICU docs further, there it says:

    Because ICU does not use program recursion to maintain its backtracking
    state, stack usage during matching operations is minimal, and does not
    increase with complex patterns or large amounts of backtracking state.

> But your pattern is really suboptimal for what you're trying to accomplish.
>

It's really a reduced test case.
Re: NSRegularExpression segfault
If you read the ICU docs on regular expressions you'll see that it sets an 8MB limit on heap size when evaluating. My guess is that you've run into this and NSRegularExpression misses a return code somewhere.

But your pattern is really suboptimal for what you're trying to accomplish. For example, this pattern is functionally equivalent and doesn't cause a crash for me:

    @"(([0-9a])\\2*)"

That matches a single character, and then the same character 0 or more times.

On Mon, Dec 15, 2014 at 10:38 AM, ecir hana wrote:
>
> Hi!
>
> I recently needed to match some patterns but I encountered a problematic
> situation.
>
> Please, can anyone explain to me why does the following program
> consistently segfault after 5 characters? I'm running 10.9.5...
>
> #import <Foundation/Foundation.h>
>
> int main () {
>     NSString *pattern =
>         @"(1+)|(2+)|(3+)|(4+)|(5+)|(6+)|(7+)|(8+)|(9+)|(0+)|(a+)";
>     NSRegularExpression *expression = [NSRegularExpression
>         regularExpressionWithPattern:pattern options:0 error:nil];
>     for (NSUInteger i = 0; i < 10; i += 1) {
>         NSString *string = [@"" stringByPaddingToLength:i withString:@"a"
>             startingAtIndex:0];
>         NSTextCheckingResult *result = [expression
>             firstMatchInString:string options:0 range:NSMakeRange(0, i)];
>         NSLog(@"%@", NSStringFromRange([result range]));
>     }
>     return 0;
> }
>
> It says:
>
> {0, 0}
> {0, 1}
> {0, 2}
> {0, 3}
> {0, 4}
> {0, 5}
> Segmentation fault: 11
>
> Thanks in advance!
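A quick way to check the claim that the back-reference pattern finds the same runs as the eleven-way alternation, sketched in JavaScript rather than Objective-C. The `firstRun` helper and the sample strings are assumptions for illustration. The comparison covers only the position and text of the first match; the capture group numbering differs between the two patterns.

```javascript
const alternation = /(1+)|(2+)|(3+)|(4+)|(5+)|(6+)|(7+)|(8+)|(9+)|(0+)|(a+)/;
// One character from the set, then the same character zero or more times.
const backref = /(([0-9a])\2*)/;

// Hypothetical helper: position and text of the first match, or null.
function firstRun(re, input) {
  const m = re.exec(input);
  return m === null ? null : { index: m.index, text: m[0] };
}

// Arbitrary sample inputs, including one with no match at all.
for (const sample of ["aaaa", "xx777a", "1122", "zzz"]) {
  const a = firstRun(alternation, sample);
  const b = firstRun(backref, sample);
  console.log(sample, JSON.stringify(a) === JSON.stringify(b));
}
```

With the back-reference form, the matched character is available in group 2, which may be enough to replace the per-alternative groups in simple cases like this reduced test.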
Re: NSRegularExpression segfault
On Mon, Dec 15, 2014 at 5:44 PM, Jens Alfke wrote:
>
> What's the backtrace? Whenever investigating a crash you should always
> start by looking at the stack. (To get this in a text form you can paste
> into a message, use the debugger command "bt".)
>

I don't seem to have "bt" but I got instead from "Problem Report" window:

    Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
    0   libicucore.A.dylib    0x7fff8f301ac1 utext_clone + 22
    1   libicucore.A.dylib    0x7fff8f302570 icu::RegexMatcher::reset(UText*) + 54
    2   libicucore.A.dylib    0x7fff8f305f01 uregex_setText + 223
    3   com.apple.Foundation  0x7fff8aba343f -[NSRegularExpression(NSMatching) enumerateMatchesInString:options:range:usingBlock:] + 938
    4   com.apple.Foundation  0x7fff8ac13ba4 -[NSRegularExpression(NSMatching) firstMatchInString:options:range:] + 145
Re: NSRegularExpression segfault
> On Dec 15, 2014, at 8:38 AM, ecir hana wrote:
>
> Please, can anyone explain to me why does the following program
> consistently segfault after 5 characters? I'm running 10.9.5…

What's the backtrace? Whenever investigating a crash you should always start by looking at the stack. (To get this in a text form you can paste into a message, use the debugger command "bt".)

My first guess was that you've hit one of those cases where certain regular expressions can take huge amounts of memory to match on long strings, but as you're running on OS X you've presumably got a 64-bit process, so that shouldn't be an issue.

--Jens