Re: NSRegularExpression segfault

2014-12-15 Thread ecir hana
On Mon, Dec 15, 2014 at 8:04 PM, Jens Alfke  wrote:
>
>
> Anyway, please file a bug report with Apple, as this seems to be a bug in
> NSRegularExpression.
>

Done!

> I'm no expert on regular expressions, but my understanding is that all the
> implementations have a couple of pathological conditions where for certain
> expressions the memory usage goes up rapidly with input length, and if you
> run into one of those you'll need to adjust your expression to work around
> it.
>

I'm also no expert on regexps but I thought that the pathological
conditions are met when using back-references (which I don't use). In any
case, I tried it in JavaScript and it works up to 10 000 000 characters (I
didn't try more).

var pattern = /(1+)|(2+)|(3+)|(4+)|(5+)|(6+)|(7+)|(8+)|(9+)|(0+)|(a+)/;
var result = pattern.exec(new Array(1001).join("a"));
console.log(result.index, result[0].length);
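
A minimal sketch of how the JavaScript check above could be repeated over several input lengths. Only the pattern comes from the message; the helper name matchLength and the chosen lengths are assumptions made for illustration.

```javascript
// Pattern copied from the message above.
const pattern = /(1+)|(2+)|(3+)|(4+)|(5+)|(6+)|(7+)|(8+)|(9+)|(0+)|(a+)/;

// Hypothetical helper: length of the first match in a run of n "a" characters,
// or 0 when nothing matches (for example when n is 0).
function matchLength(n) {
  const result = pattern.exec("a".repeat(n));
  return result === null ? 0 : result[0].length;
}

// Lengths chosen to bracket the sizes mentioned in this thread.
for (const n of [1000, 50000, 80000]) {
  console.log(n, matchLength(n));
}
```

If the engine handles the input, each printed pair should show the match length equal to the input length.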

Thanks again for all the replies!
___

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
https://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com

This email sent to arch...@mail-archive.com

Re: NSRegularExpression segfault

2014-12-15 Thread Jens Alfke

> On Dec 15, 2014, at 8:56 AM, ecir hana  wrote:
> 
> I don't seem to have "bt", but instead I got this from the "Problem Report" window:

Sounds like you're not running the app in Xcode. If you're hitting a crash or 
other bug you should always try to reproduce it while running the app in Xcode, 
so the debugger can kick in.

Anyway, please file a bug report with Apple, as this seems to be a bug in 
NSRegularExpression.

> I would have thought 50 000 characters is not that much. But then again, I
> don't really know how ICU works... Reading the ICU docs further, there it
> says:
> 
>> Because ICU does not use program recursion to maintain its backtracking
>> state, stack usage during matching operations is minimal, and does not
>> increase with complex patterns or large amounts of backtracking state.

It's not blowing up the stack. But it has to store backtracking state 
_somewhere_, and apparently it has a fixed-size buffer for this, which 
overflows while handling your regex because it's run into some pathological 
condition. Then, presumably, NSRegularExpression isn't noticing the error and 
ends up crashing by trying to operate on the nonexistent output.

I'm no expert on regular expressions, but my understanding is that all the 
implementations have a couple of pathological conditions where for certain 
expressions the memory usage goes up rapidly with input length, and if you run 
into one of those you'll need to adjust your expression to work around it.
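
The failure mode guessed at above can be illustrated with a toy model. This is not ICU's or NSRegularExpression's code: the limit, the function names, and the status strings are all hypothetical. It only shows a matcher that keeps backtracking state on a capped explicit stack, and what happens when a caller ignores the error status.

```javascript
// Toy model only: NOT ICU's implementation. The limit is a hypothetical value.
const STACK_LIMIT = 1000; // maximum number of saved backtracking frames

// Greedily matches one or more "a" at the start of text, saving one
// backtracking frame per consumed character on an explicit stack.
function matchRunOfA(text) {
  const stack = [];
  let pos = 0;
  while (pos < text.length && text[pos] === "a") {
    if (stack.length >= STACK_LIMIT) {
      // Report the overflow instead of carrying on with bad state.
      return { status: "stack-overflow", match: null };
    }
    stack.push(pos); // position to fall back to if a later step fails
    pos += 1;
  }
  if (pos === 0) return { status: "ok", match: null };
  return { status: "ok", match: { index: 0, length: pos } };
}

// A careful caller checks the status before touching the match.
function safeLength(text) {
  const { status, match } = matchRunOfA(text);
  if (status !== "ok") return -1;
  return match === null ? 0 : match.length;
}

// A careless caller assumes a match object always exists; on overflow this
// dereferences null and throws, loosely analogous to the crash discussed here.
function carelessLength(text) {
  return matchRunOfA(text).match.length;
}
```

In this model the input length at which the limit is hit depends only on how many frames fit under the cap, not on the process's stack.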

—Jens

Re: NSRegularExpression segfault

2014-12-15 Thread ecir hana
On Mon, Dec 15, 2014 at 6:59 PM, ecir hana  wrote:
>
> It's not precisely 11, but yes, I need to have many.
>

By "many" I mean more than 2 and less than...maybe 10.

Re: NSRegularExpression segfault

2014-12-15 Thread ecir hana
On Mon, Dec 15, 2014 at 6:50 PM, Stephen J. Butler  wrote:
>
> It seems to be related to the number of capture groups you have. For
> example, this also succeeds for me:
>
>
> ((?:1+)|(?:2+)|(?:3+)|(?:4+)|(?:5+)|(?:6+)|(?:7+)|(?:8+)|(?:9+)|(?:0+)|(?:a+))
>
> Do you really need the 11 capture groups?
>

Thanks for giving it a try!

It's not precisely 11, but yes, I need to have many. I also discovered it
is related to the number of capture groups and their "complexity". One
single group works for much longer inputs but I guess you already know that.

To put it differently, even if I could reduce the number of groups a bit, I
think I will still hit that segfault for some non-trivial inputs. For
example, when I only use 6 groups it still segfaults at around 80 000
characters.


Re: NSRegularExpression segfault

2014-12-15 Thread Stephen J. Butler
It seems to be related to the number of capture groups you have. For
example, this also succeeds for me:

((?:1+)|(?:2+)|(?:3+)|(?:4+)|(?:5+)|(?:6+)|(?:7+)|(?:8+)|(?:9+)|(?:0+)|(?:a+))

Do you really need the 11 capture groups?
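
A small sketch, in JavaScript rather than NSRegularExpression, comparing the two patterns from this thread. It shows that both find the same text while the second carries a single capture group. The helper name describe and the sample input are assumptions for illustration.

```javascript
// Both patterns are copied from the thread.
const capturing = /(1+)|(2+)|(3+)|(4+)|(5+)|(6+)|(7+)|(8+)|(9+)|(0+)|(a+)/;
const nonCapturing =
  /((?:1+)|(?:2+)|(?:3+)|(?:4+)|(?:5+)|(?:6+)|(?:7+)|(?:8+)|(?:9+)|(?:0+)|(?:a+))/;

// Hypothetical helper: summarizes the first match of re in text.
function describe(re, text) {
  const m = re.exec(text);
  if (m === null) return null;
  // m.length - 1 is the number of capture groups in the pattern.
  return { text: m[0], index: m.index, groups: m.length - 1 };
}

console.log(describe(capturing, "xx777a"));
console.log(describe(nonCapturing, "xx777a"));
```

The trade-off is that with one outer group the group index no longer says which alternative matched; the caller would have to inspect the matched text instead.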

On Mon, Dec 15, 2014 at 11:20 AM, ecir hana  wrote:
>
>
>
> On Mon, Dec 15, 2014 at 6:09 PM, Stephen J. Butler <
> stephen.but...@gmail.com> wrote:
>>
>> If you read the ICU docs on regular expressions you'll see that it sets
>> an 8MB limit on heap size when evaluating. My guess is that you've run into
>> this and NSRegularExpression misses a return code somewhere.
>>
>
> I would have thought 50 000 characters is not that much. But then again, I
> don't really know how ICU works... Reading the ICU docs further, there it
> says:
>
>> Because ICU does not use program recursion to maintain its backtracking
>> state, stack usage during matching operations is minimal, and does not
>> increase with complex patterns or large amounts of backtracking state.
>
>
>
>> But your pattern is really suboptimal for what you're trying to
>> accomplish.
>>
>
> It's really a reduced test case.
>

Re: NSRegularExpression segfault

2014-12-15 Thread Maxthon Chan
I ran into the same problem while trying to write a parser for a markup language. I 
would suggest using some mechanism to break this long input into smaller 
chunks to match. Regex is only good for lexical analysis, which deals only with 
smaller chunks; bigger blocks need to be handled using grammatical analysis.
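
One way the chunking idea could look, sketched in JavaScript rather than NSRegularExpression. The function name firstRunChunked and the chunkSize parameter are hypothetical. Each regex call is limited to chunkSize characters by a bounded quantifier, and plain code stitches the pieces together.

```javascript
// Hypothetical helper: finds the first run of a repeated character from the
// set [0-9a], consuming at most chunkSize characters per regex call.
function firstRunChunked(text, chunkSize) {
  const head = /[0-9a]/.exec(text);
  if (head === null) return null;
  const ch = head[0]; // a digit or "a", so no regex escaping is needed
  const start = head.index;
  let end = start;
  // Sticky ("y") regex with a bounded quantifier: it matches only at
  // lastIndex and never consumes more than chunkSize characters at once.
  const piece = new RegExp(ch + "{1," + chunkSize + "}", "y");
  while (true) {
    piece.lastIndex = end;
    const m = piece.exec(text);
    if (m === null) break; // the run has ended
    end += m[0].length;
  }
  return { index: start, length: end - start };
}

console.log(firstRunChunked("xxaaaab", 2));
```

The loop bounds the work done by any single regex call at the cost of extra bookkeeping in the caller.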

> On Dec 16, 2014, at 01:20, ecir hana  wrote:
> 
> On Mon, Dec 15, 2014 at 6:09 PM, Stephen J. Butler wrote:
>> 
>> If you read the ICU docs on regular expressions you'll see that it sets an
>> 8MB limit on heap size when evaluating. My guess is that you've run into
>> this and NSRegularExpression misses a return code somewhere.
>> 
> 
> I would have thought 50 000 characters is not that much. But then again, I
> don't really know how ICU works... Reading the ICU docs further, there it
> says:
> 
>> Because ICU does not use program recursion to maintain its backtracking
>> state, stack usage during matching operations is minimal, and does not
>> increase with complex patterns or large amounts of backtracking state.
> 
> 
> 
>> But your pattern is really suboptimal for what you're trying to accomplish.
>> 
> 
> It's really a reduced test case.

Re: NSRegularExpression segfault

2014-12-15 Thread ecir hana
On Mon, Dec 15, 2014 at 6:09 PM, Stephen J. Butler  wrote:
>
> If you read the ICU docs on regular expressions you'll see that it sets an
> 8MB limit on heap size when evaluating. My guess is that you've run into
> this and NSRegularExpression misses a return code somewhere.
>

I would have thought 50 000 characters is not that much. But then again, I
don't really know how ICU works... Reading the ICU docs further, there it
says:

> Because ICU does not use program recursion to maintain its backtracking
> state, stack usage during matching operations is minimal, and does not
> increase with complex patterns or large amounts of backtracking state.



> But your pattern is really suboptimal for what you're trying to accomplish.
>

It's really a reduced test case.

Re: NSRegularExpression segfault

2014-12-15 Thread Stephen J. Butler
If you read the ICU docs on regular expressions you'll see that it sets an
8MB limit on heap size when evaluating. My guess is that you've run into
this and NSRegularExpression misses a return code somewhere.

But your pattern is really suboptimal for what you're trying to accomplish.
For example, this pattern is functionally equivalent and doesn't cause a
crash for me:

@"(([0-9a])\\2*)"

That matches a single character, and then the same character 0 or more
times.
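
A quick sketch of how the back-reference pattern above behaves, written as a JavaScript literal instead of an NSString. The helper name firstRun and the sample input are assumptions for illustration.

```javascript
// The pattern from the message above; a regex literal needs no string escaping.
const runPattern = /(([0-9a])\2*)/;

// Hypothetical helper: returns the whole run (group 1), the repeated
// character (group 2), and the position where the run starts.
function firstRun(text) {
  const m = runPattern.exec(text);
  if (m === null) return null;
  return { run: m[1], character: m[2], index: m.index };
}

console.log(firstRun("xx1111ab"));
```

Unlike the original eleven-group pattern, the matched alternative is identified by the contents of group 2 rather than by which group number participated.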

On Mon, Dec 15, 2014 at 10:38 AM, ecir hana  wrote:
>
> Hi!
>
> I recently needed to match some patterns but I encountered a problematic
> situation.
>
> Please, can anyone explain to me why the following program
> consistently segfaults after 5 characters? I'm running 10.9.5...
>
> #import <Foundation/Foundation.h>
>
> int main () {
>     NSString *pattern =
>         @"(1+)|(2+)|(3+)|(4+)|(5+)|(6+)|(7+)|(8+)|(9+)|(0+)|(a+)";
>     NSRegularExpression *expression = [NSRegularExpression
>         regularExpressionWithPattern:pattern options:0 error:nil];
>     for (NSUInteger i = 0; i < 10; i += 1) {
>         NSString *string = [@"" stringByPaddingToLength:i withString:@"a"
>             startingAtIndex:0];
>         NSTextCheckingResult *result = [expression
>             firstMatchInString:string options:0 range:NSMakeRange(0, i)];
>         NSLog(@"%@", NSStringFromRange([result range]));
>     }
>     return 0;
> }
>
> It says:
>
> {0, 0}
> {0, 1}
> {0, 2}
> {0, 3}
> {0, 4}
> {0, 5}
> Segmentation fault: 11
>
> Thanks in advance!

Re: NSRegularExpression segfault

2014-12-15 Thread ecir hana
On Mon, Dec 15, 2014 at 5:44 PM, Jens Alfke  wrote:
>
>
> What's the backtrace? Whenever investigating a crash you should always
> start by looking at the stack. (To get this in a text form you can paste
> into a message, use the debugger command "bt".)
>

I don't seem to have "bt", but instead I got this from the "Problem Report" window:

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libicucore.A.dylib    0x7fff8f301ac1 utext_clone + 22
1   libicucore.A.dylib    0x7fff8f302570 icu::RegexMatcher::reset(UText*) + 54
2   libicucore.A.dylib    0x7fff8f305f01 uregex_setText + 223
3   com.apple.Foundation  0x7fff8aba343f -[NSRegularExpression(NSMatching) enumerateMatchesInString:options:range:usingBlock:] + 938
4   com.apple.Foundation  0x7fff8ac13ba4 -[NSRegularExpression(NSMatching) firstMatchInString:options:range:] + 145

Re: NSRegularExpression segfault

2014-12-15 Thread Jens Alfke

> On Dec 15, 2014, at 8:38 AM, ecir hana  wrote:
> 
> Please, can anyone explain to me why the following program
> consistently segfaults after 5 characters? I'm running 10.9.5…

What's the backtrace? Whenever investigating a crash you should always start by 
looking at the stack. (To get this in a text form you can paste into a message, 
use the debugger command "bt".)

My first guess was that you've hit one of those cases where certain regular 
expressions can take huge amounts of memory to match on long strings, but as 
you're running on OS X you've presumably got a 64-bit process, so that 
shouldn't be an issue.

—Jens