Re: How to print all expressions that match a regular expression

2010-02-08 Thread Gabriel Genellina

En Mon, 08 Feb 2010 02:17:59 -0300, hzh...@gmail.com hzh...@gmail.com
escribió:

Please check out this example on the pyparsing wiki,  
invRegex.py:http://pyparsing.wikispaces.com/file/view/invRegex.py.  
 This code

implements a generator that returns successive matching strings for
the given regex.  [...]
Of course, as other posters have pointed out, this inverter does not
accept regexen with unbounded multiple characters '+' or '*', but '?'
and {min,max} notation will work.  Even '.' is supported, although
this can generate a large number of return values.


Thanks very much. This is exactly what I need now. I will check this
function.



Here you have another approach based on [1]. This is a generator-based
approach, yielding all strings in increasing length order. In principle it
can handle unbounded repetitions, except as written the maximum recursion
limit is shortly reached (the original code is in Haskell, I almost
blindly translated it into Python; certainly it can be rewritten more
efficiently)

You have to parse the R.E. and generate the corresponding function calls
to the merge/prod/closure functions -- pyparsing certainly can help with
that. ab becomes prod(a,b), a|b becomes merge(a,b), and a* becomes
closure(a)

By example, to find the language defined by this expression (a|bc)*d,
one has to evaluate:

prod(
closure(
  merge(['a'],
prod(['b'],['c']))),
['d']
)

wich yields these strings:

d
ad
aad
bcd
aaad
abcd
bcad
...

bcbcbcbcaad
bcbcbcbcbcd
aaad

and after 234 results aborts with a recursion error :(

[1] http://www.cs.utexas.edu/users/misra/Notes.dir/RegExp.pdf

--
Gabriel Genellina

enumerate_regular_language.py
Description: Binary data
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-08 Thread Gabriel Genellina

En Mon, 08 Feb 2010 02:17:59 -0300, hzh...@gmail.com hzh...@gmail.com
escribió:

Please check out this example on the pyparsing wiki,  
invRegex.py:http://pyparsing.wikispaces.com/file/view/invRegex.py.  
 This code

implements a generator that returns successive matching strings for
the given regex.  [...]
Of course, as other posters have pointed out, this inverter does not
accept regexen with unbounded multiple characters '+' or '*', but '?'
and {min,max} notation will work.  Even '.' is supported, although
this can generate a large number of return values.


Thanks very much. This is exactly what I need now. I will check this
function.



Here you have another approach based on [1]. This is a generator-based
approach, yielding all strings in increasing length order. In principle it
can handle unbounded repetitions, except as written the maximum recursion
limit is shortly reached (the original code is in Haskell, I almost
blindly translated it into Python; certainly it can be rewritten more
efficiently)

You have to parse the R.E. and generate the corresponding function calls
to the merge/prod/closure functions -- pyparsing certainly can help with
that. ab becomes prod(a,b), a|b becomes merge(a,b), and a* becomes
closure(a)

By example, to find the language defined by this expression (a|bc)*d,
one has to evaluate:

prod(
closure(
  merge(['a'],
prod(['b'],['c']))),
['d']
)

wich yields these strings:

d
ad
aad
bcd
aaad
abcd
bcad
...

bcbcbcbcaad
bcbcbcbcbcd
aaad

and after 234 results aborts with a recursion error :(

[1] http://www.cs.utexas.edu/users/misra/Notes.dir/RegExp.pdf

--
Gabriel Genellina

enumerate_regular_language.py
Description: Binary data
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-07 Thread Steve Holden
hzh...@gmail.com wrote:
 So it seems we both misunderstood the problem.

 I didn't read the top level article until now, and reading it, I can't make
 sense of it.

 
 Seems that you should read the whole thing before making a post, or
 else you cannot know what we are talking about.
 Steven doesn't misunderstand me. We are talking about what I need, and
 he tries to help.
 
 
 
 Given the function hashlib.sha256, enumerate all the possible inputs
 that give the hexadecimal result
 0a2591aaf3340ad92faecbc5908e74d04b51ee5d2deee78f089f1607570e2e91.
 I tried some parrot variants but no dice. :-(

 [snip]

 
 This is a hash collision problem. Nobody has proved that SHA-256 is
 collision free, even not in the random oracle model, because people
 always suppose that a random oracle exists, and make hash function its
 substitution. That means it may be broken someday. And any provable
 security based on random oracle model is not secure.
 
It's very easy to prove that no hash function is collision-free, since
the domain (all possible inputs) is much larger than the range (all
possible outputs). Hence there must be many inputs that map to the same
output.

A *good* hash function is unpredictable enough to make finding two
colliding strings impractical - and even the best hash functions that
cryptographers could devise at the time have been broken. We should
remember that broken to a cryptographer means something rather
different than it does in common usage, so a broken scheme need not
necessarily be dropped immediately - one would just stop using it in new
systems.

 
 I'm suggesting that, in general, there's no way to tell in advance which
 regexes will be easy and which will be hard, and even when they are easy,
 the enumeration will often be infinite.
 
 It is hard to tell in advance. However, we can add some timing limit
 or counting limit, to make it an algorithm, which can halt. For
 example, whenever the program outputs more than 100 expressions
 that match the input regex, we can halt because that exceeds our
 limit. But surely this is not efficient because of the post-decision.
 
 Essentially, any regexp that includes '+' or '*' (directly or via e.g. 
 notation
 that denotes digit sequence) yields an infinite number of strings.
 
 Infinity is really relative, not absolute. It is relative to the
 computing speed. For example, the regex '^[0|1]{2048}$' is rather
 simple and doesn't contain '+' or '$', but trying to output all
 expressions that match it has a complexity of 2^2048. If we can do
 that, then we can break RSA-2048.
 We must face the reality .
 
I have always understood that there's a pretty real distinction between
finite and infinite. Are you telling me I am wrong, or are you
merely saying that some finite cases might just as well be infinite for
practical purposes?

And I really don't see how simple enumeration of range(2^2048) breaks
RSA-2048, since that problem requires you to find two factors which,
when multiplied together, give that specific value.

regards
 Steve
-- 
Steve Holden   +1 571 484 6266   +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010  http://us.pycon.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS:http://holdenweb.eventbrite.com/

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-07 Thread Tim Chase

hzh...@gmail.com wrote:

Given the function hashlib.sha256, enumerate all the possible inputs
that give the hexadecimal result
0a2591aaf3340ad92faecbc5908e74d04b51ee5d2deee78f089f1607570e2e91.


This is a hash collision problem. Nobody has proved that SHA-256 is
collision free


It's actually pretty easy to prove that it is *not* collision 
free.  The SHA-256 encodes 512 bits of data.  So the the process 
of encoding (2**512)+1 distinct inputs incurs a collision in 
SHA-256 space as soon as you've hit (2**512)+1 if not earlier.


to start you off:

  sha_backmap = {}
  for i in xrange((2**512)+2):
hash = sha(str(i))
if hash in sha_backmap:
  print Collision found: %i and %i % (
i, sha_backmap[hash])
break
sha_backmap[hash] = i

Though it might take a computer the size of the universe, so I'm 
guessing that the first collision encountered is with 42.  I 
leave the actual calculation and hashing of all possible 
combinations of 513 bits of data as an exercise to the reader 
with a lot of time on their hands or a quantum computer under 
their desk ;-)



It is hard to tell in advance. However, we can add some timing limit
or counting limit, to make it an algorithm, which can halt. For
example, whenever the program outputs more than 100 expressions
that match the input regex, we can halt because that exceeds our
limit. But surely this is not efficient because of the post-decision.


As mentioned, it sounds like you either want a depth-first of the 
solution space that raises exceptions on an infinite/unbounded 
operator (*, +, and {N,} as mentioned in another email), or 
if you want to handle those operators, do a breadth-first search 
of the solution-space and track your depth (or time taken, or 
previous number of multi-factor atoms if you desire) to ensure 
you don't exceed a certain depth.  But you're still talking a 
combinatorial number of solutions for even simple regexps.


-tkc


--
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-07 Thread hzh...@gmail.com


 And I really don't see how simple enumeration of range(2^2048) breaks
 RSA-2048, since that problem requires you to find two factors which,
 when multiplied together, give that specific value.


I can tell you why is that. RSA-2048 has a composite of length less
than 2^2048, which is a product of two large primes. So one of its
factors cannot exceed 2^2047, and we can treat the multiplication as a
computation with constant complexity. So the time complexity of
enumerating 2^2048 strings is the same with factoring a composite with
length 2^2048 which is the product of two primes.

And obviously, whenever we successfully factor the composite, we can
calculate the Euler function of it. So that given any public key
(n,e), calculating the private key (n,d) is easy.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-07 Thread Steve Holden
hzh...@gmail.com wrote:
 And I really don't see how simple enumeration of range(2^2048) breaks
 RSA-2048, since that problem requires you to find two factors which,
 when multiplied together, give that specific value.

 
 I can tell you why is that. RSA-2048 has a composite of length less
 than 2^2048, which is a product of two large primes. So one of its
 factors cannot exceed 2^2047, and we can treat the multiplication as a
 computation with constant complexity. So the time complexity of
 enumerating 2^2048 strings is the same with factoring a composite with
 length 2^2048 which is the product of two primes.
 
 And obviously, whenever we successfully factor the composite, we can
 calculate the Euler function of it. So that given any public key
 (n,e), calculating the private key (n,d) is easy.
 
So all I have to do to break RSA is to count to 2^2048?

regards
 Steve
-- 
Steve Holden   +1 571 484 6266   +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010  http://us.pycon.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS:http://holdenweb.eventbrite.com/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-07 Thread hzh...@gmail.com
That is a method called brute force. According to my computation,
2^2048=
32317006071311007300714876688669951960444102669715484032130345427524655138867890
89319720141152291346368871796092189801949411955915049092109508815238644828312063
08773673009960917501977503896521067960576383840675682767922186426197561618380943
3847617047058164585203630504288757589154106580860755239912393038552191489668
34242068497478656456949485617603532632205807780565933102619270846031415025859286
41771167259436037184618573575983511523016459044036976132332872312271256847108202
09725157101726931323469678542580656697935045997268352998638215525166389437335543
602135433229604645318478604952148193555853611059596230656L

which is a very large number.

There are some other algorithms for factoring integers, including
Generalized number field sieve. And in quantum computing, there is an
algorithm called Shor, which is claimed to be a polynomial algorithm
if run under quantum computers. But seems that kind of computers
haven't been successfully built, or else RSA and many other security
mechanisms based on computation complexity cannot be used any longer.

What I need in my application is just to list all expressions that
match a particular regex, which I believe will be more efficient to
deal with if there is a general function for this purpose.
Unfortunately there is not such a function, so I will write my own
function to deal with my particular regex, which can be enumerated.

Sincerely,

Zhuo

On Feb 7, 10:38 am, Steve Holden st...@holdenweb.com wrote:
 hzh...@gmail.com wrote:
  And I really don't see how simple enumeration of range(2^2048) breaks
  RSA-2048, since that problem requires you to find two factors which,
  when multiplied together, give that specific value.

  I can tell you why is that. RSA-2048 has a composite of length less
  than 2^2048, which is a product of two large primes. So one of its
  factors cannot exceed 2^2047, and we can treat the multiplication as a
  computation with constant complexity. So the time complexity of
  enumerating 2^2048 strings is the same with factoring a composite with
  length 2^2048 which is the product of two primes.

  And obviously, whenever we successfully factor the composite, we can
  calculate the Euler function of it. So that given any public key
  (n,e), calculating the private key (n,d) is easy.

 So all I have to do to break RSA is to count to 2^2048?

 regards
  Steve
 --
 Steve Holden           +1 571 484 6266   +1 800 494 3119
 PyCon is coming! Atlanta, Feb 2010  http://us.pycon.org/
 Holden Web LLC                http://www.holdenweb.com/
 UPCOMING EVENTS:        http://holdenweb.eventbrite.com/

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-07 Thread Steven D'Aprano
On Sun, 07 Feb 2010 03:53:49 +0100, Alf P. Steinbach wrote:

 Given the function hashlib.sha256, enumerate all the possible inputs
 that give the hexadecimal result
 0a2591aaf3340ad92faecbc5908e74d04b51ee5d2deee78f089f1607570e2e91.
 
 I tried some parrot variants but no dice. :-(

Oh, everybody expects parrots! That's not unexpected -- as a clue, I 
wrote that the message is predictable for being totally unexpected.

The input was Nobody expects the Spanish Inquisition!, which is another 
Monty Python catchphrase.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-07 Thread Steve Holden
Steven D'Aprano wrote:
 On Sun, 07 Feb 2010 03:53:49 +0100, Alf P. Steinbach wrote:
 
 Given the function hashlib.sha256, enumerate all the possible inputs
 that give the hexadecimal result
 0a2591aaf3340ad92faecbc5908e74d04b51ee5d2deee78f089f1607570e2e91.
 I tried some parrot variants but no dice. :-(
 
 Oh, everybody expects parrots! That's not unexpected -- as a clue, I 
 wrote that the message is predictable for being totally unexpected.
 
 The input was Nobody expects the Spanish Inquisition!, which is another 
 Monty Python catchphrase.
 
 
Bugger - Got everything except the trailing exclamation mark ...

regards
 Steve
-- 
Steve Holden   +1 571 484 6266   +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010  http://us.pycon.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS:http://holdenweb.eventbrite.com/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-07 Thread Steven D'Aprano
On Sun, 07 Feb 2010 20:19:53 -0500, Steve Holden wrote:

 Steven D'Aprano wrote:
 On Sun, 07 Feb 2010 03:53:49 +0100, Alf P. Steinbach wrote:
 
 Given the function hashlib.sha256, enumerate all the possible inputs
 that give the hexadecimal result
 0a2591aaf3340ad92faecbc5908e74d04b51ee5d2deee78f089f1607570e2e91.
 I tried some parrot variants but no dice. :-(
 
 Oh, everybody expects parrots! That's not unexpected -- as a clue, I
 wrote that the message is predictable for being totally unexpected.
 
 The input was Nobody expects the Spanish Inquisition!, which is
 another Monty Python catchphrase.
 
 
 Bugger - Got everything except the trailing exclamation mark ...


NOBODY EXPECTS THE TRAILING EXCLAMATION MARK!!!




-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-07 Thread Paul McGuire
On Feb 6, 1:36 pm, hzh...@gmail.com hzh...@gmail.com wrote:
 Hi,

 I am a fresh man with python. I know there is regular expressions in
 Python. What I need is that given a particular regular expression,
 output all the matches. For example, given “[1|2|3]{2}” as the regular
 expression, the program should output all 9 matches, i.e., 11 12 13
 21 22 23 31 32 33.

 Is there any well-written routine in Python or third-party program to
 do this? If there isn't, could somebody make some suggestions on how
 to write it myself?

 Thanks.

 Zhuo

Please check out this example on the pyparsing wiki, invRegex.py:
http://pyparsing.wikispaces.com/file/view/invRegex.py.  This code
implements a generator that returns successive matching strings for
the given regex.  Running it, I see that you actually have a typo in
your example.

 print list(invert([1|2|3]{2}))
['11', '1|', '12', '13', '|1', '||', '|2', '|3', '21', '2|', '22',
'23', '31', '3|', '32', '33']

I think you meant either [123]{2} or (1|2|3){2}.

 print list(invert([123]{2}))
['11', '12', '13', '21', '22', '23', '31', '32', '33']

 print list(invert((1|2|3){2}))
['11', '12', '13', '21', '22', '23', '31', '32', '33']

Of course, as other posters have pointed out, this inverter does not
accept regexen with unbounded multiple characters '+' or '*', but '?'
and {min,max} notation will work.  Even '.' is supported, although
this can generate a large number of return values.

Of course, you'll also have to install pyparsing to get this to work.

-- Paul
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-07 Thread hzh...@gmail.com


 Please check out this example on the pyparsing wiki, 
 invRegex.py:http://pyparsing.wikispaces.com/file/view/invRegex.py.  This code
 implements a generator that returns successive matching strings for
 the given regex.  Running it, I see that you actually have a typo in
 your example.

  print list(invert([1|2|3]{2}))

 ['11', '1|', '12', '13', '|1', '||', '|2', '|3', '21', '2|', '22',
 '23', '31', '3|', '32', '33']

 I think you meant either [123]{2} or (1|2|3){2}.

  print list(invert([123]{2}))

 ['11', '12', '13', '21', '22', '23', '31', '32', '33']

  print list(invert((1|2|3){2}))

 ['11', '12', '13', '21', '22', '23', '31', '32', '33']

 Of course, as other posters have pointed out, this inverter does not
 accept regexen with unbounded multiple characters '+' or '*', but '?'
 and {min,max} notation will work.  Even '.' is supported, although
 this can generate a large number of return values.

 Of course, you'll also have to install pyparsing to get this to work.

 -- Paul


Hi Paul,

Thanks very much. This is exactly what I need now. I will check this
function.

Zhuo

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-06 Thread Roy Smith
In article 
ee2cfd35-3171-4ee7-ad3a-cf117e552...@r24g2000yqd.googlegroups.com,
 hzh...@gmail.com hzh...@gmail.com wrote:

 Hi,
 
 I am a fresh man with python. I know there is regular expressions in
 Python. What I need is that given a particular regular expression,
 output all the matches. For example, given ³[1|2|3]{2}² as the regular
 expression, the program should output all 9 matches, i.e., 11 12 13
 21 22 23 31 32 33.
 
 Is there any well-written routine in Python or third-party program to
 do this? If there isn't, could somebody make some suggestions on how
 to write it myself?
 
 Thanks.
 
 Zhuo

Please enumerate all the strings which match .*.  Use additional sheets 
of paper if needed.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-06 Thread Ben Finney
Roy Smith r...@panix.com writes:

  hzh...@gmail.com hzh...@gmail.com wrote:
  What I need is that given a particular regular expression, output
  all the matches.
[…]

 Please enumerate all the strings which match .*. Use additional
 sheets of paper if needed.

+1 QOTW

-- 
 \ “Are you pondering what I'm pondering?” “I think so, ... Brain, |
  `\but how can we get seven dwarves to shave their legs?” —_Pinky |
_o__)   and The Brain_ |
Ben Finney
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-06 Thread hzh...@gmail.com
Thanks for your reply.
So there isn't such a routine just because some of the regular
expressions cannot be enumerated. However, some of them can be
enumerated. I guess I have to write a function myself.

Zhuo

On Feb 6, 5:23 pm, Roy Smith r...@panix.com wrote:
 In article
 ee2cfd35-3171-4ee7-ad3a-cf117e552...@r24g2000yqd.googlegroups.com,





  hzh...@gmail.com hzh...@gmail.com wrote:
  Hi,

  I am a fresh man with python. I know there is regular expressions in
  Python. What I need is that given a particular regular expression,
  output all the matches. For example, given ³[1|2|3]{2}² as the regular
  expression, the program should output all 9 matches, i.e., 11 12 13
  21 22 23 31 32 33.

  Is there any well-written routine in Python or third-party program to
  do this? If there isn't, could somebody make some suggestions on how
  to write it myself?

  Thanks.

  Zhuo

 Please enumerate all the strings which match .*.  Use additional sheets
 of paper if needed.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-06 Thread Steven D'Aprano
On Sat, 06 Feb 2010 16:05:15 -0800, hzh...@gmail.com wrote:

 Thanks for your reply.
 So there isn't such a routine just because some of the regular
 expressions cannot be enumerated. However, some of them can be
 enumerated. I guess I have to write a function myself.

How do you expect to tell the ones that can be enumerated apart from 
those that can't be?

Regular expressions are programs in a regex programming language. What 
you are asking for is the same as saying:

Is there a program that can enumerate every possible set of data that is 
usable as valid input for a given program?

This, in turn, is equivalent to the Halting Problem -- if you can solve 
one, you can solve the other. You might like to google on the Halting 
Problem before you spend too much time on this.

(Note, however, it isn't necessary to solve the Halting Problem for *all* 
cases in order to have a useful Endless Loop Detector program.)

Why do you think you need this? Seems to me you're starting on an 
extraordinarily difficult job. I hope the benefit is equally 
extraordinary.


[Aside: Python regexes aren't Turing Complete. I'm not sure about Perl 
regexes. Either way, this might actually be less difficult than the 
Halting Problem, as in amazingly difficult rather than impossible.]


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-06 Thread Alf P. Steinbach

* Steven D'Aprano:

On Sat, 06 Feb 2010 16:05:15 -0800, hzh...@gmail.com wrote:


Thanks for your reply.
So there isn't such a routine just because some of the regular
expressions cannot be enumerated. However, some of them can be
enumerated. I guess I have to write a function myself.


How do you expect to tell the ones that can be enumerated apart from 
those that can't be?


Regular expressions are programs in a regex programming language. What 
you are asking for is the same as saying:


Is there a program that can enumerate every possible set of data that is 
usable as valid input for a given program?


This, in turn, is equivalent to the Halting Problem -- if you can solve 
one, you can solve the other. You might like to google on the Halting 
Problem before you spend too much time on this.


Hm, well, text editors /regularly/ do repeated regular expression searches, 
producing match after match after match, on request.


To use that /expression/, it seems that Theory is yet again up against Hard 
Reality.

In such a contest where something doesn't quite /match/, is the Map, the 
Terrain, or perhaps the Interpretation of how the Map applies to Terrain, at fault?



(Note, however, it isn't necessary to solve the Halting Problem for *all* 
cases in order to have a useful Endless Loop Detector program.)


Why do you think you need this? Seems to me you're starting on an 
extraordinarily difficult job. I hope the benefit is equally 
extraordinary.


Depending on the application there may be more efficient ways than applying a 
general purpose regexp matcher.


Don't know about modern *nix but in the old days there were different greps for 
different purposes, egrep, fgrep, whatever.


Aside: the only article by Niklaus Wirth that I can remember reading was about 
how to transform algorithms to more efficient ones by exploiting the invariants, 
and one of his examples was simple text searching, where you can advance the 
pattern a number of characters depending on the current non-matching character.



[Aside: Python regexes aren't Turing Complete. I'm not sure about Perl 
regexes. Either way, this might actually be less difficult than the 
Halting Problem, as in amazingly difficult rather than impossible.]



Cheers,

- Alf
--
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-06 Thread Steven D'Aprano
On Sun, 07 Feb 2010 01:51:19 +0100, Alf P. Steinbach wrote:

 Regular expressions are programs in a regex programming language.
 What you are asking for is the same as saying:
 
 Is there a program that can enumerate every possible set of data that
 is usable as valid input for a given program?
 
 This, in turn, is equivalent to the Halting Problem -- if you can solve
 one, you can solve the other. You might like to google on the Halting
 Problem before you spend too much time on this.
 
 Hm, well, text editors /regularly/ do repeated regular expression
 searches, producing match after match after match, on request.

I think you have completely misunderstood what I'm saying.

I'm not saying that you can't *run* a regular expression against text and 
generate output. That truly would be a stupid thing to say, because I 
clearly can do this:

 import re
 mo = re.search(p.rr.t, 
... Some text containing parrots as well as other things)
 mo.group()
'parrot'

As you point out, it's not hard to embed a regex interpreter inside a 
text editor or other application, or to call an external library.

What is difficult, and potentially impossible, is to take an arbitrary 
regular expression such as p.rr.t (the program in the regex language) 
and generate every possible data (parrot, pbrrat, ...) that would 
give a match when applied to that regular expression.

Now, in this case, my example is very simple, and it would be easy to 
enumerate every possible data: there's only 65025 of them, limiting to 
the extended ASCII range excluding NUL (1-255). But for an arbitrary 
regex, it won't be that easy. Often it will be unbounded: the example of 
enumerating every string that matches .* has already been given.

The second problem is, generating the data which gives the output you 
want is potentially very, very, difficult, potentially as difficult as 
finding collisions in cryptographic hash functions:

Given the function hashlib.sha256, enumerate all the possible inputs 
that give the hexadecimal result 
0a2591aaf3340ad92faecbc5908e74d04b51ee5d2deee78f089f1607570e2e91.

This too is unbounded, but you'll have your work cut out just to find 
*one* match, let alone an infinite number of them.

(In this specific example, your best bet is to try a crib: knowing what 
newsgroup this is, and knowing what I've written in the past, the message 
is predictable for being totally unexpected. And yes, that's a hint. A 
shiny penny for the first person to guess what it is.)

I'm suggesting that, in general, there's no way to tell in advance which 
regexes will be easy and which will be hard, and even when they are easy, 
the enumeration will often be infinite. 


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-06 Thread MRAB

Alf P. Steinbach wrote:

* Steven D'Aprano:

On Sat, 06 Feb 2010 16:05:15 -0800, hzh...@gmail.com wrote:


Thanks for your reply.
So there isn't such a routine just because some of the regular
expressions cannot be enumerated. However, some of them can be
enumerated. I guess I have to write a function myself.


How do you expect to tell the ones that can be enumerated apart from 
those that can't be?


Regular expressions are programs in a regex programming language. 
What you are asking for is the same as saying:


Is there a program that can enumerate every possible set of data that 
is usable as valid input for a given program?


This, in turn, is equivalent to the Halting Problem -- if you can 
solve one, you can solve the other. You might like to google on the 
Halting Problem before you spend too much time on this.


Hm, well, text editors /regularly/ do repeated regular expression 
searches, producing match after match after match, on request.



[snip]
I'm not sure you understood what the OP was requesting: a way of
generating the strings which would match a given regex.
--
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-06 Thread Grant Edwards
On 2010-02-06, Roy Smith r...@panix.com wrote:

 I am a fresh man with python. I know there is regular expressions in
 Python. What I need is that given a particular regular expression,
 output all the matches.
[..]

 Please enumerate all the strings which match .*.  Use additional sheets 
 of paper if needed.

And be sure to show your work.

-- 
Grant

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-06 Thread Alf P. Steinbach

* Steven D'Aprano:

On Sun, 07 Feb 2010 01:51:19 +0100, Alf P. Steinbach wrote:


Regular expressions are programs in a regex programming language.
What you are asking for is the same as saying:

Is there a program that can enumerate every possible set of data that
is usable as valid input for a given program?

This, in turn, is equivalent to the Halting Problem -- if you can solve
one, you can solve the other. You might like to google on the Halting
Problem before you spend too much time on this.

Hm, well, text editors /regularly/ do repeated regular expression
searches, producing match after match after match, on request.


I think you have completely misunderstood what I'm saying.


Yes.


I'm not saying that you can't *run* a regular expression against text and 
generate output. That truly would be a stupid thing to say, because I 
clearly can do this:



import re
mo = re.search(p.rr.t, 

... Some text containing parrots as well as other things)

mo.group()

'parrot'

As you point out, it's not hard to embed a regex interpreter inside a 
text editor or other application, or to call an external library.


What is difficult, and potentially impossible, is to take an arbitrary 
regular expression such as p.rr.t (the program in the regex language) 
and generate every possible data (parrot, pbrrat, ...) that would 
give a match when applied to that regular expression.


Hm, that's not difficult to do, it's just like counting, but it's rather 
meaningless since either the output is trivial or it's in general exponential or 
infinite.


So it seems we both misunderstood the problem.

I didn't read the top level article until now, and reading it, I can't make 
sense of it.


It sounds like some kind of homework problem, but without the constraints that 
surely would be there in a homework problem.



Now, in this case, my example is very simple, and it would be easy to 
enumerate every possible data: there's only 65025 of them, limiting to 
the extended ASCII range excluding NUL (1-255). But for an arbitrary 
regex, it won't be that easy. Often it will be unbounded: the example of 
enumerating every string that matches .* has already been given.


The second problem is, generating the data which gives the output you 
want is potentially very, very, difficult, potentially as difficult as 
finding collisions in cryptographic hash functions:


Given the function hashlib.sha256, enumerate all the possible inputs 
that give the hexadecimal result 
0a2591aaf3340ad92faecbc5908e74d04b51ee5d2deee78f089f1607570e2e91.


I tried some parrot variants but no dice. :-(


[snip]

I'm suggesting that, in general, there's no way to tell in advance which 
regexes will be easy and which will be hard, and even when they are easy, 
the enumeration will often be infinite.


I agree about the (implied) meaningless, exponential/infinite output, which 
means that possibly that's not what the OP meant, but disagree about the 
reasoning about no way: really, regular expressions are /very/ limited so it's 
not hard to compute up front the number of strings it can generate from some 
given character set, in time linear in the length of the regexp.


Essentially, any regexp that includes '+' or '*' (directly or via e.g. notation 
that denotes digit sequence) yields an infinite number of strings.


And otherwise the regexp is of the form ABCDE..., where A, B, C etc are parts 
that each can generate a certain finite number of strings; multiplying these 
numbers gives the total number of strings that the regexp can generate.



Cheers,

- Alf
--
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-06 Thread hzh...@gmail.com


 So it seems we both misunderstood the problem.

 I didn't read the top level article until now, and reading it, I can't make
 sense of it.


Seems that you should read the whole thing before making a post, or
else you cannot know what we are talking about.
Steven doesn't misunderstand me. We are talking about what I need, and
he tries to help.




  Given the function hashlib.sha256, enumerate all the possible inputs
  that give the hexadecimal result
  0a2591aaf3340ad92faecbc5908e74d04b51ee5d2deee78f089f1607570e2e91.

 I tried some parrot variants but no dice. :-(

 [snip]


This is a hash collision problem. Nobody has proved that SHA-256 is
collision free, even not in the random oracle model, because people
always suppose that a random oracle exists, and make hash function its
substitution. That means it may be broken someday. And any provable
security based on random oracle model is not secure.


  I'm suggesting that, in general, there's no way to tell in advance which
  regexes will be easy and which will be hard, and even when they are easy,
  the enumeration will often be infinite.

It is hard to tell in advance. However, we can add some timing limit
or counting limit, to make it an algorithm, which can halt. For
example, whenever the program outputs more than 100 expressions
that match the input regex, we can halt because that exceeds our
limit. But surely this is not efficient because of the post-decision.




 Essentially, any regexp that includes '+' or '*' (directly or via e.g. 
 notation
 that denotes digit sequence) yields an infinite number of strings.

Infinity is really relative, not absolute. It is relative to the
computing speed. For example, the regex '^[0|1]{2048}$' is rather
simple and doesn't contain '+' or '$', but trying to output all
expressions that match it has a complexity of 2^2048. If we can do
that, then we can break RSA-2048.
We must face the reality .

Zhuo

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-06 Thread Alf P. Steinbach

* hzh...@gmail.com:

So it seems we both misunderstood the problem.

I didn't read the top level article until now, and reading it, I can't make
sense of it.



[1] Seems that you should read the whole thing before making a post, or
else you cannot know what we are talking about.
Steven doesn't misunderstand me. We are talking about what I need, and
he tries to help.


If you were not misunderstood then you've posted a question for which there's no 
practical solution.




Given the function hashlib.sha256, enumerate all the possible inputs
that give the hexadecimal result
0a2591aaf3340ad92faecbc5908e74d04b51ee5d2deee78f089f1607570e2e91.

I tried some parrot variants but no dice. :-(

[snip]



This is a hash collision problem. Nobody has proved that SHA-256 is
collision free, even not in the random oracle model, because people
always suppose that a random oracle exists, and make hash function its
substitution. That means it may be broken someday. And any provable
security based on random oracle model is not secure.


Stephen's little challenge wasn't about breaking SHA-256 but about guessing his 
secret phrase, given his clues.




I'm suggesting that, in general, there's no way to tell in advance which
regexes will be easy and which will be hard, and even when they are easy,
the enumeration will often be infinite.


It is hard to tell in advance.


No, it's trivial.



[2] However, we can add some timing limit
or counting limit, to make it an algorithm, which can halt. For
example, whenever the program outputs more than 100 expressions
that match the input regex, we can halt because that exceeds our
limit. But surely this is not efficient because of the post-decision.


You don't need to wait for that output to complete. You can calculate the number 
of strings up front. Like it appears that you do below:




Essentially, any regexp that includes '+' or '*' (directly or via e.g. notation
that denotes digit sequence) yields an infinite number of strings.


Infinity is really relative, not absolute. It is relative to the
computing speed. For example, the regex '^[0|1]{2048}$' is rather
simple and doesn't contain '+' or '$', but trying to output all
expressions that match it has a complexity of 2^2048. If we can do
that, then we can break RSA-2048.
We must face the reality .


Here it seems that you have no problem calculating number of combinations, yet 
above, at the paragraph marked [2], you talk about waiting for a million strings 
to be output before seeing that it's too much, and earlier, at the paragraph 
marked [1], you maintain that your original question about generating all such 
strings (completely impractical) was what you wanted help with?


I'm sorry but I can't make sense of this; it appears to be meaningless.

Perhaps if you tried to clarify the requirements a bit.


Cheers  hth.,

- Alf
--
http://mail.python.org/mailman/listinfo/python-list


Re: How to print all expressions that match a regular expression

2010-02-06 Thread Nobody
On Sun, 07 Feb 2010 00:26:36 +, Steven D'Aprano wrote:

 So there isn't such a routine just because some of the regular
 expressions cannot be enumerated.

No. There isn't a routine because no-one has yet felt any need to write
one.

 However, some of them can be
 enumerated. I guess I have to write a function myself.
 
 How do you expect to tell the ones that can be enumerated apart from 
 those that can't be?

The ones which use the '+', '*' and '{m,}' operators match an infinite
number of strings; the rest can only match a finite number (assuming POSIX
REs; Python also has +? and *?).

[Enumerate isn't the correct word here. You can *enumerate* an
infinite set, in the sense that you could write a Python generator for
which any member will eventually be generated.]

The obvious implementation is to construct the NFA then run it.

If you know that the RE can only match finite strings (i.e. the graph is
acyclic), then you can enumerate them using depth-first traversal. If it
can match infinite strings (i.e. if there are any cycles in the graph),
then you would need to use either breadth-first traversal or
incrementally-bounded depth-first traversal.

 [Aside: Python regexes aren't Turing Complete. I'm not sure about Perl 
 regexes. Either way, this might actually be less difficult than the 
 Halting Problem, as in amazingly difficult rather than impossible.]

Regular expressions aren't Turing complete; this is implicit in the
definition of regular. The Chomsky hierarchy has four levels, with
higher levels require a more capable system to decide whether a string is
a member of the language defined by the grammar:

grammar decidable by

regular finite automaton
context-freepushdown automaton[1]
context-sensitive   linear-bounded automaton[2]
recursively-enumerable  Turing machine

However, any regular expression syntax which allows backreferences
(including the POSIX specification) isn't actually regular in the formal
sense (as it requires an infinite number of states), but context-free.

[1] pushdown automaton = finite automaton with a stack

[2] linear-bounded automaton = Turing machine, except that it's tape is
finite and proportional to the size of the input.

http://en.wikipedia.org/wiki/Chomsky_hierarchy

-- 
http://mail.python.org/mailman/listinfo/python-list