ot: regular expression help

2009-07-07 Thread Aryeh M. Friedman
I am attempting to make (without the Perl extensions) a regular
expression that, when used as a delimiter, will split words on any
punctuation/whitespace character *EXCEPT* $ (for Java people: I want to
feed it into something like this):


for(String foo:input.split([insert regex here]))
   ...
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to freebsd-questions-unsubscr...@freebsd.org


Re: ot: regular expression help

2009-07-07 Thread Steve Bertrand
Aryeh M. Friedman wrote:
 I am attempting to make (without the Perl extensions) a regular
 expression that, when used as a delimiter, will split words on any
 punctuation/whitespace character *EXCEPT* $ (for Java people: I want to
 feed it into something like this):
 
 for(String foo:input.split([insert regex here]))

Since regexes are generally portable, here is a Perl version that splits
on any non-alphanumeric character, including spaces, but disregards
the '$'. (The regex is the part between / and /. Even though you said no
Perl, I did it anyway ;)

my $string = 'hello%wo$rld*ste ve';

print join (', ', (split(/[^\w\$]/, $string)));

...output:

hello, wo$rld, ste, ve
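
For the Java side of the question, roughly the same character class
should carry over. This is only a minimal sketch, assuming the default
java.util.regex behaviour where \w means [a-zA-Z_0-9]; the class name
DollarSplit and the helper words() are made up for illustration. Two
things to be aware of: because \w includes the underscore, '_' stays
inside words along with '$', and a '+' is added here so that a run of
delimiters does not produce empty strings.

```java
class DollarSplit {
    // Split on runs of anything that is neither a word character nor '$'.
    // Inside a character class '$' is a literal, so it needs no escaping.
    static String[] words(String input) {
        return input.split("[^\\w$]+");
    }

    public static void main(String[] args) {
        // Same sample string as the Perl example above.
        for (String foo : words("hello%wo$rld*ste ve")) {
            System.out.println(foo);
        }
    }
}
```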

Steve






Re: ot: regular expression help

2009-07-07 Thread Matthew Seaman

Aryeh M. Friedman wrote:
I am attempting to make (without the Perl extensions) a regular
expression that, when used as a delimiter, will split words on any
punctuation/whitespace character *EXCEPT* $ (for Java people: I want to
feed it into something like this):


for(String foo:input.split([insert regex here]))
   ...


Well, there's no way to say all foo except bar using standard regexes, so
you can't use the [:punct:] character class. You'll have to roll your own
class.

If your input is ASCII then see ispunct(3) for a handy list of all the
ASCII punctuation characters.  I guess you'll need an RE something like this:

  []!#%'\(\)\*\+,\./:;=?...@[\\^_`{\|}~-[:space:]]+

although that's completely untried, quite likely to not have all the
metacharacters properly escaped (exactly what is or isn't a metacharacter
depends on the RE implementation you're using) and is probably horribly
confused due to the inclusion of '[' '-' and ']' amongst the characters
matched in the range.  
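
If the target really is Java, its regex engine has a non-standard
extension that may sidestep the hand-rolled class: character class
intersection with &&. The sketch below assumes java.util.regex, where
\p{Punct} is the ASCII punctuation set and [^$] removes '$' from it; the
class name PunctSplit and the helper words() are invented for the
example. This does not help with POSIX or other RE libraries, which have
no such operator.

```java
import java.util.regex.Pattern;

class PunctSplit {
    // (ASCII punctuation OR whitespace) AND (anything but '$').
    // In Java, union binds tighter than the && intersection.
    static final Pattern DELIM = Pattern.compile("[\\p{Punct}\\s&&[^$]]+");

    static String[] words(String input) {
        return DELIM.split(input);
    }

    public static void main(String[] args) {
        for (String foo : words("a,b$c d_e;;f")) {
            System.out.println(foo);
        }
    }
}
```

Unlike a class built on \w, this one treats '_' as punctuation and so
splits on it.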


If you're using anything other than ASCII, then I suspect you're going
to have problems with RE libs anyhow, unless you can somehow use PCRE.
The \p{isPunct} and \p{isWhite} escapes for matching Unicode punctuation
or whitespace are probably what you need.
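
For non-ASCII input in Java specifically, the property names are spelled
a little differently from the ones above. A rough sketch, assuming Java 7
or later (for \p{IsWhite_Space}) and using the general categories \p{P}
(punctuation) and \p{S} (symbols); UnicodeSplit and words() are
hypothetical names. Worth knowing: Unicode classes '$' as a currency
symbol (Sc) rather than punctuation, and characters such as '+', '<' and
'~' are symbols too, so \p{P} on its own would leave all of those alone.

```java
import java.util.regex.Pattern;

class UnicodeSplit {
    // (Unicode punctuation OR symbols OR white space) AND (anything but '$').
    static final Pattern DELIM =
        Pattern.compile("[\\p{P}\\p{S}\\p{IsWhite_Space}&&[^$]]+");

    static String[] words(String input) {
        return DELIM.split(input);
    }

    public static void main(String[] args) {
        // U+2014 EM DASH, U+3000 IDEOGRAPHIC SPACE, U+00A1 INVERTED EXCLAMATION MARK
        for (String foo : words("hola\u2014mundo$uno\u3000dos\u00a1fin")) {
            System.out.println(foo);
        }
    }
}
```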

Even so, your best choice would probably be to separately check strings
for the presence of $ characters -- maybe transform those $ characters to
something else -- and then split on any remaining punctuation characters.
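
That transform-then-split idea could look something like this in Java.
It is only a sketch: the placeholder U+E000 (a private-use code point) is
an arbitrary choice and assumes that character never appears in the real
input, and PlaceholderSplit and words() are made-up names.

```java
class PlaceholderSplit {
    // Hypothetical stand-in for '$'; assumed not to occur in the input.
    static final char PLACEHOLDER = '\uE000';

    static String[] words(String input) {
        // 1. Hide every '$' behind the placeholder.
        String hidden = input.replace('$', PLACEHOLDER);
        // 2. Split on runs of ASCII punctuation or whitespace.
        String[] parts = hidden.split("[\\p{Punct}\\s]+");
        // 3. Put the '$' characters back.
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].replace(PLACEHOLDER, '$');
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String foo : words("cost:$5,tax $1")) {
            System.out.println(foo);
        }
    }
}
```

The split pattern here needs no exceptions at all, which is the appeal
of doing it in two passes.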

Cheers,

Matthew

--
Dr Matthew J Seaman MA, D.Phil.                   7 Priory Courtyard
                                                  Flat 3
PGP: http://www.infracaninophile.co.uk/pgpkey     Ramsgate
                                                  Kent, CT11 9PW


