Twitter hashtag RegEx help

2009-08-10 Thread Paul Vernon

Ok, so I thought I'd cracked this and I have to some extent but not completely.

The well-known hashtag RegEx I see in most Twitter examples is a variant of 
this: ##([a-z0-9\-_]+) (the doubled ## is just CF's escape for a literal #).

The problem with this is that if you run it over a string that contains HTML 
entities, it will recognise those HTML entities as hashtags too...

E.g. #mytrendingtopic is a valid hashtag but this isn&#39;t.

#mytrendingtopic *and* #39 are recognised as hashtags...
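
The over-matching is easy to reproduce. Below is a minimal sketch that uses Python's re module as a stand-in for CF's engine, so the escaped ## is written as a plain # and re.IGNORECASE takes the place of the NoCase functions. The names NAIVE and naive_tags are made up for illustration.

```python
import re

# The widely used pattern; Python needs no ## escaping, so a single # is used.
NAIVE = re.compile(r"#([a-z0-9\-_]+)", re.IGNORECASE)

text = "E.g. #mytrendingtopic is a valid hashtag but this isn&#39;t."

# findall returns the captured tag text for every match.
naive_tags = NAIVE.findall(text)
print(naive_tags)  # ['mytrendingtopic', '39']
```

The second entry comes from the &#39; entity, which is exactly the false positive described above.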

Now, my latest crack at this is a bit more complex and looks like this:

##(([a-z_\-]+[0-9_\-]*[a-z0-9_\-]+)|([0-9_\-]+[a-z_\-]+[a-z0-9_\-]+))

Which picks out:

1. All hashtags starting with alpha and containing 0 or more numbers, with _ 
and - at any position
2. All hashtags starting with numbers and containing 1 or more alpha, with _ 
and - at any position

This RegEx works well but it's still not quite right. The problem is that if 
the films 2001 or 2010 were hashtags, e.g. #2001 or #2010, they would be 
missed, because a tag made up only of digits never matches. All other 
hashtags are recognised just fine and HTML entities are ignored, so for the 
most part it's better than the original RegEx as widely used.
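
The behaviour described above can be checked with a short sketch. It again uses Python's re module as a stand-in for CF (single # instead of ##), and the names STRICTER and tags are illustrative only.

```python
import re

# Second attempt from the post, with CF's ## written as a plain #.
STRICTER = re.compile(
    r"#(([a-z_\-]+[0-9_\-]*[a-z0-9_\-]+)|([0-9_\-]+[a-z_\-]+[a-z0-9_\-]+))",
    re.IGNORECASE,
)

def tags(text):
    # group(1) is the whole tag, whichever branch matched.
    return [m.group(1) for m in STRICTER.finditer(text)]

print(tags("#mytrendingtopic but this isn&#39;t"))  # ['mytrendingtopic']
print(tags("watching #2001 and #2010 tonight"))     # []
print(tags("#3d_print"))                            # ['3d_print']
```

The same sketch suggests that very short tags such as #a or #1a also fall through, since the first branch needs at least two characters and the second at least three.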

I've been working on a fix for this problem and been looking at using lookahead 
and lookbehind but it seems CF doesn't support all the features I need, i.e. no 
negative lookbehind.
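
For comparison, this is roughly what the negative lookbehind version would look like in an engine that does support it. The sketch uses Python's re module, not CF, and the names LOOKBEHIND and lookbehind_tags are made up for illustration.

```python
import re

# (?<!&) is a negative lookbehind: match # only when the previous
# character is not "&". Nothing before the # is consumed.
LOOKBEHIND = re.compile(r"(?<!&)#([a-z0-9_\-]+)", re.IGNORECASE)

text = "#2001 and #mytrendingtopic are tags but this isn&#39;t."
lookbehind_tags = LOOKBEHIND.findall(text)
print(lookbehind_tags)  # ['2001', 'mytrendingtopic']
```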

So if anyone can improve on my current RegEx so I can pick out #mytrendingtopic 
but not #39 from the above example, I'd appreciate it very, very much...

Paul




~|
Want to reach the ColdFusion community with something they want? Let them know 
on the House of Fusion mailing lists
Archive: 
http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:325319
Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm
Unsubscribe: 
http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=11502.10531.4


Re: Twitter hashtag RegEx help

2009-08-10 Thread Mahcsig

You could look for hash tags without an & in front of them, something like
this: (?:[^&]|^)##([a-z0-9_\-]+)
~Mahcsig
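
A quick way to see what this suggestion does is the sketch below. It uses Python's re module as a stand-in for CF (single # instead of ##), and the names SUGGESTED and suggested_tags are illustrative only.

```python
import re

# Either a character that is not "&" or the start of the string must
# come right before the #. The non-capturing group consumes that character.
SUGGESTED = re.compile(r"(?:[^&]|^)#([a-z0-9_\-]+)", re.IGNORECASE)

text = "#2001 and #mytrendingtopic are tags but this isn&#39;t."
suggested_tags = SUGGESTED.findall(text)
print(suggested_tags)  # ['2001', 'mytrendingtopic']
```

Because the character before the # is part of the match, two tags with nothing between them (e.g. "#a#b") yield only the first one in this sketch.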






RE: Twitter hashtag RegEx help

2009-08-10 Thread Paul Vernon

> You could look for hash tags without an & in front of it.  something
> like
> this: (?:[^&]|^)##([a-z0-9_\-]+)
> ~Mahcsig

That was close. Due to the needs of the code I had to modify it a bit, but
you definitely put me on the right track.

In the end, I arrived at ([^&]|^)##([a-z0-9_\-]+). The RegEx also matches
the character before the #, whether that is a space, a comma or whatever,
so I made the first group a capturing one in order to reference it in the
replace and put that character back.

ReReplaceNoCase(arguments.tweet, "([^&]|^)##([a-z0-9_\-]+)",
'\1<a href="http://twitter.com/search?q=%23\2">##\2</a>', "ALL")
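
The effect of the replace call above can be sketched as follows. This is Python's re module standing in for CF (single # instead of ##, re.IGNORECASE instead of the NoCase function), and link_hashtags is a hypothetical helper name, not part of the original code.

```python
import re

PATTERN = re.compile(r"([^&]|^)#([a-z0-9_\-]+)", re.IGNORECASE)

def link_hashtags(tweet):
    # \1 puts back the character consumed before the #; \2 is the tag text.
    return PATTERN.sub(
        r'\1<a href="http://twitter.com/search?q=%23\2">#\2</a>', tweet
    )

linked = link_hashtags("E.g. #mytrendingtopic is valid but this isn&#39;t.")
print(linked)
```

The hashtag is wrapped in a search link while the &#39; entity is left untouched.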

Thanks...

Paul




