Twitter hashtag RegEx help
Ok, so I thought I'd cracked this, and I have to some extent, but not completely.

The well-known hashtag RegEx I see in most Twitter examples is a variant of this:

##([a-z0-9\-_]+)

The problem with this is that if you run it over a string that contains HTML entities, it will recognise those HTML entities as hashtags too. E.g.

#mytrendingtopic is a valid hashtag but this isn&#39;t.

Both #mytrendingtopic *and* #39 are recognised as hashtags.

Now, my latest crack at this is a bit more complex and looks like this:

##(([a-z_\-]+[0-9_\-]*[a-z0-9_\-]+)|([0-9_\-]+[a-z_\-]+[a-z0-9_\-]+))

Which picks out:

1. All hashtags starting with alpha and containing 0 or more numbers, with _ and - at any position
2. All hashtags starting with numbers and containing 1 or more alpha, with _ and - at any position

This RegEx works well but it's still not quite right. The problem is that if the films 2001 or 2010 were hashtags, e.g. #2001 or #2010, they would get missed by the RegEx. All other hashtags are recognised just fine and HTML entities are ignored, so for the most part it's better than the original, widely used RegEx.

I've been working on a fix for this problem and have been looking at using lookahead and lookbehind, but it seems CF doesn't support all the features I need, i.e. no negative lookbehind.

So if anyone can improve on my current RegEx so I can pick out #mytrendingtopic but not #39 from the above example, I'd appreciate it very, very much...

Paul

Want to reach the ColdFusion community with something they want? Let them know on the House of Fusion mailing lists
Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:325319
Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm
Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=11502.10531.4
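To make the two failure modes in the post above concrete, here is a minimal sketch that runs both patterns over the example text. It is written in Python rather than CFML purely so it can be run anywhere, and Python's `re` module is not the same engine as ColdFusion's, so treat it as an illustration only. The patterns use a single `#` because the doubled `##` in the post is ColdFusion's escape for a literal `#`; the helper name `tags` is hypothetical.

```python
import re

# The widely used pattern from the post, without ColdFusion's "##" escape.
NAIVE = re.compile(r"#([a-z0-9\-_]+)", re.IGNORECASE)

# The two-branch pattern from the post, again with a single "#".
TWO_BRANCH = re.compile(
    r"#(([a-z_\-]+[0-9_\-]*[a-z0-9_\-]+)|([0-9_\-]+[a-z_\-]+[a-z0-9_\-]+))",
    re.IGNORECASE,
)


def tags(pattern, text):
    """Return the text captured after '#' for every match, in order."""
    return [m.group(1) for m in pattern.finditer(text)]


tweet = "#mytrendingtopic is a valid hashtag but this isn&#39;t."

# The naive pattern also picks up the "39" from the &#39; entity.
print(tags(NAIVE, tweet))       # ['mytrendingtopic', '39']

# The two-branch pattern ignores the entity...
print(tags(TWO_BRANCH, tweet))  # ['mytrendingtopic']

# ...but misses hashtags made only of digits.
print(tags(TWO_BRANCH, "Watching #2001 and #2010 tonight"))  # []
```

The all-digit case fails because both branches of the second pattern require at least one letter, underscore or hyphen somewhere in the tag.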
Re: Twitter hashtag RegEx help
You could look for hash tags without an & in front of them. Something like this:

(?:[^&]|^)##([a-z0-9_\-]+)

~Mahcsig

On Mon, Aug 10, 2009 at 10:24 AM, Paul Vernon paul.ver...@web-architect.co.uk wrote:
> So if anyone can improve on my current RegEx so I can pick out #mytrendingtopic but not #39 from the above example, I'd appreciate it very, very much...

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:325325
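As a rough check of this suggestion, the sketch below tries the pattern in Python (not CFML, so engine differences may apply). It assumes the character excluded by the negated class is `&`, which is what the description "without an & in front" implies, and it uses a single `#` in place of ColdFusion's `##` escape. The function name `hashtags` is hypothetical.

```python
import re

# Match "#tag" only at the start of the string or after a character
# that is not "&", so "&#39;" is no longer treated as a hashtag.
NOT_AFTER_AMP = re.compile(r"(?:[^&]|^)#([a-z0-9_\-]+)", re.IGNORECASE)


def hashtags(text):
    """Return the captured tag names (group 1) for every match."""
    return NOT_AFTER_AMP.findall(text)


print(hashtags("#mytrendingtopic is a valid hashtag but this isn&#39;t."))
# ['mytrendingtopic']

print(hashtags("Films: #2001, #2010"))
# ['2001', '2010']

# The whole match includes the character in front of the "#",
# because the non-capturing group still consumes it.
print(repr(NOT_AFTER_AMP.search("see #cf").group(0)))
# ' #cf'
```

Since the preceding character is consumed as part of the match, any replacement built on this pattern has to put that character back.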
RE: Twitter hashtag RegEx help
> You could look for hash tags without an & in front of them. Something like this:
> (?:[^&]|^)##([a-z0-9_\-]+)
> ~Mahcsig

That was close. Due to the needs of the code I had to modify it a bit, but you definitely put me on the right track. In the end, I arrived at this:

([^&]|^)##([a-z0-9_\-]+)

I needed the first group to be captured too, because the RegEx matches the preceding character, whether that is a space, a comma or whatever, and I need to reference it in the replace:

ReReplaceNoCase(arguments.tweet, "([^&]|^)##([a-z0-9_\-]+)", '\1<a href="http://twitter.com/search?q=%23\2">##\2</a>', "ALL")

Thanks...

Paul

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:325326
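The capture-and-restore step described above can be sketched as follows. This is a Python approximation, not the CFML call from the post: `re.sub` stands in for `ReReplaceNoCase`, the excluded character is assumed to be `&`, the anchor markup is assumed from the surviving fragments of the replacement string, and the function name `link_hashtags` is hypothetical.

```python
import re

# Group 1 captures the character before "#" (or nothing at the start
# of the string); group 2 captures the tag name.
LINKABLE = re.compile(r"([^&]|^)#([a-z0-9_\-]+)", re.IGNORECASE)


def link_hashtags(tweet):
    """Wrap each hashtag in a Twitter search link, keeping the
    character that preceded it by re-emitting group 1."""
    return LINKABLE.sub(
        r'\1<a href="http://twitter.com/search?q=%23\2">#\2</a>',
        tweet,
    )


print(link_hashtags("Loving #cfml, isn&#39;t it?"))
# Loving <a href="http://twitter.com/search?q=%23cfml">#cfml</a>, isn&#39;t it?

print(link_hashtags("#2001 rocks"))
# <a href="http://twitter.com/search?q=%232001">#2001</a> rocks
```

Without `\1` at the front of the replacement, the space in front of `#cfml` would be swallowed and the link would run into the preceding word.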