On 31/05/11 11:54, Ghassan Gharabli wrote:
Hello again,

         #generic http://variable.domain.com/path/filename."ex", "ext" or "exte"
         #http://cdn1-28.projectplaylist.com
         #http://s1sdlod041.bcst.cdn.s1s.yimg.com
#} elsif (m/^http:\/\/(.*?)(\.[^\.\-]*?\..*?)\/([^\?\&\=]*)\.([\w\d]{2,4})\??.*$/) {
#        @y = ($1,$2,$3,$4);
#        $y[0] =~ s/([a-z][0-9][a-z]dlod[\d]{3})|((cache|cdn)[-\d]*)|([a-zA-Z]+-?[0-9]+(-[a-zA-Z]*)?)/cdn/;
#        print $x . "storeurl://" . $y[0] . $y[1] . "/" . $y[2] . "." . $y[3] . "\n";


Why did we have to use arrays in this example?
I understood that m/ indicates a regex match operation, "\n" breaks
the line, and we assigned @y as an array which has
4 values, so for example we call $1, the first
record, $y[0] .. till now it's fine for me.
Then we change the value of $y[0] with $y[0] =~
s/([a-z][0-9][a-z]dlod[\d]{3})|((cache|cdn)[-\d]*)|([a-zA-Z]+-?[0-9]+(-[a-zA-Z]*)?)/cdn/;
...

Please correct me if I'm wrong here. I'm still confused about those
values $1, $2, $3 ..
How does the program know where to locate $1 or $2, as there are no
values or $strings anyway?
As I have noticed, $1 means an element. For example
http://cdn1-28.projectplaylist.com can be grouped as elements .. Hope
I'm correct on this one:
http://(cdn1-28) . (projectplaylist) . (com) should be http:// $1 . $2 . $3


m// sets $1, $2, ... $9, one for each () group in the pattern, counted left to right.

s// is a match as well, so it sets its own $1, $2, ... etc. and the ones from m// are lost. You have to save the ones from m// somewhere if you want to use them after s//. The person who wrote that saves them in the array @y.
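
A minimal sketch of that, using the projectplaylist host name from the question. The three-group pattern here is only an illustration I made up for this host name, not the pattern from the script:

```perl
use strict;
use warnings;

my $url = 'http://cdn1-28.projectplaylist.com';

my @y;
if ($url =~ m{^http://([^.]+)\.([^.]+)\.([^./]+)}) {
    # Each () group, counted left to right, fills $1, $2, $3.
    @y = ($1, $2, $3);    # save them before any other regex runs

    # This s// has a () group of its own, so when it matches
    # it replaces the $1 that m// had set.
    $y[0] =~ s/(cdn)[-\d]*/cdn/;

    print "after s//, \$1 is: $1\n";                  # cdn
    print "saved copy: $y[0] . $y[1] . $y[2]\n";      # cdn . projectplaylist . com
}
```

The saved copy in @y is untouched by whatever the later s// does to $1, $2, ...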


Then let me see if I can solve this one to match this URL
http://down2.nogomi.com.xn55571528exgem0o65xymsgtmjiy75924mjqqybp.nogomi.com/M15/Alaa_Zalzaly/Atrak/Nogomi.com_Alaa_Zalzaly-3ali_Tar.mp3

so I should work on the FQDN and leave the rest as is. Please, if
you find anything wrong in this, correct it for me
#does that match
http://down2.nogomi.com.xn55571528exgem0o65xymsgtmjiy75924mjqqybp.nogomi.com/M15/Alaa_Zalzaly/Atrak/Nogomi.com_Alaa_Zalzaly-3ali_Tar.mp3
   ??
elsif (m/^http:\/\/(.*?)(\.[^\.\-]*?\..*?)\/([^\?\&\=]*)\.([\w\d]{2,4})\??.*$/)
{
       @y = ($1,$2,$3,$4);
       $y[0] =~ s/[a-z0-9A-Z\.\-]+/cdn/;
       print $x . "storeurl://" . $y[0] . $y[1] . "/" . $y[2] . "." . $y[3] . "\n";


Does this example match the Nogomi.com domain correctly?

And why did you use s/[a-z0-9A-Z\.\-]+/cdn/ ?

I only understood that you are making sure to find small letters,
capital letters and numbers, but I believe \. is to search
for one dot only .. how about if there are 2 dots, or 3 or more dots,
in this case? .. Also you are looking for a dash ..

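That pattern ends with "+", which means "one or more" of the listed safe host name characters. Inside [ ] the \. is a literal dot, so any number of dots and dashes within the run are covered.

It matches all of the $y[0] content, for example:
"down2.nogomi.com.xn55571528exgem0o65xymsgtmjiy75924mjqqybp"
(the first () group in that m// is non-greedy, so print @y to check how much of the host name really ends up in $y[0] and how much in $y[1]).

The aim is to also not match bad things like
  "http://evil.com?url=http://nogomi..."

(the one I gave will convert "http://evil.com?url=http://nogomi..." -->
"cdn://evil.com?url=http://nogomi..." )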



The only thing I'm confused about is why we have added /cdn/ since the
URL doesn't have the word "cdn"?

This is a "s//" operation ('s' meaning 'substitute'). *IF* the $y[0] value matches the pattern for a domain, s// will put "cdn" in place of the matched piece.

So what this does is change *.nogomi.com -->  "cdn.nogomi.com"

If there is any bad stuff like my evil.com example going on, it will mangle those URLs as well. BUT the mangled result will not map to "cdn.nogomi.com", so it will not corrupt the actual CDN content.

Thinking about it a bit more I should have been more careful and told you:
  s/^[a-z0-9A-Z\.\-]+$/cdn/

which will ONLY match if the $y[0] as a whole is a valid host name text.
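
A small sketch to compare the two forms side by side. The sub names are made up for the illustration; the two test strings are the ones used above, with the evil.com one cut short:

```perl
use strict;
use warnings;

# Unanchored: replaces the first run of host name characters it finds.
sub replace_unanchored {
    my ($host) = @_;
    $host =~ s/[a-z0-9A-Z\.\-]+/cdn/;
    return $host;
}

# Anchored: replaces only when the whole string is host name characters.
sub replace_anchored {
    my ($host) = @_;
    $host =~ s/^[a-z0-9A-Z\.\-]+$/cdn/;
    return $host;
}

for my $host ('down2.nogomi.com.xn55571528exgem0o65xymsgtmjiy75924mjqqybp',
              'http://evil.com?url=http://nogomi') {
    print "$host\n";
    print "  unanchored: ", replace_unanchored($host), "\n";
    print "  anchored:   ", replace_anchored($host), "\n";
}
```

For the clean host name both forms give "cdn". For the evil.com string the unanchored form rewrites the leading "http" to "cdn", while the anchored form leaves the string alone.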


Why have we used storeurl://? Because I can see some of the examples are
print $x . "http://" . $y[0] . $y[1] . "/" . $y[2] . "." . $y[3] . "\n";

can you give me an example to add the portion of $y[1] please..

elsif (m/^http:\/\/(.*?)(\.[^\.\-]*?\..*?)\/([^\?\&\=]*)\.([\w\d]{2,4})\??.*$/)
 {
   @y = ($1,$2,$3,$4);

   if ($y[1] =~ m/nogomi\.com/) {
     $y[0] =~ s/[a-z0-9A-Z\.\-]+/cdn/;
   } else {

     $y[0] =~ s/([a-z][0-9][a-z]dlod[\d]{3})|((cache|cdn)[-\d]*)|([a-zA-Z]+-?[0-9]+(-[a-zA-Z]*)?)/cdn/;

   }

print $x . "storeurl://" . $y[0] . $y[1] . "/" . $y[2] . "." . $y[3] . "\n";

 }


Which one do you prefer: writing one rule to match most of the
similar examples, or writing a separate rule for each FQDN?

The example you started with had some complex details built into its s// matching, so that one particular CDN host name syntax would be detected and replaced. This is useful when the CDN is only some sub-domains of the main site and there are other non-CDN sub-domains to be avoided. Those are the nasty CDN.


The one I've just put above is for use when the site just uses all its subdomains as CDN for the same content. These are the semi-friendly CDN. You can extend that for other CDN by adding their base domains to the m// test, i.e. if ($y[1] =~ m/(nogomi|example)\.com|example\.net/) ...
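
Pulled together as something that can be run on its own, a rough sketch could look like this. The sub name store_url, the $friendly list and the track.mp3 URLs are made up for the illustration, the example.com and example.net entries are placeholders, and the $x prefix the helper prints is left out. It uses the anchored form of the s// and ties the base domain test to the end of $y[1]:

```perl
use strict;
use warnings;

# Hypothetical list of base domains whose subdomains all serve the same content.
my $friendly = qr/\.(?:nogomi\.com|example\.com|example\.net)$/;

sub store_url {
    my ($url) = @_;
    return unless $url =~
        m/^http:\/\/(.*?)(\.[^\.\-]*?\..*?)\/([^\?\&\=]*)\.([\w\d]{2,4})\??.*$/;
    my @y = ($1, $2, $3, $4);    # save the m// captures before s// runs

    if ($y[1] =~ $friendly) {
        # semi-friendly CDN: any host name text in front becomes "cdn"
        $y[0] =~ s/^[a-z0-9A-Z\.\-]+$/cdn/;
    } else {
        # nasty CDN: only the known CDN host name syntax is replaced
        $y[0] =~ s/([a-z][0-9][a-z]dlod[\d]{3})|((cache|cdn)[-\d]*)|([a-zA-Z]+-?[0-9]+(-[a-zA-Z]*)?)/cdn/;
    }
    return "storeurl://" . $y[0] . $y[1] . "/" . $y[2] . "." . $y[3];
}

for my $url ('http://down2.nogomi.com/M15/track.mp3',
             'http://down7.nogomi.com/M15/track.mp3') {
    print store_url($url), "\n";
}
```

Both test URLs should come out as the same store URL. Printing @y inside the sub is a quick way to see how a longer host name gets split between $y[0] and $y[1].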

This is only safe when the subdomain portion has meaning to the CDN operator as:
 a) their client account token
 b) their data center routing tagging
 c) their load-balanced server hostname
 .. or similar internal *routing* details.

If there is any content clash on the URL-path portion it cannot be done like this. As I said at the start you MUST BE CERTAIN of the meaning of the bits you are removing.


The horribly nasty ones (looking at akamai here) need the other end of the CDN domain stripped off. example.com.akamai.com --> example.com

None of these patterns we have talked about so far is suitable for those ones.
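
For those the replacement has to work on the tail of the host name instead of the front. A bare sketch of the idea, assuming the CDN suffix is exactly the ".akamai.com" from the illustration above (real CDN host names will differ, and the sub name is made up):

```perl
use strict;
use warnings;

# Assumed CDN suffix, taken from the illustration only.
my $cdn_suffix = qr/\.akamai\.com$/;

sub origin_host {
    my ($host) = @_;
    $host =~ s/$cdn_suffix//;    # strip the CDN part from the END of the name
    return $host;
}

print origin_host('example.com.akamai.com'), "\n";    # example.com
```

A host name without that suffix passes through unchanged.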


for example sometimes we see
http://down2.xn55571528exgem0o65xymsgtmjiy75924mjqqybp.example.com/folder/filename.ext
or
http://cdn.xn55571528exgem0o65xymsgtmjiy75924mjqqybp.example2.com/xn55571528exgem0o65xymsgtmjiy75924mjqqybp/folder/filename.ext

really that is interesting to me , that is why I would love to match
this too as well but the thing is if I knew all of these things ..
everything would be fine for me

That is getting extra complex.

I think you have so far been obfuscating the actual URLs and details for examples. Domain bits are relatively easy due to their limited character set.

Path bits must be coded for particular instances, often with exacting knowledge. hello.txt and hello.Txt are not necessarily the same file, for example. Be sure what you "know" is correct, and that your patterns work. Then test, re-test, and test again. Then when it's working, re-test.
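
One way to make the re-testing cheap is to keep a list of input URLs with the result you expect for each, and run the whole list after every change. A sketch, where the rewrite sub and all the example.com URLs are placeholders for whatever rule is being tested:

```perl
use strict;
use warnings;

# Hypothetical rule under test: collapse any host under example.com.
sub rewrite {
    my ($url) = @_;
    $url =~ s{^http://[a-z0-9A-Z.\-]+\.example\.com/}{storeurl://cdn.example.com/};
    return $url;
}

my @cases = (
    # [ input, expected result ]
    ['http://a1.example.com/dir/hello.txt', 'storeurl://cdn.example.com/dir/hello.txt'],
    ['http://b2.example.com/dir/hello.txt', 'storeurl://cdn.example.com/dir/hello.txt'],
    # the path is case-sensitive and must stay exactly as it was
    ['http://a1.example.com/dir/hello.Txt', 'storeurl://cdn.example.com/dir/hello.Txt'],
    # the domain inside a query string must not trigger the rule
    ['http://evil.com/?u=http://a1.example.com/x.txt',
     'http://evil.com/?u=http://a1.example.com/x.txt'],
);

my $failed = 0;
for my $case (@cases) {
    my ($in, $want) = @$case;
    my $got = rewrite($in);
    next if $got eq $want;
    $failed++;
    print "FAIL: $in\n  got:  $got\n  want: $want\n";
}
print $failed ? "$failed case(s) failed\n" : "all cases passed\n";
```

Every URL that ever caused trouble can be added to @cases so it gets checked again next time.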


Again I want to thank you for answering my questions as I felt like Im
writing a magazine heheheh

And you get a book back :)
Welcome. Though this is about as far as I can go on examples and generics. My own skill with regex is not that great. (10 years in and I'm still learning "common knowledge" details about it.)

Amos
--
Please be using
  Current Stable Squid 2.7.STABLE9 or 3.1.12
  Beta testers wanted for 3.2.0.7 and 3.1.12.1
