Re: Extracting patterns after matching a regex
On Sep 9, 4:58 pm, Al Fansome al_fans...@hotmail.com wrote: Mart. wrote: On Sep 8, 4:33 pm, MRAB pyt...@mrabarnett.plus.com wrote: Mart. wrote: On Sep 8, 3:53 pm, MRAB pyt...@mrabarnett.plus.com wrote: Mart. wrote: On Sep 8, 3:14 pm, Andreas Tawn andreas.t...@ubisoft.com wrote: Hi, I need to extract a string after a matching a regular expression. For example I have the string... s = FTPHOST: e4ftl01u.ecs.nasa.gov and once I match FTPHOST I would like to extract e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the problem, I had been trying to match the string using something like this: m = re.findall(rFTPHOST, s) But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem! Thanks in advance for the help, Martin No need for regex. s = FTPHOST: e4ftl01u.ecs.nasa.gov If FTPHOST in s: return s[9:] Cheers, Drea Sorry perhaps I didn't make it clear enough, so apologies. I only presented the example s = FTPHOST: e4ftl01u.ecs.nasa.gov as I thought this easily encompassed the problem. The solution presented works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But when I used this on the actual file I am trying to parse I realised it is slightly more complicated as this also pulls out other information, for example it prints e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/ 0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n', etc. So I need to find a way to stop it before the \r slicing the string wouldn't work in this scenario as I can envisage a situation where the string lenght increases and I would prefer not to keep having to change the string. 
If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, then s[0].split(:)[1].strip() will work. It is an email which contains information before and after the main section I am interested in, namely... FINISHED: 09/07/2009 08:42:31 MEDIATYPE: FtpPull MEDIAFORMAT: FILEFORMAT FTPHOST: e4ftl01u.ecs.nasa.gov FTPDIR: /PullDir/0301872638CySfQB Ftp Pull Download Links: ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB Down load ZIP file of packaged order: ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip FTPEXPR: 09/12/2009 08:42:31 MEDIA 1 of 1 MEDIAID: I have been doing this to turn the email into a string email = sys.argv[1] f = open(email, 'r') s = str(f.readlines()) To me that seems a strange thing to do. You could just read the entire file as a string: f = open(email, 'r') s = f.read() so FTPHOST isn't the first element, it is just part of a larger string. When I turn the email into a string it looks like... 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r \n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n', So not sure splitting it like you suggested works in this case. Within the file are a list of files, e.g. TOTAL FILES: 2 FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf FILESIZE: 11028908 FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml FILESIZE: 18975 and what i want to do is get the ftp address from the file and collect these files to pull down from the web e.g. 
MOD13A2.A2007033.h17v08.005.2007101023605.hdf MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml Thus far I have #!/usr/bin/env python import sys import re import urllib email = sys.argv[1] f = open(email, 'r') s = str(f.readlines()) m = re.findall(rMOD\.\.h..v..\.005\..\ \, s) ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1) ftpdir = re.search(r'FTPDIR: (.*?)\\r',s).group(1) url = 'ftp://' + ftphost + ftpdir for i in xrange(len(m)): print i, ':', len(m) file1 = m[i][:-4] # remove xml bit. file2 = m[i] urllib.urlretrieve(url, file1) urllib.urlretrieve(url, file2) which works, clearly my match for the MOD13A2* files isn't ideal I guess, but they will always occupt those dimensions, so it should work. Any suggestions on how to improve this are appreciated. Suppose the file contains your example text above. Using 'readlines' returns a list of the lines: f = open(email, 'r') lines = f.readlines() lines ['TOTAL FILES: 2\n', '\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n', '\t\tFILESIZE: 11028908\n', '\n', '\t\tFILENAME:
Re: Extracting patterns after matching a regex
Mart. wrote: On Sep 8, 4:33 pm, MRAB pyt...@mrabarnett.plus.com wrote: Mart. wrote: On Sep 8, 3:53 pm, MRAB pyt...@mrabarnett.plus.com wrote: Mart. wrote: On Sep 8, 3:14 pm, Andreas Tawn andreas.t...@ubisoft.com wrote: Hi, I need to extract a string after a matching a regular expression. For example I have the string... s = FTPHOST: e4ftl01u.ecs.nasa.gov and once I match FTPHOST I would like to extract e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the problem, I had been trying to match the string using something like this: m = re.findall(rFTPHOST, s) But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem! Thanks in advance for the help, Martin No need for regex. s = FTPHOST: e4ftl01u.ecs.nasa.gov If FTPHOST in s: return s[9:] Cheers, Drea Sorry perhaps I didn't make it clear enough, so apologies. I only presented the example s = FTPHOST: e4ftl01u.ecs.nasa.gov as I thought this easily encompassed the problem. The solution presented works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But when I used this on the actual file I am trying to parse I realised it is slightly more complicated as this also pulls out other information, for example it prints e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/ 0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n', etc. So I need to find a way to stop it before the \r slicing the string wouldn't work in this scenario as I can envisage a situation where the string lenght increases and I would prefer not to keep having to change the string. If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, then s[0].split(:)[1].strip() will work. 
It is an email which contains information before and after the main section I am interested in, namely... FINISHED: 09/07/2009 08:42:31 MEDIATYPE: FtpPull MEDIAFORMAT: FILEFORMAT FTPHOST: e4ftl01u.ecs.nasa.gov FTPDIR: /PullDir/0301872638CySfQB Ftp Pull Download Links: ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB Down load ZIP file of packaged order: ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip FTPEXPR: 09/12/2009 08:42:31 MEDIA 1 of 1 MEDIAID: I have been doing this to turn the email into a string email = sys.argv[1] f = open(email, 'r') s = str(f.readlines()) To me that seems a strange thing to do. You could just read the entire file as a string: f = open(email, 'r') s = f.read() so FTPHOST isn't the first element, it is just part of a larger string. When I turn the email into a string it looks like... 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r \n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n', So not sure splitting it like you suggested works in this case. Within the file are a list of files, e.g. TOTAL FILES: 2 FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf FILESIZE: 11028908 FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml FILESIZE: 18975 and what i want to do is get the ftp address from the file and collect these files to pull down from the web e.g. 
MOD13A2.A2007033.h17v08.005.2007101023605.hdf MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml Thus far I have #!/usr/bin/env python import sys import re import urllib email = sys.argv[1] f = open(email, 'r') s = str(f.readlines()) m = re.findall(rMOD\.\.h..v..\.005\..\ \, s) ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1) ftpdir = re.search(r'FTPDIR: (.*?)\\r',s).group(1) url = 'ftp://' + ftphost + ftpdir for i in xrange(len(m)): print i, ':', len(m) file1 = m[i][:-4] # remove xml bit. file2 = m[i] urllib.urlretrieve(url, file1) urllib.urlretrieve(url, file2) which works, clearly my match for the MOD13A2* files isn't ideal I guess, but they will always occupt those dimensions, so it should work. Any suggestions on how to improve this are appreciated. Suppose the file contains your example text above. Using 'readlines' returns a list of the lines: f = open(email, 'r') lines = f.readlines() lines ['TOTAL FILES: 2\n', '\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n', '\t\tFILESIZE: 11028908\n', '\n', '\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n', '\t\tFILESIZE: 18975\n'] Using 'str' on that list then converts it to s string _representation_ of that list: str(lines) ['TOTAL FILES: 2\\n', '\\t\\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\\n', '\\t\\tFILESIZE: 11028908\\n', '\\n',
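For what it's worth, here is a rough sketch of how the script quoted above might be restructured, assuming the email is read as plain text (not as str(f.readlines())). It is written for Python 3, whereas the thread uses Python 2. The helper name download_urls is made up for illustration, the sample text is stitched together from fragments quoted in this thread, and the guess that each file sits at host + dir + "/" + filename is an assumption the thread does not confirm.

```python
import re

def download_urls(text):
    """Return (filename, url) pairs built from FTPHOST, FTPDIR and FILENAME lines.

    Hypothetical helper: the URL layout (host + dir + '/' + name) is a guess.
    Raises AttributeError if FTPHOST or FTPDIR is missing from the text.
    """
    # re.MULTILINE lets ^ match at the start of every line, and \S+ stops at
    # the first whitespace character, so a trailing \r or \n is never captured.
    host = re.search(r"^FTPHOST: (\S+)", text, re.MULTILINE).group(1)
    ftpdir = re.search(r"^FTPDIR: (\S+)", text, re.MULTILINE).group(1)
    names = re.findall(r"^\s*FILENAME: (\S+)", text, re.MULTILINE)
    base = "ftp://" + host + ftpdir.rstrip("/")
    return [(name, base + "/" + name) for name in names]

# Sample assembled from pieces of the email shown in the thread.
sample = (
    "FTPHOST: e4ftl01u.ecs.nasa.gov\n"
    "FTPDIR: /PullDir/0301872638CySfQB\n"
    "TOTAL FILES: 2\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n"
    "\t\tFILESIZE: 11028908\n"
    "\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n"
    "\t\tFILESIZE: 18975\n"
)

pairs = download_urls(sample)
for name, url in pairs:
    print(name, "<-", url)
```

Matching on the FILENAME: label avoids having to describe the shape of the MOD13A2 names at all. Each (name, url) pair could then be handed to something like urllib.request.urlretrieve(url, name), which is left out here so the sketch runs without network access.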
Extracting patterns after matching a regex
Hi,

I need to extract a string after matching a regular expression. For example, I have the string...

    s = "FTPHOST: e4ftl01u.ecs.nasa.gov"

and once I match "FTPHOST" I would like to extract "e4ftl01u.ecs.nasa.gov". I am not sure of the best approach to the problem. I had been trying to match the string using something like this:

    m = re.findall(r"FTPHOST", s)

But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov" part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem!

Thanks in advance for the help,

Martin

--
http://mail.python.org/mailman/listinfo/python-list
Re: Extracting patterns after matching a regex
Martin wrote:
> Hi,
>
> I need to extract a string after matching a regular expression. For example, I have the string...
>
>     s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
>
> and once I match "FTPHOST" I would like to extract "e4ftl01u.ecs.nasa.gov". I am not sure of the best approach to the problem. I had been trying to match the string using something like this:
>
>     m = re.findall(r"FTPHOST", s)
>
> But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov" part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem!
>
> Thanks in advance for the help,

    m = re.search(r"FTPHOST: (.*)", s)
    print m.group(1)
Re: Extracting patterns after matching a regex
On Sep 8, 1:56 pm, Martin mdeka...@gmail.com wrote:
> Hi,
>
> I need to extract a string after matching a regular expression. For example, I have the string...
>
>     s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
>
> and once I match "FTPHOST" I would like to extract "e4ftl01u.ecs.nasa.gov". I am not sure of the best approach to the problem. I had been trying to match the string using something like this:
>
>     m = re.findall(r"FTPHOST", s)
>
> But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov" part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem!
>
> Thanks in advance for the help,
>
> Martin

What you're doing is telling Python "look for all matches of 'FTPHOST'". That doesn't really help you much, because you pretty much expect FTPHOST to be there anyway, so finding it means squat. What you _really_ want to tell it is "look for things shaped like 'FTPHOST: <ftpaddress>', and tell me what <ftpaddress> actually is".

Look here: http://docs.python.org/howto/regex.html#grouping. That'll explain how to accomplish what you're trying to do.
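To make the grouping idea concrete, a small sketch (Python 3 syntax, unlike the Python 2 snippets elsewhere in the thread) comparing findall without and with a capturing group. The use of \S+ for the address is an assumption about what a host name looks like here, not something taken from the original posts.

```python
import re

s = "FTPHOST: e4ftl01u.ecs.nasa.gov"

# Without a group, findall only reports the literal text that matched.
literal_only = re.findall(r"FTPHOST", s)

# With one capturing group, findall returns what the group captured instead.
# \S+ means "one or more non-whitespace characters" (an assumed shape for
# the host name).
hosts = re.findall(r"FTPHOST: (\S+)", s)

print(literal_only)  # ['FTPHOST']
print(hosts)         # ['e4ftl01u.ecs.nasa.gov']
```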
Re: Extracting patterns after matching a regex
Martin mdeka...@gmail.com wrote in message news:5941d8f1-27c0-47d9-8221-d21f07200...@j39g2000yqh.googlegroups.com...
> Hi,
>
> I need to extract a string after matching a regular expression. For example, I have the string...
>
>     s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
>
> and once I match "FTPHOST" I would like to extract "e4ftl01u.ecs.nasa.gov". I am not sure of the best approach to the problem. I had been trying to match the string using something like this:
>
>     m = re.findall(r"FTPHOST", s)
>
> But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov" part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem!

In regular expressions, you match the entire string you are interested in, and parenthesize the parts that you want to parse out of that string. The group() method returns the whole match with group(0), and each of the parenthesized parts with group(n). An example:

    >>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
    >>> import re
    >>> re.search(r'FTPHOST: (.*)',s).group(0)
    'FTPHOST: e4ftl01u.ecs.nasa.gov'
    >>> re.search(r'FTPHOST: (.*)',s).group(1)
    'e4ftl01u.ecs.nasa.gov'

-Mark
Re: Extracting patterns after matching a regex
On Sep 8, 2:15 pm, MRAB pyt...@mrabarnett.plus.com wrote:
> Martin wrote:
> > Hi,
> >
> > I need to extract a string after matching a regular expression. For example, I have the string...
> >
> >     s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> >
> > and once I match "FTPHOST" I would like to extract "e4ftl01u.ecs.nasa.gov". I am not sure of the best approach to the problem. I had been trying to match the string using something like this:
> >
> >     m = re.findall(r"FTPHOST", s)
> >
> > But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov" part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem!
> >
> > Thanks in advance for the help,
>
>     m = re.search(r"FTPHOST: (.*)", s)
>     print m.group(1)

So the .* means to match everything after the regex? That doesn't help in this case, as the string is placed amongst others, for example:

    MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n',
Re: Extracting patterns after matching a regex
On Sep 8, 2:21 pm, Mark Tolonen metolone+gm...@gmail.com wrote:
> Martin mdeka...@gmail.com wrote in message news:5941d8f1-27c0-47d9-8221-d21f07200...@j39g2000yqh.googlegroups.com...
> > Hi,
> >
> > I need to extract a string after matching a regular expression. For example, I have the string...
> >
> >     s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> >
> > and once I match "FTPHOST" I would like to extract "e4ftl01u.ecs.nasa.gov". I am not sure of the best approach to the problem. I had been trying to match the string using something like this:
> >
> >     m = re.findall(r"FTPHOST", s)
> >
> > But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov" part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem!
>
> In regular expressions, you match the entire string you are interested in, and parenthesize the parts that you want to parse out of that string. The group() method returns the whole match with group(0), and each of the parenthesized parts with group(n). An example:
>
>     >>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
>     >>> import re
>     >>> re.search(r'FTPHOST: (.*)',s).group(0)
>     'FTPHOST: e4ftl01u.ecs.nasa.gov'
>     >>> re.search(r'FTPHOST: (.*)',s).group(1)
>     'e4ftl01u.ecs.nasa.gov'
>
> -Mark

I see what you mean regarding the groups. Because my string is nested in amongst others, e.g.

    MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n',

I get the information that follows as well. So is the only way to then parse the new string? I am trying to construct something that is fairly robust, so I am not sure that just printing everything before the \r is the best solution.

Thanks
Re: Extracting patterns after matching a regex
On Sep 8, 2:16 pm, Andreas Tawn andreas.t...@ubisoft.com wrote:
> > Hi,
> >
> > I need to extract a string after matching a regular expression. For example, I have the string...
> >
> >     s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> >
> > and once I match "FTPHOST" I would like to extract "e4ftl01u.ecs.nasa.gov". I am not sure of the best approach to the problem. I had been trying to match the string using something like this:
> >
> >     m = re.findall(r"FTPHOST", s)
> >
> > But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov" part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem!
> >
> > Thanks in advance for the help,
> >
> > Martin
>
> No need for regex.
>
>     s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
>     if "FTPHOST" in s:
>         return s[9:]
>
> Cheers,
>
> Drea

Sorry, perhaps I didn't make it clear enough, so apologies. I only presented the example s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I thought this easily encompassed the problem. The solution presented works fine for this, i.e. re.search(r'FTPHOST: (.*)',s).group(1). But when I used this on the actual file I am trying to parse, I realised it is slightly more complicated, as this also pulls out other information. For example, it prints

    e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',

etc. So I need to find a way to stop it before the \r. Slicing the string wouldn't work in this scenario, as I can envisage a situation where the string length increases, and I would prefer not to keep having to change the string.

Many thanks
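One possible way to "stop before the \r" without slicing at a fixed length is to capture only characters that are not line terminators. The sketch below is Python 3 and assumes the email text holds real carriage returns and newlines (as it would if read with f.read() rather than str(f.readlines())). The helper name field is invented for illustration.

```python
import re

# A fragment of the email as plain text: real CR/LF characters, not the
# two-character sequences backslash-r and backslash-n.
text = (
    "MEDIAFORMAT: FILEFORMAT\r\n"
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
)

def field(name, text):
    """Return the value after 'NAME: ' up to the end of that line, or None."""
    # The character class excludes CR and LF, so the capture ends at the line
    # break no matter how long the value is. re.MULTILINE anchors ^ at each
    # line start, and re.escape protects any special characters in the name.
    pattern = r"^" + re.escape(name) + r": ([^\r\n]*)"
    m = re.search(pattern, text, re.MULTILINE)
    return m.group(1).strip() if m else None

ftphost = field("FTPHOST", text)
ftpdir = field("FTPDIR", text)
print(ftphost)
print(ftpdir)
```

Returning None for a missing field, rather than letting .group(1) fail on a non-match, is a design choice made here for robustness. The original posts do not say how missing fields should be handled.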
Re: Extracting patterns after matching a regex
On Sep 8, 3:21 pm, nn prueba...@latinmail.com wrote:
> On Sep 8, 9:55 am, Mart. mdeka...@gmail.com wrote:
> > On Sep 8, 2:16 pm, Andreas Tawn andreas.t...@ubisoft.com wrote:
> > > > Hi,
> > > >
> > > > I need to extract a string after matching a regular expression. For example, I have the string...
> > > >
> > > >     s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> > > >
> > > > and once I match "FTPHOST" I would like to extract "e4ftl01u.ecs.nasa.gov". I am not sure of the best approach to the problem. I had been trying to match the string using something like this:
> > > >
> > > >     m = re.findall(r"FTPHOST", s)
> > > >
> > > > But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov" part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem!
> > > >
> > > > Thanks in advance for the help,
> > > >
> > > > Martin
> > >
> > > No need for regex.
> > >
> > >     s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> > >     if "FTPHOST" in s:
> > >         return s[9:]
> > >
> > > Cheers,
> > >
> > > Drea
> >
> > Sorry, perhaps I didn't make it clear enough, so apologies. I only presented the example s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I thought this easily encompassed the problem. The solution presented works fine for this, i.e. re.search(r'FTPHOST: (.*)',s).group(1). But when I used this on the actual file I am trying to parse, I realised it is slightly more complicated, as this also pulls out other information. For example, it prints
> >
> >     e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
> >
> > etc. So I need to find a way to stop it before the \r. Slicing the string wouldn't work in this scenario, as I can envisage a situation where the string length increases, and I would prefer not to keep having to change the string.
> >
> > Many thanks
>
> It is not clear from your post what the input is really like. But just guessing, this might work:
>
>     >>> print s
>     'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n','FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n','Ftp Pull Download Links: \r\n'
>     >>> re.search(r'FTPHOST: (.*?)\\r',s).group(1)
>     'e4ftl01u.ecs.nasa.gov'

Except, I'm assuming, the OP's getting the data from a (Windows-formatted) file, so the \r shouldn't be escaped in the regex:

    >>> re.search(r'FTPHOST: (.*?)\r',s).group(1)
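The difference between the two patterns comes down to whether s holds a real carriage return or the two characters backslash and "r". A minimal sketch in Python 3 follows. The two-line list is a shortened stand-in for the email, and "".join(lines) is used as a stand-in for what f.read() returns.

```python
import re

lines = [
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n",
    "FTPDIR: /PullDir/0301872638CySfQB\r\n",
]

as_repr = str(lines)      # text of the list literal: holds backslash + 'r'
as_text = "".join(lines)  # the file contents: holds a real carriage return

# In a raw-string pattern, \\r matches a literal backslash followed by 'r',
# which only occurs in the repr form.
host_from_repr = re.search(r"FTPHOST: (.*?)\\r", as_repr).group(1)

# \r matches an actual carriage return, which only occurs in the real text.
host_from_text = re.search(r"FTPHOST: (.*?)\r", as_text).group(1)

# Each pattern fails on the other form of the data.
cross_1 = re.search(r"FTPHOST: (.*?)\\r", as_text)
cross_2 = re.search(r"FTPHOST: (.*?)\r", as_repr)

print(host_from_repr, host_from_text, cross_1, cross_2)
```

One caveat when porting to Python 3: opening a file in text mode with the default newline setting translates \r\n to \n on reading, so the \r pattern would find nothing unless the file is opened with newline='' or the pattern stops at \n instead.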
Re: Extracting patterns after matching a regex
Mart. wrote:
> On Sep 8, 3:14 pm, Andreas Tawn andreas.t...@ubisoft.com wrote:
> > > > > Hi,
> > > > >
> > > > > I need to extract a string after matching a regular expression. For example, I have the string...
> > > > >
> > > > >     s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> > > > >
> > > > > and once I match "FTPHOST" I would like to extract "e4ftl01u.ecs.nasa.gov". I am not sure of the best approach to the problem. I had been trying to match the string using something like this:
> > > > >
> > > > >     m = re.findall(r"FTPHOST", s)
> > > > >
> > > > > But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov" part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem!
> > > > >
> > > > > Thanks in advance for the help,
> > > > >
> > > > > Martin
> > > >
> > > > No need for regex.
> > > >
> > > >     s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> > > >     if "FTPHOST" in s:
> > > >         return s[9:]
> > > >
> > > > Cheers,
> > > >
> > > > Drea
> > >
> > > Sorry, perhaps I didn't make it clear enough, so apologies. I only presented the example s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I thought this easily encompassed the problem. The solution presented works fine for this, i.e. re.search(r'FTPHOST: (.*)',s).group(1). But when I used this on the actual file I am trying to parse, I realised it is slightly more complicated, as this also pulls out other information. For example, it prints
> > >
> > >     e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
> > >
> > > etc. So I need to find a way to stop it before the \r. Slicing the string wouldn't work in this scenario, as I can envisage a situation where the string length increases, and I would prefer not to keep having to change the string.
> >
> > If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, then s[0].split(":")[1].strip() will work.
>
> It is an email which contains information before and after the main section I am interested in, namely...
>
>     FINISHED: 09/07/2009 08:42:31
>
>     MEDIATYPE: FtpPull
>     MEDIAFORMAT: FILEFORMAT
>     FTPHOST: e4ftl01u.ecs.nasa.gov
>     FTPDIR: /PullDir/0301872638CySfQB
>     Ftp Pull Download Links:
>     ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
>     Down load ZIP file of packaged order:
>     ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
>     FTPEXPR: 09/12/2009 08:42:31
>     MEDIA 1 of 1
>     MEDIAID:
>
> I have been doing this to turn the email into a string:
>
>     email = sys.argv[1]
>     f = open(email, 'r')
>     s = str(f.readlines())

To me that seems a strange thing to do. You could just read the entire file as a string:

    f = open(email, 'r')
    s = f.read()

> so FTPHOST isn't the first element, it is just part of a larger string. When I turn the email into a string it looks like...
>
>     'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
>
> So I'm not sure splitting it like you suggested works in this case.
Re: Extracting patterns after matching a regex
On Sep 8, 3:53 pm, MRAB pyt...@mrabarnett.plus.com wrote: Mart. wrote: On Sep 8, 3:14 pm, Andreas Tawn andreas.t...@ubisoft.com wrote: Hi, I need to extract a string after a matching a regular expression. For example I have the string... s = FTPHOST: e4ftl01u.ecs.nasa.gov and once I match FTPHOST I would like to extract e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the problem, I had been trying to match the string using something like this: m = re.findall(rFTPHOST, s) But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem! Thanks in advance for the help, Martin No need for regex. s = FTPHOST: e4ftl01u.ecs.nasa.gov If FTPHOST in s: return s[9:] Cheers, Drea Sorry perhaps I didn't make it clear enough, so apologies. I only presented the example s = FTPHOST: e4ftl01u.ecs.nasa.gov as I thought this easily encompassed the problem. The solution presented works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But when I used this on the actual file I am trying to parse I realised it is slightly more complicated as this also pulls out other information, for example it prints e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/ 0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n', etc. So I need to find a way to stop it before the \r slicing the string wouldn't work in this scenario as I can envisage a situation where the string lenght increases and I would prefer not to keep having to change the string. If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, then s[0].split(:)[1].strip() will work. It is an email which contains information before and after the main section I am interested in, namely... 
FINISHED: 09/07/2009 08:42:31 MEDIATYPE: FtpPull MEDIAFORMAT: FILEFORMAT FTPHOST: e4ftl01u.ecs.nasa.gov FTPDIR: /PullDir/0301872638CySfQB Ftp Pull Download Links: ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB Down load ZIP file of packaged order: ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip FTPEXPR: 09/12/2009 08:42:31 MEDIA 1 of 1 MEDIAID: I have been doing this to turn the email into a string email = sys.argv[1] f = open(email, 'r') s = str(f.readlines()) To me that seems a strange thing to do. You could just read the entire file as a string: f = open(email, 'r') s = f.read() so FTPHOST isn't the first element, it is just part of a larger string. When I turn the email into a string it looks like... 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r \n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n', So not sure splitting it like you suggested works in this case. Within the file are a list of files, e.g. TOTAL FILES: 2 FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf FILESIZE: 11028908 FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml FILESIZE: 18975 and what i want to do is get the ftp address from the file and collect these files to pull down from the web e.g. MOD13A2.A2007033.h17v08.005.2007101023605.hdf MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml Thus far I have #!/usr/bin/env python import sys import re import urllib email = sys.argv[1] f = open(email, 'r') s = str(f.readlines()) m = re.findall(rMOD\.\.h..v..\.005\..\ \, s) ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1) ftpdir = re.search(r'FTPDIR: (.*?)\\r',s).group(1) url = 'ftp://' + ftphost + ftpdir for i in xrange(len(m)): print i, ':', len(m) file1 = m[i][:-4] # remove xml bit. 
file2 = m[i] urllib.urlretrieve(url, file1) urllib.urlretrieve(url, file2) which works. Clearly my match for the MOD13A2* files isn't ideal, I guess, but they will always occupy those dimensions, so it should work. Any suggestions on how to improve this are appreciated. Thanks.
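For what it's worth, the goal described above (take the FTP address from the email and work out which files to pull) can be sketched without the str(f.readlines()) step. This is only a sketch in Python 3, not the poster's script: SAMPLE and build_urls are made-up names, the sample text is modelled on the email quoted in this thread, and it assumes each file sits directly under FTPDIR on FTPHOST, which the thread does not confirm.

```python
import re

# Sample text modelled on the order email quoted in this thread (assumption).
SAMPLE = (
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
    "TOTAL FILES: 2\r\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\r\n"
    "\t\tFILESIZE: 11028908\r\n"
    "\r\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\r\n"
    "\t\tFILESIZE: 18975\r\n"
)


def build_urls(text):
    """Return one ftp:// URL per FILENAME entry (hypothetical helper)."""
    # \S+ stops at the first whitespace character, so the trailing \r\n
    # never ends up in the captured value.
    host = re.search(r"FTPHOST:\s*(\S+)", text).group(1)
    directory = re.search(r"FTPDIR:\s*(\S+)", text).group(1)
    names = re.findall(r"FILENAME:\s*(\S+)", text)
    # Assumption: every file lives directly under FTPDIR.
    return ["ftp://%s%s/%s" % (host, directory, name) for name in names]


urls = build_urls(SAMPLE)
for url in urls:
    print(url)
```

Taking the names from the FILENAME: lines avoids having to describe the MOD13A2 naming pattern in a regex at all. The actual download is left out here because it needs network access.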
RE: Extracting patterns after matching a regex
> > > Hi,
> > >
> > > I need to extract a string after matching a regular expression. For example I have the string...
> > >
> > > s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> > >
> > > and once I match FTPHOST I would like to extract e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the problem, I had been trying to match the string using something like this:
> > >
> > > m = re.findall(r"FTPHOST", s)
> > >
> > > But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem!
> > >
> > > Thanks in advance for the help,
> > >
> > > Martin
> >
> > No need for regex.
> >
> > s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> > if "FTPHOST" in s:
> >     return s[9:]
> >
> > Cheers,
> >
> > Drea
>
> Sorry, perhaps I didn't make it clear enough, so apologies. I only presented the example s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I thought this easily encompassed the problem. The solution presented works fine for this, i.e. re.search(r'FTPHOST: (.*)',s).group(1). But when I used this on the actual file I am trying to parse I realised it is slightly more complicated, as this also pulls out other information. For example it prints
>
> e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
>
> etc. So I need to find a way to stop it before the \r. Slicing the string wouldn't work in this scenario, as I can envisage a situation where the string length increases and I would prefer not to keep having to change the string.

If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, then s[0].split(":")[1].strip() will work.
Re: Extracting patterns after matching a regex
On Sep 8, 3:14 pm, Andreas Tawn andreas.t...@ubisoft.com wrote: Hi, I need to extract a string after a matching a regular expression. For example I have the string... s = FTPHOST: e4ftl01u.ecs.nasa.gov and once I match FTPHOST I would like to extract e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the problem, I had been trying to match the string using something like this: m = re.findall(rFTPHOST, s) But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem! Thanks in advance for the help, Martin No need for regex. s = FTPHOST: e4ftl01u.ecs.nasa.gov If FTPHOST in s: return s[9:] Cheers, Drea Sorry perhaps I didn't make it clear enough, so apologies. I only presented the example s = FTPHOST: e4ftl01u.ecs.nasa.gov as I thought this easily encompassed the problem. The solution presented works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But when I used this on the actual file I am trying to parse I realised it is slightly more complicated as this also pulls out other information, for example it prints e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/ 0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n', etc. So I need to find a way to stop it before the \r slicing the string wouldn't work in this scenario as I can envisage a situation where the string lenght increases and I would prefer not to keep having to change the string. If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, then s[0].split(:)[1].strip() will work. It is an email which contains information before and after the main section I am interested in, namely... 
FINISHED: 09/07/2009 08:42:31 MEDIATYPE: FtpPull MEDIAFORMAT: FILEFORMAT FTPHOST: e4ftl01u.ecs.nasa.gov FTPDIR: /PullDir/0301872638CySfQB Ftp Pull Download Links: ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB Down load ZIP file of packaged order: ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip FTPEXPR: 09/12/2009 08:42:31 MEDIA 1 of 1 MEDIAID: I have been doing this to turn the email into a string email = sys.argv[1] f = open(email, 'r') s = str(f.readlines()) so FTPHOST isn't the first element, it is just part of a larger string. When I turn the email into a string it looks like... 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r \n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n', So not sure splitting it like you suggested works in this case. Thanks -- http://mail.python.org/mailman/listinfo/python-list
Re: Re: Extracting patterns after matching a regex
Mart. wrote:
> snip
> I have been doing this to turn the email into a string
>
> email = sys.argv[1]
> f = open(email, 'r')
> s = str(f.readlines())
>
> so FTPHOST isn't the first element, it is just part of a larger string. When I turn the email into a string it looks like...
>
> 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
> snip

The mistake I see is trying to turn a list into a string, just so you can try to parse it back again. Just write a loop that iterates through the list that readlines() returns.

DaveA
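A minimal sketch of the loop suggested above, in Python 3. The names SAMPLE and find_field are made up for illustration, and the list built with splitlines(keepends=True) only stands in for what f.readlines() would return for the email.

```python
# Stand-in for the email text; modelled on the excerpt quoted in this thread.
SAMPLE = (
    "FINISHED: 09/07/2009 08:42:31\r\n"
    "\r\n"
    "MEDIATYPE: FtpPull\r\n"
    "MEDIAFORMAT: FILEFORMAT\r\n"
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
)


def find_field(lines, name):
    """Return the text after 'NAME:' on the first line starting with it."""
    prefix = name + ":"
    for line in lines:
        stripped = line.strip()  # drops the trailing \r\n and any indentation
        if stripped.startswith(prefix):
            return stripped[len(prefix):].strip()
    return None  # no such field in the input


# Plays the role of f.readlines(): a list of lines with endings kept.
lines = SAMPLE.splitlines(keepends=True)

ftphost = find_field(lines, "FTPHOST")
ftpdir = find_field(lines, "FTPDIR")
print("ftp://" + ftphost + ftpdir)
```

Because each line is handled on its own, there is no \r to stop at and no escaping to worry about.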
Re: Extracting patterns after matching a regex
Mart. wrote:
> On Sep 8, 2:15 pm, MRAB pyt...@mrabarnett.plus.com wrote:
> > Martin wrote:
> > > Hi,
> > >
> > > I need to extract a string after matching a regular expression.

Whether or not you need re is an issue to be determined.

> > > For example I have the string...
> > >
> > > s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> > >
> > > and once I match FTPHOST I would like to extract e4ftl01u.ecs.nasa.gov.

Just split the string on ': ' and take the second part. Or find the position of the space and slice the remainder.

> so the .* means to match everything after the regex? That doesn't help in this case

It helps in the case you presented.

> as the string is placed amongst others for example.
>
> MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n',

What you show above is a tuple of strings. Scan the members looking for s.startswith('FTPHOST:') and apply previous answer. Or if above is actually meant to be one string (with quotes omitted), split on ',' and apply previous answer.

tjr
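The three suggestions above might look roughly like this in Python 3. The variable names and the contents of the members tuple are illustrative only.

```python
s = "FTPHOST: e4ftl01u.ecs.nasa.gov"

# Split once on ': ' and take the second part.
host_by_split = s.split(": ", 1)[1]

# Or find the position of the first space and slice the remainder.
host_by_slice = s[s.find(" ") + 1:]

# If the data really is a tuple of strings, scan the members for the one
# that starts with the key, then apply the same split to it.
members = (
    "MEDIATYPE: FtpPull\r\n",
    "MEDIAFORMAT: FILEFORMAT\r\n",
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n",
    "FTPDIR: /PullDir/0301872638CySfQB\r\n",
)
host_from_tuple = None
for member in members:
    if member.startswith("FTPHOST:"):
        host_from_tuple = member.split(": ", 1)[1].strip()
        break

print(host_by_split, host_by_slice, host_from_tuple)
```

Passing 1 as the second argument to split keeps any later ': ' inside the value intact.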
Re: Extracting patterns after matching a regex
On Sep 8, 9:55 am, Mart. mdeka...@gmail.com wrote: On Sep 8, 2:16 pm, Andreas Tawn andreas.t...@ubisoft.com wrote: Hi, I need to extract a string after a matching a regular expression. For example I have the string... s = FTPHOST: e4ftl01u.ecs.nasa.gov and once I match FTPHOST I would like to extract e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the problem, I had been trying to match the string using something like this: m = re.findall(rFTPHOST, s) But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem! Thanks in advance for the help, Martin No need for regex. s = FTPHOST: e4ftl01u.ecs.nasa.gov If FTPHOST in s: return s[9:] Cheers, Drea Sorry perhaps I didn't make it clear enough, so apologies. I only presented the example s = FTPHOST: e4ftl01u.ecs.nasa.gov as I thought this easily encompassed the problem. The solution presented works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But when I used this on the actual file I am trying to parse I realised it is slightly more complicated as this also pulls out other information, for example it prints e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/ 0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n', etc. So I need to find a way to stop it before the \r slicing the string wouldn't work in this scenario as I can envisage a situation where the string lenght increases and I would prefer not to keep having to change the string. Many thanks It is not clear from your post what the input is really like. 
But just guessing this might work:

>>> print s
'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n'
>>> re.search(r'FTPHOST: (.*?)\\r',s).group(1)
'e4ftl01u.ecs.nasa.gov'
RE: Extracting patterns after matching a regex
> Hi,
>
> I need to extract a string after matching a regular expression. For example I have the string...
>
> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
>
> and once I match FTPHOST I would like to extract e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the problem, I had been trying to match the string using something like this:
>
> m = re.findall(r"FTPHOST", s)
>
> But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem!
>
> Thanks in advance for the help,
>
> Martin

No need for regex.

s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
if "FTPHOST" in s:
    return s[9:]

Cheers,

Drea
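The 9 in s[9:] is simply the length of "FTPHOST: ". A small variation, sketched in Python 3, derives that offset from the prefix instead of hard-coding it. PREFIX and after_prefix are made-up names, and the sketch uses startswith rather than the "in" test above, so it only accepts strings that begin with the prefix.

```python
PREFIX = "FTPHOST: "  # 7 letters, a colon and a space: 9 characters


def after_prefix(s, prefix=PREFIX):
    """Return whatever follows prefix, or None if s does not start with it."""
    if s.startswith(prefix):
        # len(prefix) replaces the hard-coded 9, so a different key
        # needs no change to the slice.
        return s[len(prefix):]
    return None


host = after_prefix("FTPHOST: e4ftl01u.ecs.nasa.gov")
print(host)
```

The length of the value after the prefix does not matter to the slice. Only the prefix length does.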
Re: Extracting patterns after matching a regex
Mart. wrote: On Sep 8, 3:53 pm, MRAB pyt...@mrabarnett.plus.com wrote: Mart. wrote: On Sep 8, 3:14 pm, Andreas Tawn andreas.t...@ubisoft.com wrote: Hi, I need to extract a string after a matching a regular expression. For example I have the string... s = FTPHOST: e4ftl01u.ecs.nasa.gov and once I match FTPHOST I would like to extract e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the problem, I had been trying to match the string using something like this: m = re.findall(rFTPHOST, s) But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem! Thanks in advance for the help, Martin No need for regex. s = FTPHOST: e4ftl01u.ecs.nasa.gov If FTPHOST in s: return s[9:] Cheers, Drea Sorry perhaps I didn't make it clear enough, so apologies. I only presented the example s = FTPHOST: e4ftl01u.ecs.nasa.gov as I thought this easily encompassed the problem. The solution presented works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But when I used this on the actual file I am trying to parse I realised it is slightly more complicated as this also pulls out other information, for example it prints e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/ 0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n', etc. So I need to find a way to stop it before the \r slicing the string wouldn't work in this scenario as I can envisage a situation where the string lenght increases and I would prefer not to keep having to change the string. If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, then s[0].split(:)[1].strip() will work. It is an email which contains information before and after the main section I am interested in, namely... 
FINISHED: 09/07/2009 08:42:31 MEDIATYPE: FtpPull MEDIAFORMAT: FILEFORMAT FTPHOST: e4ftl01u.ecs.nasa.gov FTPDIR: /PullDir/0301872638CySfQB Ftp Pull Download Links: ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB Down load ZIP file of packaged order: ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip FTPEXPR: 09/12/2009 08:42:31 MEDIA 1 of 1 MEDIAID: I have been doing this to turn the email into a string email = sys.argv[1] f = open(email, 'r') s = str(f.readlines()) To me that seems a strange thing to do. You could just read the entire file as a string: f = open(email, 'r') s = f.read() so FTPHOST isn't the first element, it is just part of a larger string. When I turn the email into a string it looks like... 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r \n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n', So not sure splitting it like you suggested works in this case. Within the file are a list of files, e.g. TOTAL FILES: 2 FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf FILESIZE: 11028908 FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml FILESIZE: 18975 and what i want to do is get the ftp address from the file and collect these files to pull down from the web e.g. MOD13A2.A2007033.h17v08.005.2007101023605.hdf MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml Thus far I have #!/usr/bin/env python import sys import re import urllib email = sys.argv[1] f = open(email, 'r') s = str(f.readlines()) m = re.findall(rMOD\.\.h..v..\.005\..\ \, s) ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1) ftpdir = re.search(r'FTPDIR: (.*?)\\r',s).group(1) url = 'ftp://' + ftphost + ftpdir for i in xrange(len(m)): print i, ':', len(m) file1 = m[i][:-4] # remove xml bit. 
file2 = m[i] urllib.urlretrieve(url, file1) urllib.urlretrieve(url, file2) which works, clearly my match for the MOD13A2* files isn't ideal I guess, but they will always occupt those dimensions, so it should work. Any suggestions on how to improve this are appreciated. Suppose the file contains your example text above. Using 'readlines' returns a list of the lines: f = open(email, 'r') lines = f.readlines() lines ['TOTAL FILES: 2\n', '\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n', '\t\tFILESIZE: 11028908\n', '\n', '\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n', '\t\tFILESIZE: 18975\n'] Using 'str' on that list then converts it to s string _representation_ of that list: str(lines) ['TOTAL FILES: 2\\n', '\\t\\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\\n', '\\t\\tFILESIZE: 11028908\\n', '\\n', '\\t\\tFILENAME:
Re: Extracting patterns after matching a regex
On Sep 8, 10:27 am, pdpi pdpinhe...@gmail.com wrote: On Sep 8, 3:21 pm, nn prueba...@latinmail.com wrote: On Sep 8, 9:55 am, Mart. mdeka...@gmail.com wrote: On Sep 8, 2:16 pm, Andreas Tawn andreas.t...@ubisoft.com wrote: Hi, I need to extract a string after a matching a regular expression. For example I have the string... s = FTPHOST: e4ftl01u.ecs.nasa.gov and once I match FTPHOST I would like to extract e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the problem, I had been trying to match the string using something like this: m = re.findall(rFTPHOST, s) But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem! Thanks in advance for the help, Martin No need for regex. s = FTPHOST: e4ftl01u.ecs.nasa.gov If FTPHOST in s: return s[9:] Cheers, Drea Sorry perhaps I didn't make it clear enough, so apologies. I only presented the example s = FTPHOST: e4ftl01u.ecs.nasa.gov as I thought this easily encompassed the problem. The solution presented works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But when I used this on the actual file I am trying to parse I realised it is slightly more complicated as this also pulls out other information, for example it prints e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/ 0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n', etc. So I need to find a way to stop it before the \r slicing the string wouldn't work in this scenario as I can envisage a situation where the string lenght increases and I would prefer not to keep having to change the string. Many thanks It is not clear from your post what the input is really like. 
But just guessing this might work:

>>> print s
'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n'
>>> re.search(r'FTPHOST: (.*?)\\r',s).group(1)
'e4ftl01u.ecs.nasa.gov'

Except, I'm assuming, the OP's getting the data from a (Windows-formatted) file, so \r\n shouldn't be escaped in the regex:

re.search(r'FTPHOST: (.*?)\r',s).group(1)

I am just playing the guessing game like everybody else here. Since the OP didn't use re.DOTALL and was getting more than one line for .* I assumed that the \n was quite literally '\' and 'n'.
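The guess above can be checked with a short Python 3 sketch. The sample text and variable names are made up; raw_text plays the role of f.read() and as_repr the role of str(f.readlines()).

```python
import re

raw_text = (
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
)
lines = raw_text.splitlines(keepends=True)  # what f.readlines() would give
as_repr = str(lines)                        # what str(f.readlines()) gives

# The repr spells each line ending as the characters '\', 'r', '\', 'n'.
has_literal_backslash = "\\r\\n" in as_repr
has_real_cr = "\r" in as_repr

# On the repr, the pattern needs an escaped backslash to stop at '\r'.
host_from_repr = re.search(r"FTPHOST: (.*?)\\r", as_repr).group(1)

# On the real text, \r in the pattern matches the carriage return itself.
host_from_text = re.search(r"FTPHOST: (.*?)\r", raw_text).group(1)

# Without re.DOTALL, '.' stops at a real newline but not at a backslash,
# so a greedy .* runs to the end of the repr and only to the end of the
# line in the real text.
greedy_on_repr = re.search(r"FTPHOST: (.*)", as_repr).group(1)
greedy_on_text = re.search(r"FTPHOST: (.*)", raw_text).group(1)

print(host_from_repr, host_from_text)
```

This matches the reasoning above: getting "more than one line" from .* without re.DOTALL is what you would expect if the string holds literal backslashes rather than real line endings.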
Re: Extracting patterns after matching a regex
On Sep 8, 10:25 am, Mart. mdeka...@gmail.com wrote: On Sep 8, 3:21 pm, nn prueba...@latinmail.com wrote: On Sep 8, 9:55 am, Mart. mdeka...@gmail.com wrote: On Sep 8, 2:16 pm, Andreas Tawn andreas.t...@ubisoft.com wrote: Hi, I need to extract a string after a matching a regular expression. For example I have the string... s = FTPHOST: e4ftl01u.ecs.nasa.gov and once I match FTPHOST I would like to extract e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the problem, I had been trying to match the string using something like this: m = re.findall(rFTPHOST, s) But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem! Thanks in advance for the help, Martin No need for regex. s = FTPHOST: e4ftl01u.ecs.nasa.gov If FTPHOST in s: return s[9:] Cheers, Drea Sorry perhaps I didn't make it clear enough, so apologies. I only presented the example s = FTPHOST: e4ftl01u.ecs.nasa.gov as I thought this easily encompassed the problem. The solution presented works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But when I used this on the actual file I am trying to parse I realised it is slightly more complicated as this also pulls out other information, for example it prints e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/ 0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n', etc. So I need to find a way to stop it before the \r slicing the string wouldn't work in this scenario as I can envisage a situation where the string lenght increases and I would prefer not to keep having to change the string. Many thanks It is not clear from your post what the input is really like. 
But just guessing this might work:

>>> print s
'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n'
>>> re.search(r'FTPHOST: (.*?)\\r',s).group(1)
'e4ftl01u.ecs.nasa.gov'

Hi,

That does work. So the \ escapes the \r, does this tell it to stop when it reaches the \r?

Thanks

Indeed.
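To spell out why it stops there: the \\r in the raw pattern matches a literal backslash followed by r, and the non-greedy (.*?) takes as little as it can before that. A Python 3 sketch with a made-up sample string, written as a raw string so it holds literal backslashes the way str(f.readlines()) output does:

```python
import re

# Raw string: \r and \n here are two characters each, not control characters.
s = r"'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n'"

# Non-greedy: stops at the first backslash-r after FTPHOST.
lazy = re.search(r"FTPHOST: (.*?)\\r", s).group(1)

# Greedy: runs on to the last backslash-r in the string.
greedy = re.search(r"FTPHOST: (.*)\\r", s).group(1)

# Alternative without a quantifier modifier: take everything up to the
# first backslash.
no_backslash = re.search(r"FTPHOST: ([^\\]*)", s).group(1)

print(lazy)
print(greedy)
```

The negated character class is an alternative some people find easier to read than the ? modifier. For this input it captures the same host name.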
Re: Extracting patterns after matching a regex
On Sep 8, 11:19 am, Dave Angel da...@ieee.org wrote:
> Mart. wrote:
> > snip
> > I have been doing this to turn the email into a string
> >
> > email = sys.argv[1]
> > f = open(email, 'r')
> > s = str(f.readlines())
> >
> > so FTPHOST isn't the first element, it is just part of a larger string. When I turn the email into a string it looks like...
> >
> > 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
> > snip
>
> The mistake I see is trying to turn a list into a string, just so you can try to parse it back again. Just write a loop that iterates through the list that readlines() returns.
>
> DaveA

No kidding.

Instead of this:

s = str(f.readlines())
ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
ftpdir = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
url = 'ftp://' + ftphost + ftpdir

I would have possibly done something like this (not tested):

lines = f.readlines()
header={}
for row in lines:
    key,sep,value = row.partition(':')[2].rstrip()
    header[key.lower()]=value
url = 'ftp://' + header['ftphost'] + header['ftpdir']
Re: Extracting patterns after matching a regex
On Sep 8, 12:16 pm, nn prueba...@latinmail.com wrote:
> On Sep 8, 11:19 am, Dave Angel da...@ieee.org wrote:
> > Mart. wrote:
> > > snip
> > > I have been doing this to turn the email into a string
> > >
> > > email = sys.argv[1]
> > > f = open(email, 'r')
> > > s = str(f.readlines())
> > >
> > > so FTPHOST isn't the first element, it is just part of a larger string. When I turn the email into a string it looks like...
> > >
> > > 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
> > > snip
> >
> > The mistake I see is trying to turn a list into a string, just so you can try to parse it back again. Just write a loop that iterates through the list that readlines() returns.
> >
> > DaveA
>
> No kidding.
>
> Instead of this:
>
> s = str(f.readlines())
> ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
> ftpdir = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
> url = 'ftp://' + ftphost + ftpdir
>
> I would have possibly done something like this (not tested):
>
> lines = f.readlines()
> header={}
> for row in lines:
>     key,sep,value = row.partition(':')[2].rstrip()
>     header[key.lower()]=value
> url = 'ftp://' + header['ftphost'] + header['ftpdir']

Well I said not tested. That would be of course:

lines = f.readlines()
header={}
for row in lines:
    key,sep,value = row.partition(':')
    header[key.lower()]=value.rstrip()
url = 'ftp://' + header['ftphost'] + header['ftpdir']
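One thing to watch with the version above: rstrip() only trims the right-hand side, so the space after the colon stays in the value and would end up inside the URL. A Python 3 sketch of the same idea using strip() on both parts follows. parse_header and the sample lines are made up for illustration, and lines without a colon are skipped, which is an addition to the code as posted.

```python
def parse_header(lines):
    """Build a dict of lower-cased keys to values from 'KEY: value' lines."""
    header = {}
    for row in lines:
        key, sep, value = row.partition(":")
        if not sep:
            continue  # no colon on this line (e.g. a blank line)
        # strip() rather than rstrip(): the value starts with the space
        # that follows the colon, as well as ending in \r\n.
        header[key.strip().lower()] = value.strip()
    return header


# Stand-in for f.readlines(); modelled on the email quoted in this thread.
lines = [
    "FINISHED: 09/07/2009 08:42:31\r\n",
    "\r\n",
    "MEDIATYPE: FtpPull\r\n",
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n",
    "FTPDIR: /PullDir/0301872638CySfQB\r\n",
]

header = parse_header(lines)
url = "ftp://" + header["ftphost"] + header["ftpdir"]
print(url)

# What rstrip() alone would leave in the value.
rstrip_only = " e4ftl01u.ecs.nasa.gov\r\n".rstrip()
```

partition splits on the first colon only, so a value such as the FINISHED timestamp keeps its own colons. As in the original, a key that appears more than once keeps its last value.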
Re: Extracting patterns after matching a regex
On Sep 8, 4:33 pm, MRAB pyt...@mrabarnett.plus.com wrote: Mart. wrote: On Sep 8, 3:53 pm, MRAB pyt...@mrabarnett.plus.com wrote: Mart. wrote: On Sep 8, 3:14 pm, Andreas Tawn andreas.t...@ubisoft.com wrote: Hi, I need to extract a string after a matching a regular expression. For example I have the string... s = FTPHOST: e4ftl01u.ecs.nasa.gov and once I match FTPHOST I would like to extract e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the problem, I had been trying to match the string using something like this: m = re.findall(rFTPHOST, s) But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem! Thanks in advance for the help, Martin No need for regex. s = FTPHOST: e4ftl01u.ecs.nasa.gov If FTPHOST in s: return s[9:] Cheers, Drea Sorry perhaps I didn't make it clear enough, so apologies. I only presented the example s = FTPHOST: e4ftl01u.ecs.nasa.gov as I thought this easily encompassed the problem. The solution presented works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But when I used this on the actual file I am trying to parse I realised it is slightly more complicated as this also pulls out other information, for example it prints e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/ 0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n', etc. So I need to find a way to stop it before the \r slicing the string wouldn't work in this scenario as I can envisage a situation where the string lenght increases and I would prefer not to keep having to change the string. If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, then s[0].split(:)[1].strip() will work. 
It is an email which contains information before and after the main section I am interested in, namely... FINISHED: 09/07/2009 08:42:31 MEDIATYPE: FtpPull MEDIAFORMAT: FILEFORMAT FTPHOST: e4ftl01u.ecs.nasa.gov FTPDIR: /PullDir/0301872638CySfQB Ftp Pull Download Links: ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB Down load ZIP file of packaged order: ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip FTPEXPR: 09/12/2009 08:42:31 MEDIA 1 of 1 MEDIAID: I have been doing this to turn the email into a string email = sys.argv[1] f = open(email, 'r') s = str(f.readlines()) To me that seems a strange thing to do. You could just read the entire file as a string: f = open(email, 'r') s = f.read() so FTPHOST isn't the first element, it is just part of a larger string. When I turn the email into a string it looks like... 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r \n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n', So not sure splitting it like you suggested works in this case. Within the file are a list of files, e.g. TOTAL FILES: 2 FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf FILESIZE: 11028908 FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml FILESIZE: 18975 and what i want to do is get the ftp address from the file and collect these files to pull down from the web e.g. 
MOD13A2.A2007033.h17v08.005.2007101023605.hdf MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml Thus far I have #!/usr/bin/env python import sys import re import urllib email = sys.argv[1] f = open(email, 'r') s = str(f.readlines()) m = re.findall(rMOD\.\.h..v..\.005\..\ \, s) ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1) ftpdir = re.search(r'FTPDIR: (.*?)\\r',s).group(1) url = 'ftp://' + ftphost + ftpdir for i in xrange(len(m)): print i, ':', len(m) file1 = m[i][:-4] # remove xml bit. file2 = m[i] urllib.urlretrieve(url, file1) urllib.urlretrieve(url, file2) which works, clearly my match for the MOD13A2* files isn't ideal I guess, but they will always occupt those dimensions, so it should work. Any suggestions on how to improve this are appreciated. Suppose the file contains your example text above. Using 'readlines' returns a list of the lines: f = open(email, 'r') lines = f.readlines() lines ['TOTAL FILES: 2\n', '\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n', '\t\tFILESIZE: 11028908\n', '\n', '\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n', '\t\tFILESIZE: 18975\n'] Using 'str' on that list then converts
Re: Extracting patterns after matching a regex
On Sep 8, 3:21 pm, nn prueba...@latinmail.com wrote:
On Sep 8, 9:55 am, Mart. mdeka...@gmail.com wrote:
On Sep 8, 2:16 pm, Andreas Tawn andreas.t...@ubisoft.com wrote:

Hi,

I need to extract a string after matching a regular expression. For example I have the string...

    s = 'FTPHOST: e4ftl01u.ecs.nasa.gov'

and once I match FTPHOST I would like to extract e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the problem. I had been trying to match the string using something like this:

    m = re.findall(r'FTPHOST', s)

But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov part. Perhaps I need to find the string and then split it? I had some help with a similar problem, but now I don't seem to be able to transfer that to this problem!

Thanks in advance for the help,

Martin

No need for regex.

    s = 'FTPHOST: e4ftl01u.ecs.nasa.gov'
    if 'FTPHOST' in s:
        return s[9:]

Cheers,

Drea

Sorry, perhaps I didn't make it clear enough, so apologies. I only presented the example s = 'FTPHOST: e4ftl01u.ecs.nasa.gov' as I thought this easily encompassed the problem. The solution presented works fine for this, i.e. re.search(r'FTPHOST: (.*)',s).group(1). But when I used this on the actual file I am trying to parse I realised it is slightly more complicated, as this also pulls out other information. For example it prints

e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',

etc. So I need to find a way to stop it before the \r. Slicing the string wouldn't work in this scenario, as I can envisage a situation where the string length increases and I would prefer not to keep having to change the string.

Many thanks

It is not clear from your post what the input is really like.
But just guessing, this might work:

    >>> print s
    'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n','FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n','Ftp Pull Download Links: \r\n'
    >>> re.search(r'FTPHOST: (.*?)\\r',s).group(1)
    'e4ftl01u.ecs.nasa.gov'

Hi,

That does work. So the \ escapes the \r; does this tell it to stop when it reaches the \r?

Thanks

--
http://mail.python.org/mailman/listinfo/python-list
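A small sketch of why that pattern stops where it does, written for Python 3 (the thread itself uses Python 2). The sample lines are a hypothetical stand-in for what f.readlines() returns on the email file. Calling str() on a list produces its repr, so each carriage return shows up as the two characters backslash and "r", and r'\\r' matches that literal pair; the non-greedy (.*?) then ends at the first one.

```python
import re

# Hypothetical sample standing in for f.readlines() on the email file.
lines = [
    'MEDIATYPE: FtpPull\r\n',
    'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
    'FTPDIR: /PullDir/0301872638CySfQB\r\n',
]

# str() on a list gives its repr: every carriage return is now written out
# as a backslash followed by the letter 'r'.
as_repr = str(lines)

# r'\\r' matches a literal backslash followed by 'r', and the non-greedy
# group stops at the first place that pair appears.
host_from_repr = re.search(r'FTPHOST: (.*?)\\r', as_repr).group(1)

# Joining the lines keeps the real line-ending characters instead.
# \S+ matches up to the first whitespace, which here is the real '\r'.
as_text = ''.join(lines)
host_from_text = re.search(r'FTPHOST: (\S+)', as_text).group(1)

print(host_from_repr)
print(host_from_text)
```

Both searches pull out the same host name here; the second form does not depend on how the list happens to be printed.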
Re: Extracting patterns after matching a regex
Mart. wrote:

If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, then s[0].split(':')[1].strip() will work.

It is an email which contains information before and after the main section I am interested in, namely...

FINISHED: 09/07/2009 08:42:31

MEDIATYPE: FtpPull
MEDIAFORMAT: FILEFORMAT
FTPHOST: e4ftl01u.ecs.nasa.gov
FTPDIR: /PullDir/0301872638CySfQB
Ftp Pull Download Links:
ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
Down load ZIP file of packaged order:
ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
FTPEXPR: 09/12/2009 08:42:31
MEDIA 1 of 1
MEDIAID:

I have been doing this to turn the email into a string

    email = sys.argv[1]
    f = open(email, 'r')
    s = str(f.readlines())

So don't do that. Or rather, scan the list of lines returned by .readlines *before* dumping it all into one line.

Or, try the email module. When the email parser returns a .message.Message instance, msg['FTPHOST'] will give you what you want.

tjr
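One way the "scan the list of lines" suggestion could look, as a minimal Python 3 sketch rather than anything posted in the thread. The helper name parse_fields and its wanted parameter are hypothetical, and the sample list stands in for f.readlines() on the email file.

```python
def parse_fields(lines, wanted=('FTPHOST', 'FTPDIR')):
    """Return a dict of the wanted 'KEY: value' fields found in lines.

    Hypothetical helper for illustration only.
    """
    fields = {}
    for line in lines:
        # Split on the first colon only, so values containing ':' survive.
        key, sep, value = line.partition(':')
        if sep and key.strip() in wanted:
            # strip() removes the leading space and the trailing '\r\n'.
            fields[key.strip()] = value.strip()
    return fields


# Hypothetical sample standing in for f.readlines() on the email file.
sample = [
    'FINISHED: 09/07/2009 08:42:31\r\n',
    '\r\n',
    'MEDIATYPE: FtpPull\r\n',
    'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
    'FTPDIR: /PullDir/0301872638CySfQB\r\n',
    'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n',
]

fields = parse_fields(sample)
url = 'ftp://' + fields['FTPHOST'] + fields['FTPDIR']
print(url)
```

Working line by line means no regex is needed to find where a value ends, and the result does not change if the host name or directory gets longer. The email-module route would only return the field through msg['FTPHOST'] if the parser treats that line as a header, which depends on where the field sits in the message.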