On Sep 8, 4:33 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> Mart. wrote:
> > On Sep 8, 3:53 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> >> Mart. wrote:
> >>> On Sep 8, 3:14 pm, "Andreas Tawn" <andreas.t...@ubisoft.com> wrote:
> >>>>>>> Hi,
> >>>>>>> I need to extract a string after matching a regular expression. For
> >>>>>>> example I have the string...
> >>>>>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> >>>>>>> and once I match "FTPHOST" I would like to extract
> >>>>>>> "e4ftl01u.ecs.nasa.gov". I am not sure as to the best approach to the
> >>>>>>> problem, I had been trying to match the string using something like
> >>>>>>> this:
> >>>>>>> m = re.findall(r"FTPHOST", s)
> >>>>>>> But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov"
> >>>>>>> part. Perhaps I need to find the string and then split it? I had some
> >>>>>>> help with a similar problem, but now I don't seem to be able to
> >>>>>>> transfer that to this problem!
> >>>>>>> Thanks in advance for the help,
> >>>>>>> Martin
> >>>>>> No need for regex.
> >>>>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> >>>>>> if "FTPHOST" in s:
> >>>>>>     return s[9:]
> >>>>>> Cheers,
> >>>>>> Drea
> >>>>> Sorry perhaps I didn't make it clear enough, so apologies. I only
> >>>>> presented the example s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I
> >>>>> thought this easily encompassed the problem. The solution presented
> >>>>> works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
> >>>>> when I used this on the actual file I am trying to parse I realised it
> >>>>> is slightly more complicated as this also pulls out other information,
> >>>>> for example it prints
> >>>>> e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
> >>>>> 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
> >>>>> 0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
> >>>>> etc.
> >>>>> So I need to find a way to stop it before the \r
> >>>>> slicing the string wouldn't work in this scenario as I can envisage a
> >>>>> situation where the string length increases and I would prefer not to
> >>>>> keep having to change the string.
> >>>> If, as Terry suggested, you do have a tuple of strings and the first
> >>>> element has FTPHOST, then s[0].split(":")[1].strip() will work.
> >>> It is an email which contains information before and after the main
> >>> section I am interested in, namely...
> >>> FINISHED: 09/07/2009 08:42:31
> >>> MEDIATYPE: FtpPull
> >>> MEDIAFORMAT: FILEFORMAT
> >>> FTPHOST: e4ftl01u.ecs.nasa.gov
> >>> FTPDIR: /PullDir/0301872638CySfQB
> >>> Ftp Pull Download Links:
> >>> ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
> >>> Down load ZIP file of packaged order:
> >>> ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
> >>> FTPEXPR: 09/12/2009 08:42:31
> >>> MEDIA 1 of 1
> >>> MEDIAID:
> >>> I have been doing this to turn the email into a string
> >>> email = sys.argv[1]
> >>> f = open(email, 'r')
> >>> s = str(f.readlines())
> >> To me that seems a strange thing to do. You could just read the entire
> >> file as a string:
> >
> >> f = open(email, 'r')
> >> s = f.read()
> >
> >>> so FTPHOST isn't the first element, it is just part of a larger
> >>> string. When I turn the email into a string it looks like...
> >>> 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
> >>> 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
> >>> 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r
> >>> \n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down
> >>> load ZIP file of packaged order:\r\n',
> >>> So not sure splitting it like you suggested works in this case.
> >
> > Within the file is a list of files, e.g.
> >
> > TOTAL FILES: 2
> > FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf
> > FILESIZE: 11028908
> >
> > FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
> > FILESIZE: 18975
> >
> > and what I want to do is get the ftp address from the file and collect
> > these files to pull down from the web e.g.
> >
> > MOD13A2.A2007033.h17v08.005.2007101023605.hdf
> > MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
> >
> > Thus far I have
> >
> > #!/usr/bin/env python
> >
> > import sys
> > import re
> > import urllib
> >
> > email = sys.argv[1]
> > f = open(email, 'r')
> > s = str(f.readlines())
> > m = re.findall(r"MOD....\.........\.h..v..\.005\..............\....\....", s)
> >
> > ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
> > ftpdir = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
> > url = 'ftp://' + ftphost + ftpdir
> >
> > for i in xrange(len(m)):
> >
> >     print i, ':', len(m)
> >     file1 = m[i][:-4] # remove xml bit.
> >     file2 = m[i]
> >
> >     urllib.urlretrieve(url, file1)
> >     urllib.urlretrieve(url, file2)
> >
> > which works, clearly my match for the MOD13A2* files isn't ideal I
> > guess, but they will always occupy those dimensions, so it should
> > work. Any suggestions on how to improve this are appreciated.
>
> Suppose the file contains your example text above.
> Using 'readlines' returns a list of the lines:
>
> >>> f = open(email, 'r')
> >>> lines = f.readlines()
> >>> lines
> ['TOTAL FILES: 2\n', '\t\tFILENAME:
> MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n', '\t\tFILESIZE:
> 11028908\n', '\n', '\t\tFILENAME:
> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n', '\t\tFILESIZE:
> 18975\n']
>
> Using 'str' on that list then converts it to a string _representation_
> of that list:
>
> >>> str(lines)
> "['TOTAL FILES: 2\\n', '\\t\\tFILENAME:
> MOD13A2.A2007033.h17v08.005.2007101023605.hdf\\n', '\\t\\tFILESIZE:
> 11028908\\n', '\\n', '\\t\\tFILENAME:
> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\\n', '\\t\\tFILESIZE:
> 18975\\n']"
>
> That just makes parsing a lot more difficult.
>
> It's much easier to just read the entire file as a single string and
> then parse that:
>
> >>> f = open(email, 'r')
> >>> s = f.read()
> >>> s
> 'TOTAL FILES: 2\n\t\tFILENAME:
> MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n\t\tFILESIZE:
> 11028908\n\n\t\tFILENAME:
> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n\t\tFILESIZE: 18975\n'
> >>> import re
> >>> re.findall(r"FILENAME: (.+)", s)
> ['MOD13A2.A2007033.h17v08.005.2007101023605.hdf',
> 'MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml']

If I do it this way, I can't seem to avoid capturing the \r at the end
of the line.

In [26]: m = re.search(r"FTPHOST: (.+)", s)

In [27]: m.group(1)
Out[27]: 'e4ftl01u.ecs.nasa.gov\r'

But if I insert \\r at the end of the pattern, as was previously
suggested, the search no longer matches at all:

In [28]: m = re.search(r"FTPHOST: (.+)\\r", s)

In [29]: m.group(1)
AttributeError: 'NoneType' object has no attribute 'group'

Any thoughts?

Thanks
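
A likely explanation, sketched below under some assumptions: in a raw
string, \\r matches a literal backslash followed by the letter "r". That
text only exists in the output of str(f.readlines()); once the file is
read with f.read(), the line ends with a real carriage return instead, so
the pattern finds nothing and re.search returns None. The sample text here
is a hypothetical stand-in for the e-mail body, and the two options shown
are only suggestions, not the only way to do it.

```python
import re

# Hypothetical sample standing in for the e-mail body, with CRLF line endings.
text = (
    "MEDIAFORMAT: FILEFORMAT\r\n"
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
)

# In a raw string, \\r means a literal backslash followed by "r". That text
# appears in str(list_of_lines), but not in the file contents themselves.
assert re.search(r"FTPHOST: (.+)\\r", text) is None
assert re.search(r"FTPHOST: (.+)\\r", str(text.splitlines(True))) is not None

# Option 1: capture everything up to, but not including, the line ending.
ftphost = re.search(r"FTPHOST: ([^\r\n]+)", text).group(1)

# Option 2: keep the simple pattern and strip trailing whitespace afterwards.
ftpdir = re.search(r"FTPDIR: (.+)", text).group(1).strip()

print(ftphost)  # e4ftl01u.ecs.nasa.gov
print(ftpdir)   # /PullDir/0301872638CySfQB
```

Either option avoids hard-coding the line ending, so the same code should
keep working if the file ever arrives with plain \n endings.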