Re: Extracting patterns after matching a regex

2009-09-11 Thread Mart.

On Sep 9, 4:58 pm, Al Fansome al_fans...@hotmail.com wrote:

Re: Extracting patterns after matching a regex

2009-09-09 Thread Mart.

On Sep 8, 4:33 pm, MRAB pyt...@mrabarnett.plus.com wrote:

Re: Extracting patterns after matching a regex

2009-09-09 Thread MRAB


Mart. wrote:

On Sep 8, 4:33 pm, MRAB pyt...@mrabarnett.plus.com wrote:

Mart. wrote:

On Sep 8, 3:53 pm, MRAB pyt...@mrabarnett.plus.com wrote:

Mart. wrote:

On Sep 8, 3:14 pm, Andreas Tawn andreas.t...@ubisoft.com wrote:

Hi,
I need to extract a string after matching a regular expression. For
example I have the string...
s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
and once I match FTPHOST I would like to extract
e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
problem, I had been trying to match the string using something like
this:
m = re.findall(r"FTPHOST", s)
But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
part. Perhaps I need to find the string and then split it? I had some
help with a similar problem, but now I don't seem to be able to
transfer that to this problem!
Thanks in advance for the help,
Martin

No need for regex.
s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
if "FTPHOST" in s:
    return s[9:]
Cheers,
Drea

Sorry perhaps I didn't make it clear enough, so apologies. I only
presented the example  s = FTPHOST: e4ftl01u.ecs.nasa.gov as I
thought this easily encompassed the problem. The solution presented
works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
when I used this on the actual file I am trying to parse I realised it
is slightly more complicated as this also pulls out other information,
for example it prints
e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
etc. So I need to find a way to stop it before the \r
Slicing the string wouldn't work in this scenario as I can envisage a
situation where the string length increases and I would prefer not to
keep having to change the string.

If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, 
then s[0].split(":")[1].strip() will work.

It is an email which contains information before and after the main
section I am interested in, namely...
FINISHED: 09/07/2009 08:42:31
MEDIATYPE: FtpPull
MEDIAFORMAT: FILEFORMAT
FTPHOST: e4ftl01u.ecs.nasa.gov
FTPDIR: /PullDir/0301872638CySfQB
Ftp Pull Download Links:
ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
Down load ZIP file of packaged order:
ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
FTPEXPR: 09/12/2009 08:42:31
MEDIA 1 of 1
MEDIAID:
I have been doing this to turn the email into a string
email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())

To me that seems a strange thing to do. You could just read the entire
file as a string:
 f = open(email, 'r')
 s = f.read()

so FTPHOST isn't the first element, it is just part of a larger
string. When I turn the email into a string it looks like...
'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r
\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down
load ZIP file of packaged order:\r\n',
So not sure splitting it like you suggested works in this case.

Within the file is a list of files, e.g.
TOTAL FILES: 2
   FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf
   FILESIZE: 11028908
   FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
   FILESIZE: 18975
and what I want to do is get the ftp address from the file and collect
these files to pull down from the web, e.g.
MOD13A2.A2007033.h17v08.005.2007101023605.hdf
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
Thus far I have
#!/usr/bin/env python
import sys
import re
import urllib
email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())
m = re.findall(rMOD\.\.h..v..\.005\..\
\, s)
ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
ftpdir  = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
url = 'ftp://' + ftphost + ftpdir
for i in xrange(len(m)):
   print i, ':', len(m)
   file1 = m[i][:-4]   # remove xml bit.
   file2 = m[i]
   urllib.urlretrieve(url, file1)
   urllib.urlretrieve(url, file2)
which works. Clearly my match for the MOD13A2* files isn't ideal, I
guess, but they will always occupy those dimensions, so it should
work. Any suggestions on how to improve this are appreciated.
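
One possible way to tidy the script above, offered as a sketch rather than as the approach anyone in the thread settled on: read the email as plain text and let anchored patterns pick out each field. The function name parse_order and the sample text are assumptions made for illustration; the sample mirrors the fields quoted above. Python 3 syntax is used, and downloading is left out so the parsing can be checked on its own.

```python
import re

def parse_order(text):
    """Return (base_url, filenames) parsed from the text of an order email."""
    # re.MULTILINE lets ^ match at the start of every line. \S+ stops at
    # whitespace, so a trailing \r is never captured.
    host = re.search(r"^FTPHOST:[ \t]*(\S+)", text, re.MULTILINE).group(1)
    ftpdir = re.search(r"^FTPDIR:[ \t]*(\S+)", text, re.MULTILINE).group(1)
    names = re.findall(r"^[ \t]*FILENAME:[ \t]*(\S+)", text, re.MULTILINE)
    return "ftp://" + host + ftpdir, names

# Hypothetical sample built from the fields shown in the thread.
sample = (
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
    "TOTAL FILES: 2\r\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\r\n"
    "\t\tFILESIZE: 11028908\r\n"
    "\r\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\r\n"
    "\t\tFILESIZE: 18975\r\n"
)

base_url, names = parse_order(sample)
for name in names:
    # One URL per file; each file needs its own URL, not the directory URL.
    print(base_url + "/" + name)
```

Matching on the FILENAME: label avoids having to describe the shape of the MOD13A2 names with a run of dots.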

Suppose the file contains your example text above. Using 'readlines'
returns a list of the lines:

  >>> f = open(email, 'r')
  >>> lines = f.readlines()
  >>> lines
['TOTAL FILES: 2\n', '\t\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n', '\t\tFILESIZE:
11028908\n', '\n', '\t\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n', '\t\tFILESIZE:
18975\n']

Using 'str' on that list then converts it to a string _representation_
of that list:

  >>> str(lines)
['TOTAL FILES: 2\\n', '\\t\\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf\\n', '\\t\\tFILESIZE:
11028908\\n', '\\n',
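
The difference between reading the file as one string and taking str() of the list of lines can be sketched as follows. io.StringIO stands in for the opened email file here, which is an assumption made so the example runs without a file on disk.

```python
import io

text = "TOTAL FILES: 2\n\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n"

# io.StringIO plays the role of open(email, 'r') in this sketch.
whole = io.StringIO(text).read()               # the contents, unchanged
as_repr = str(io.StringIO(text).readlines())   # repr of a list of lines

print(whole == text)      # True
print("\n" in as_repr)    # False: no real newline survives
print("\\n" in as_repr)   # True: a backslash followed by 'n' instead
```

Patterns written against the first form match real line endings; patterns written against the second have to match the escape sequences as literal text.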

Re: Extracting patterns after matching a regex

2009-09-09 Thread Al Fansome



Extracting patterns after matching a regex

2009-09-08 Thread Martin

Hi,

I need to extract a string after matching a regular expression. For
example I have the string...

s = "FTPHOST: e4ftl01u.ecs.nasa.gov"

and once I match FTPHOST I would like to extract
e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
problem, I had been trying to match the string using something like
this:

m = re.findall(r"FTPHOST", s)

But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
part. Perhaps I need to find the string and then split it? I had some
help with a similar problem, but now I don't seem to be able to
transfer that to this problem!

Thanks in advance for the help,

Martin
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Extracting patterns after matching a regex

2009-09-08 Thread MRAB


Martin wrote:

Hi,

I need to extract a string after matching a regular expression. For
example I have the string...

s = "FTPHOST: e4ftl01u.ecs.nasa.gov"

and once I match FTPHOST I would like to extract
e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
problem, I had been trying to match the string using something like
this:

m = re.findall(r"FTPHOST", s)

But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
part. Perhaps I need to find the string and then split it? I had some
help with a similar problem, but now I don't seem to be able to
transfer that to this problem!

Thanks in advance for the help,


m = re.search(r"FTPHOST: (.*)", s)
print(m.group(1))
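
A minimal sketch of the same idea that also copes with a missing header. The helper name extract_ftphost is an assumption made for illustration, not something from the thread.

```python
import re

def extract_ftphost(text):
    """Return the text captured after 'FTPHOST: ', or None if it is absent."""
    # The parentheses form capture group 1. '.' does not match a newline,
    # so '.*' stops at the end of the line.
    m = re.search(r"FTPHOST: (.*)", text)
    return m.group(1) if m else None

print(extract_ftphost("FTPHOST: e4ftl01u.ecs.nasa.gov"))
print(extract_ftphost("no host header here"))
```

Checking the match object before calling group() avoids an AttributeError when the header is not present.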

Re: Extracting patterns after matching a regex

2009-09-08 Thread pdpi

On Sep 8, 1:56 pm, Martin mdeka...@gmail.com wrote:
 Hi,

 I need to extract a string after matching a regular expression. For
 example I have the string...

 s = "FTPHOST: e4ftl01u.ecs.nasa.gov"

 and once I match FTPHOST I would like to extract
 e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
 problem, I had been trying to match the string using something like
 this:

 m = re.findall(r"FTPHOST", s)

 But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
 part. Perhaps I need to find the string and then split it? I had some
 help with a similar problem, but now I don't seem to be able to
 transfer that to this problem!

 Thanks in advance for the help,

 Martin

What you're doing is telling Python "look for all matches of
'FTPHOST'". That doesn't really help you much, because you pretty much
expect FTPHOST to be there anyway, so finding it means squat. What you
_really_ want to tell it is "Look for things shaped like 'FTPHOST:
ftpaddress', and tell me what ftpaddress actually is". Look here:
http://docs.python.org/howto/regex.html#grouping. That'll explain how
to accomplish what you're trying to do.
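
The contrast described above can be sketched with re.findall itself. The \S+ pattern (a run of non-whitespace characters) is an assumption about what the address looks like, not something stated in the thread.

```python
import re

s = "FTPHOST: e4ftl01u.ecs.nasa.gov"

# Without a group, findall returns the text of each whole match.
whole = re.findall(r"FTPHOST", s)

# With one group, findall returns only what the group captured.
hosts = re.findall(r"FTPHOST: (\S+)", s)

print(whole)   # ['FTPHOST']
print(hosts)   # ['e4ftl01u.ecs.nasa.gov']
```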

Re: Extracting patterns after matching a regex

2009-09-08 Thread Mark Tolonen



Martin mdeka...@gmail.com wrote in message 
news:5941d8f1-27c0-47d9-8221-d21f07200...@j39g2000yqh.googlegroups.com...

Hi,

I need to extract a string after matching a regular expression. For
example I have the string...

s = "FTPHOST: e4ftl01u.ecs.nasa.gov"

and once I match FTPHOST I would like to extract
e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
problem, I had been trying to match the string using something like
this:

m = re.findall(r"FTPHOST", s)

But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
part. Perhaps I need to find the string and then split it? I had some
help with a similar problem, but now I don't seem to be able to
transfer that to this problem!


In regular expressions, you match the entire string you are interested in, 
and parenthesize the parts that you want to parse out of that string.  The 
group() method is used to get the whole string with group(0), and each of 
the parenthesized parts with group(n).  An example:



>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
>>> import re
>>> re.search(r'FTPHOST: (.*)',s).group(0)
'FTPHOST: e4ftl01u.ecs.nasa.gov'
>>> re.search(r'FTPHOST: (.*)',s).group(1)
'e4ftl01u.ecs.nasa.gov'

-Mark



Re: Extracting patterns after matching a regex

2009-09-08 Thread Mart.

On Sep 8, 2:15 pm, MRAB pyt...@mrabarnett.plus.com wrote:
 Martin wrote:
  Hi,

  I need to extract a string after matching a regular expression. For
  example I have the string...

  s = "FTPHOST: e4ftl01u.ecs.nasa.gov"

  and once I match FTPHOST I would like to extract
  e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
  problem, I had been trying to match the string using something like
  this:

  m = re.findall(r"FTPHOST", s)

  But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
  part. Perhaps I need to find the string and then split it? I had some
  help with a similar problem, but now I don't seem to be able to
  transfer that to this problem!

  Thanks in advance for the help,

 m = re.search(r"FTPHOST: (.*)", s)
 print(m.group(1))

So the .* means to match everything after the regex? That doesn't help
in this case, as the string is placed amongst others, for example:

MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST:
e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
'Ftp Pull Download Links: \r\n',


Re: Extracting patterns after matching a regex

2009-09-08 Thread Mart.

On Sep 8, 2:21 pm, Mark Tolonen metolone+gm...@gmail.com wrote:
 Martin mdeka...@gmail.com wrote in message

 news:5941d8f1-27c0-47d9-8221-d21f07200...@j39g2000yqh.googlegroups.com...



  Hi,

  I need to extract a string after matching a regular expression. For
  example I have the string...

  s = "FTPHOST: e4ftl01u.ecs.nasa.gov"

  and once I match FTPHOST I would like to extract
  e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
  problem, I had been trying to match the string using something like
  this:

  m = re.findall(r"FTPHOST", s)

  But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
  part. Perhaps I need to find the string and then split it? I had some
  help with a similar problem, but now I don't seem to be able to
  transfer that to this problem!

 In regular expressions, you match the entire string you are interested in,
 and parenthesize the parts that you want to parse out of that string.  The
 group() method is used to get the whole string with group(0), and each of
 the parenthesized parts with group(n).  An example:

 >>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
 >>> import re
 >>> re.search(r'FTPHOST: (.*)',s).group(0)
 'FTPHOST: e4ftl01u.ecs.nasa.gov'
 >>> re.search(r'FTPHOST: (.*)',s).group(1)
 'e4ftl01u.ecs.nasa.gov'

 -Mark

I see what you mean regarding the groups. Because my string is nested
in amongst others e.g.

MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST:
e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
'Ftp Pull Download Links: \r\n',

I get the information that follows as well. So is the only way to then
parse the new string? I am trying to construct something that is
fairly robust, so not sure just printing before the \r is the best
solution.

Thanks

Re: Extracting patterns after matching a regex

2009-09-08 Thread Mart.

On Sep 8, 2:16 pm, Andreas Tawn andreas.t...@ubisoft.com wrote:
  Hi,

  I need to extract a string after matching a regular expression. For
  example I have the string...

  s = "FTPHOST: e4ftl01u.ecs.nasa.gov"

  and once I match FTPHOST I would like to extract
  e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
  problem, I had been trying to match the string using something like
  this:

  m = re.findall(r"FTPHOST", s)

  But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
  part. Perhaps I need to find the string and then split it? I had some
  help with a similar problem, but now I don't seem to be able to
  transfer that to this problem!

  Thanks in advance for the help,

  Martin

 No need for regex.

 s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
 if "FTPHOST" in s:
     return s[9:]

 Cheers,

 Drea

Sorry, perhaps I didn't make it clear enough, so apologies. I only
presented the example s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I
thought this easily encompassed the problem. The solution presented
works fine for this, i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
when I used this on the actual file I am trying to parse I realised it
is slightly more complicated, as this also pulls out other information.
For example, it prints

e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',

etc. So I need to find a way to stop it before the \r.

Slicing the string wouldn't work in this scenario, as I can envisage a
situation where the string length increases and I would prefer not to
keep having to change the string.

Many thanks

Re: Extracting patterns after matching a regex

2009-09-08 Thread pdpi

On Sep 8, 3:21 pm, nn prueba...@latinmail.com wrote:
 On Sep 8, 9:55 am, Mart. mdeka...@gmail.com wrote:





  On Sep 8, 2:16 pm, Andreas Tawn andreas.t...@ubisoft.com wrote:

Hi,

I need to extract a string after matching a regular expression. For
example I have the string...

s = "FTPHOST: e4ftl01u.ecs.nasa.gov"

and once I match FTPHOST I would like to extract
e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
problem, I had been trying to match the string using something like
this:

m = re.findall(r"FTPHOST", s)

But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
part. Perhaps I need to find the string and then split it? I had some
help with a similar problem, but now I don't seem to be able to
transfer that to this problem!

Thanks in advance for the help,

Martin

   No need for regex.

   s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
   if "FTPHOST" in s:
       return s[9:]

   Cheers,

   Drea

  Sorry perhaps I didn't make it clear enough, so apologies. I only
  presented the example  s = FTPHOST: e4ftl01u.ecs.nasa.gov as I
  thought this easily encompassed the problem. The solution presented
  works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
  when I used this on the actual file I am trying to parse I realised it
  is slightly more complicated as this also pulls out other information,
  for example it prints

  e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
  'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
  0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',

  etc. So I need to find a way to stop it before the \r

  Slicing the string wouldn't work in this scenario as I can envisage a
  situation where the string length increases and I would prefer not to
  keep having to change the string.

  Many thanks

 It is not clear from your post what the input is really like. But just
 guessing this might work:

 >>> print s

 'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n','FTPHOST:
 e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
 'Ftp Pull Download Links: \r\n'

 >>> re.search(r'FTPHOST: (.*?)\\r',s).group(1)

 'e4ftl01u.ecs.nasa.gov'

Except, I'm assuming, the OP's getting the data from a (windows-
formatted) file, so \r\n shouldn't be escaped in the regex:

 re.search(r'FTPHOST: (.*?)\r',s).group(1)
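
Which of the two patterns applies depends on how the text was obtained. The following sketch sets both cases side by side; the variable names and the two-line sample are assumptions made for illustration.

```python
import re

# Raw file contents: real carriage-return and newline characters.
raw = "FTPHOST: e4ftl01u.ecs.nasa.gov\r\nFTPDIR: /PullDir/0301872638CySfQB\r\n"

# str() of a list of lines: line ends appear as the two characters '\' and 'r'.
as_repr = str(raw.splitlines(True))

unescaped = r"FTPHOST: (.*?)\r"    # regex \r matches a real carriage return
escaped = r"FTPHOST: (.*?)\\r"     # regex \\r matches a backslash followed by 'r'

print(re.search(unescaped, raw).group(1))
print(re.search(escaped, as_repr).group(1))
print(re.search(escaped, raw))        # None: raw text holds no backslash
print(re.search(unescaped, as_repr))  # None: the repr holds no real \r
```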

Re: Extracting patterns after matching a regex

2009-09-08 Thread MRAB


Mart. wrote:

On Sep 8, 3:14 pm, Andreas Tawn andreas.t...@ubisoft.com wrote:

Hi,
I need to extract a string after matching a regular expression. For
example I have the string...
s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
and once I match FTPHOST I would like to extract
e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
problem, I had been trying to match the string using something like
this:
m = re.findall(rFTPHOST, s)
But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
part. Perhaps I need to find the string and then split it? I had some
help with a similar problem, but now I don't seem to be able to
transfer that to this problem!
Thanks in advance for the help,
Martin

No need for regex.
s = FTPHOST: e4ftl01u.ecs.nasa.gov
If FTPHOST in s:
return s[9:]
Cheers,
Drea

Sorry perhaps I didn't make it clear enough, so apologies. I only
presented the example  s = FTPHOST: e4ftl01u.ecs.nasa.gov as I
thought this easily encompassed the problem. The solution presented
works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
when I used this on the actual file I am trying to parse I realised it
is slightly more complicated as this also pulls out other information,
for example it prints
e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
etc. So I need to find a way to stop it before the \r
slicing the string wouldn't work in this scenario as I can envisage a
situation where the string length increases and I would prefer not to
keep having to change the string.

If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, 
then s[0].split(:)[1].strip() will work.


It is an email which contains information before and after the main
section I am interested in, namely...

FINISHED: 09/07/2009 08:42:31

MEDIATYPE: FtpPull
MEDIAFORMAT: FILEFORMAT
FTPHOST: e4ftl01u.ecs.nasa.gov
FTPDIR: /PullDir/0301872638CySfQB
Ftp Pull Download Links:
ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
Down load ZIP file of packaged order:
ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
FTPEXPR: 09/12/2009 08:42:31
MEDIA 1 of 1
MEDIAID:

I have been doing this to turn the email into a string

email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())


To me that seems a strange thing to do. You could just read the entire
file as a string:

f = open(email, 'r')
s = f.read()
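
With the whole file in one string, a per-line pattern can pick out each field. The following is only a sketch in Python 3 syntax: extract_field is a hypothetical helper name, and the literal string stands in for the result of f.read(), using content quoted in this thread.

```python
import re

def extract_field(text, name):
    """Return the value that follows 'NAME:' on its line, or None if absent."""
    # [^\r\n]* stops at the end of the line, whatever the line ending is.
    pattern = r'^' + re.escape(name) + r':[ \t]*([^\r\n]*)'
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1).strip() if match else None

# Stand-in for s = f.read().
s = (
    "MEDIATYPE: FtpPull\r\n"
    "MEDIAFORMAT: FILEFORMAT\r\n"
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
)

ftphost = extract_field(s, "FTPHOST")
ftpdir = extract_field(s, "FTPDIR")
url = "ftp://" + ftphost + ftpdir
```

Because the character class excludes both \r and \n, the same pattern should work for Unix and Windows line endings.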


so FTPHOST isn't the first element, it is just part of a larger
string. When I turn the email into a string it looks like...

'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r
\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down
load ZIP file of packaged order:\r\n',

So not sure splitting it like you suggested works in this case.




Re: Extracting patterns after matching a regex

2009-09-08 Thread Mart.

On Sep 8, 3:53 pm, MRAB pyt...@mrabarnett.plus.com wrote:
 Mart. wrote:
  On Sep 8, 3:14 pm, Andreas Tawn andreas.t...@ubisoft.com wrote:
  Hi,
  I need to extract a string after a matching a regular expression. For
  example I have the string...
  s = FTPHOST: e4ftl01u.ecs.nasa.gov
  and once I match FTPHOST I would like to extract
  e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
  problem, I had been trying to match the string using something like
  this:
  m = re.findall(rFTPHOST, s)
  But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
  part. Perhaps I need to find the string and then split it? I had some
  help with a similar problem, but now I don't seem to be able to
  transfer that to this problem!
  Thanks in advance for the help,
  Martin
  No need for regex.
  s = FTPHOST: e4ftl01u.ecs.nasa.gov
  If FTPHOST in s:
      return s[9:]
  Cheers,
  Drea
  Sorry perhaps I didn't make it clear enough, so apologies. I only
  presented the example  s = FTPHOST: e4ftl01u.ecs.nasa.gov as I
  thought this easily encompassed the problem. The solution presented
  works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
  when I used this on the actual file I am trying to parse I realised it
  is slightly more complicated as this also pulls out other information,
  for example it prints
  e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
  'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
  0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
  etc. So I need to find a way to stop it before the \r
  slicing the string wouldn't work in this scenario as I can envisage a
  situation where the string length increases and I would prefer not to
  keep having to change the string.
  If, as Terry suggested, you do have a tuple of strings and the first 
  element has FTPHOST, then s[0].split(:)[1].strip() will work.

  It is an email which contains information before and after the main
  section I am interested in, namely...

  FINISHED: 09/07/2009 08:42:31

  MEDIATYPE: FtpPull
  MEDIAFORMAT: FILEFORMAT
  FTPHOST: e4ftl01u.ecs.nasa.gov
  FTPDIR: /PullDir/0301872638CySfQB
  Ftp Pull Download Links:
 ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
  Down load ZIP file of packaged order:
 ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
  FTPEXPR: 09/12/2009 08:42:31
  MEDIA 1 of 1
  MEDIAID:

  I have been doing this to turn the email into a string

  email = sys.argv[1]
  f = open(email, 'r')
  s = str(f.readlines())

 To me that seems a strange thing to do. You could just read the entire
 file as a string:

      f = open(email, 'r')
      s = f.read()

  so FTPHOST isn't the first element, it is just part of a larger
  string. When I turn the email into a string it looks like...

  'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
  'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
  'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r
  \n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down
  load ZIP file of packaged order:\r\n',

  So not sure splitting it like you suggested works in this case.



Within the file is a list of files, e.g.

TOTAL FILES: 2
FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf
FILESIZE: 11028908

FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
FILESIZE: 18975

and what I want to do is get the FTP address from the file and collect
these files to pull down from the web, e.g.

MOD13A2.A2007033.h17v08.005.2007101023605.hdf
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml

Thus far I have

#!/usr/bin/env python

import sys
import re
import urllib

email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())
m = re.findall(rMOD\.\.h..v..\.005\..\
\, s)

ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
ftpdir  = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
url = 'ftp://' + ftphost + ftpdir

for i in xrange(len(m)):

    print i, ':', len(m)
    file1 = m[i][:-4]   # strip the trailing '.xml'
    file2 = m[i]

    urllib.urlretrieve(url, file1)
    urllib.urlretrieve(url, file2)

which works. Clearly my match for the MOD13A2* files isn't ideal, I
guess, but they will always occupy those dimensions, so it should
work. Any suggestions on how to improve this are appreciated.
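
One possible improvement, sketched here in Python 3 syntax, is to match on the FILENAME label instead of on the fixed-width shape of the MOD13A2 names. The function name list_download_urls is hypothetical, and the assumption that each file lives at the FTPDIR path plus "/" plus its name is a guess based on the download links in the email. Nothing is downloaded here; the sketch only builds the URLs.

```python
import re

def list_download_urls(text):
    """Build one FTP URL per FILENAME entry found in the email text."""
    # [ \t]* keeps the match on the same line as the label.
    host = re.search(r'FTPHOST:[ \t]*(\S+)', text).group(1)
    directory = re.search(r'FTPDIR:[ \t]*(\S+)', text).group(1)
    names = re.findall(r'FILENAME:[ \t]*(\S+)', text)
    # Assumption: the files sit directly inside the FTPDIR directory.
    return ["ftp://" + host + directory + "/" + name for name in names]

sample = (
    "FTPHOST: e4ftl01u.ecs.nasa.gov\n"
    "FTPDIR: /PullDir/0301872638CySfQB\n"
    "TOTAL FILES: 2\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n"
    "\t\tFILESIZE: 11028908\n"
    "\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n"
    "\t\tFILESIZE: 18975\n"
)

urls = list_download_urls(sample)
```

Each entry in urls could then be passed to a download call together with a local file name.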

Thanks.

RE: Extracting patterns after matching a regex

2009-09-08 Thread Andreas Tawn

   Hi,
 
   I need to extract a string after a matching a regular expression. For
   example I have the string...
 
   s = FTPHOST: e4ftl01u.ecs.nasa.gov
 
   and once I match FTPHOST I would like to extract
   e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
   problem, I had been trying to match the string using something like
   this:
 
   m = re.findall(rFTPHOST, s)
 
   But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
   part. Perhaps I need to find the string and then split it? I had some
   help with a similar problem, but now I don't seem to be able to
   transfer that to this problem!
 
   Thanks in advance for the help,
 
   Martin
 
  No need for regex.
 
  s = FTPHOST: e4ftl01u.ecs.nasa.gov
  If FTPHOST in s:
      return s[9:]
 
  Cheers,
 
  Drea
 
 Sorry perhaps I didn't make it clear enough, so apologies. I only
 presented the example  s = FTPHOST: e4ftl01u.ecs.nasa.gov as I
 thought this easily encompassed the problem. The solution presented
 works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
 when I used this on the actual file I am trying to parse I realised it
 is slightly more complicated as this also pulls out other information,
 for example it prints
 
 e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
 0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
 
 etc. So I need to find a way to stop it before the \r
 
 slicing the string wouldn't work in this scenario as I can envisage a
 situation where the string length increases and I would prefer not to
 keep having to change the string.

If, as Terry suggested, you do have a tuple of strings and the first element 
has FTPHOST, then s[0].split(":")[1].strip() will work.
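
As a small illustration (Python 3 syntax, with sample lines copied from the email quoted in this thread), the split works as described for FTPHOST, but a value that itself contains colons needs the maxsplit argument:

```python
lines = [
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n",
    "FINISHED: 09/07/2009 08:42:31\r\n",
]

# Splitting on every colon is fine when the value has no colon of its own.
host = lines[0].split(":")[1].strip()

# The timestamp contains colons, so a plain split cuts the value short.
naive = lines[1].split(":")[1].strip()

# Limiting the split to the first colon keeps the value whole.
finished = lines[1].split(":", 1)[1].strip()
```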

Re: Extracting patterns after matching a regex

2009-09-08 Thread Mart.

On Sep 8, 3:14 pm, Andreas Tawn andreas.t...@ubisoft.com wrote:
Hi,

I need to extract a string after a matching a regular expression. For
example I have the string...

s = FTPHOST: e4ftl01u.ecs.nasa.gov

and once I match FTPHOST I would like to extract
e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
problem, I had been trying to match the string using something like
this:

m = re.findall(rFTPHOST, s)

But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
part. Perhaps I need to find the string and then split it? I had some
help with a similar problem, but now I don't seem to be able to
transfer that to this problem!

Thanks in advance for the help,

Martin

   No need for regex.

   s = FTPHOST: e4ftl01u.ecs.nasa.gov
   If FTPHOST in s:
       return s[9:]

   Cheers,

   Drea

  Sorry perhaps I didn't make it clear enough, so apologies. I only
  presented the example  s = FTPHOST: e4ftl01u.ecs.nasa.gov as I
  thought this easily encompassed the problem. The solution presented
  works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
  when I used this on the actual file I am trying to parse I realised it
  is slightly more complicated as this also pulls out other information,
  for example it prints

  e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
  'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
  0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',

  etc. So I need to find a way to stop it before the \r

  slicing the string wouldn't work in this scenario as I can envisage a
  situation where the string length increases and I would prefer not to
  keep having to change the string.

 If, as Terry suggested, you do have a tuple of strings and the first element 
 has FTPHOST, then s[0].split(:)[1].strip() will work.

It is an email which contains information before and after the main
section I am interested in, namely...

FINISHED: 09/07/2009 08:42:31

MEDIATYPE: FtpPull
MEDIAFORMAT: FILEFORMAT
FTPHOST: e4ftl01u.ecs.nasa.gov
FTPDIR: /PullDir/0301872638CySfQB
Ftp Pull Download Links:
ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
Down load ZIP file of packaged order:
ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
FTPEXPR: 09/12/2009 08:42:31
MEDIA 1 of 1
MEDIAID:

I have been doing this to turn the email into a string

email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())

so FTPHOST isn't the first element, it is just part of a larger
string. When I turn the email into a string it looks like...

'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r
\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down
load ZIP file of packaged order:\r\n',

So not sure splitting it like you suggested works in this case.

Thanks

Re: Re: Extracting patterns after matching a regex

2009-09-08 Thread Dave Angel


Mart. wrote:

snip
I have been doing this to turn the email into a string

email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())

so FTPHOST isn't the first element, it is just part of a larger
string. When I turn the email into a string it looks like...

'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r
\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down
load ZIP file of packaged order:\r\n',
snip
  


The mistake I see is trying to turn a list into a string, just so you 
can try to parse it back again.  Just write a loop that iterates through 
the list that readlines() returns.
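
A loop of that kind might look like the sketch below (Python 3 syntax). The io.StringIO object is only a stand-in for the opened email file, and the names wanted and found are illustrative.

```python
import io
import re

# Stand-in for f = open(email, 'r'); the content is taken from the thread.
f = io.StringIO(
    "FINISHED: 09/07/2009 08:42:31\n"
    "\n"
    "MEDIATYPE: FtpPull\n"
    "FTPHOST: e4ftl01u.ecs.nasa.gov\n"
    "FTPDIR: /PullDir/0301872638CySfQB\n"
)

# Applied to one line at a time, so there is no line ending to work around.
wanted = re.compile(r'(FTPHOST|FTPDIR):\s*(\S+)')

found = {}
for line in f.readlines():
    m = wanted.match(line)
    if m:
        found[m.group(1)] = m.group(2)

url = "ftp://" + found["FTPHOST"] + found["FTPDIR"]
```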


DaveA


Re: Extracting patterns after matching a regex

2009-09-08 Thread Terry Reedy


Mart. wrote:

On Sep 8, 2:15 pm, MRAB pyt...@mrabarnett.plus.com wrote:

Martin wrote:

Hi,
I need to extract a string after a matching a regular expression.


Whether or not you need re is an issue to be determined.

 For example I have the string...

s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
and once I match FTPHOST I would like to extract
e4ftl01u.ecs.nasa.gov.


Just split the string on ': ' and take the second part.
Or find the position of the space and slice the remainder.


so the .* means to match everything after the regex? That doesn't help
in this case


It helps in the case you presented.

 as the string is placed amongst others for example.


MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST:
e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
'Ftp Pull Download Links: \r\n',


What you show above is a tuple of strings. Scan the members looking for 
s.startswith('FTPHOST:') and apply previous answer.
Or if the above is actually meant to be one string (with quotes omitted), 
split on ',' and apply the previous answer.
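
For the tuple-of-strings case, the scan could be sketched as follows (Python 3 syntax; value_of is a hypothetical helper name, and the tuple reuses the strings shown above):

```python
parts = (
    "MEDIATYPE: FtpPull\r\n",
    "MEDIAFORMAT: FILEFORMAT\r\n",
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n",
    "FTPDIR: /PullDir/0301872638CySfQB\r\n",
)

def value_of(members, label):
    """Return the text after 'label: ' in the first member that starts with it."""
    for member in members:
        if member.startswith(label + ":"):
            pieces = member.split(": ", 1)
            # A label with nothing after it yields an empty value.
            return pieces[1].strip() if len(pieces) == 2 else ""
    return None

host = value_of(parts, "FTPHOST")

# The other suggestion: find the position of the space and slice the remainder.
line = "FTPHOST: e4ftl01u.ecs.nasa.gov"
host_by_slice = line[line.find(" ") + 1:]
```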


tjr


Re: Extracting patterns after matching a regex

2009-09-08 Thread nn

On Sep 8, 9:55 am, Mart. mdeka...@gmail.com wrote:
 On Sep 8, 2:16 pm, Andreas Tawn andreas.t...@ubisoft.com wrote:



   Hi,

   I need to extract a string after a matching a regular expression. For
   example I have the string...

   s = FTPHOST: e4ftl01u.ecs.nasa.gov

   and once I match FTPHOST I would like to extract
   e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
   problem, I had been trying to match the string using something like
   this:

   m = re.findall(rFTPHOST, s)

   But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
   part. Perhaps I need to find the string and then split it? I had some
   help with a similar problem, but now I don't seem to be able to
   transfer that to this problem!

   Thanks in advance for the help,

   Martin

  No need for regex.

  s = FTPHOST: e4ftl01u.ecs.nasa.gov
  If FTPHOST in s:
      return s[9:]

  Cheers,

  Drea

 Sorry perhaps I didn't make it clear enough, so apologies. I only
 presented the example  s = FTPHOST: e4ftl01u.ecs.nasa.gov as I
 thought this easily encompassed the problem. The solution presented
 works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
 when I used this on the actual file I am trying to parse I realised it
 is slightly more complicated as this also pulls out other information,
 for example it prints

 e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
 0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',

 etc. So I need to find a way to stop it before the \r

 slicing the string wouldn't work in this scenario as I can envisage a
 situation where the string length increases and I would prefer not to
 keep having to change the string.

 Many thanks

It is not clear from your post what the input is really like. But just
guessing this might work:

>>> print s
'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n','FTPHOST:
e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r
\n','Ftp Pull Download Links: \r\n'

>>> re.search(r'FTPHOST: (.*?)\\r',s).group(1)
'e4ftl01u.ecs.nasa.gov'

RE: Extracting patterns after matching a regex

2009-09-08 Thread Andreas Tawn

 Hi,
 
 I need to extract a string after matching a regular expression. For
 example I have the string...
 
 s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
 
 and once I match FTPHOST I would like to extract
 e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
 problem, I had been trying to match the string using something like
 this:
 
 m = re.findall(r"FTPHOST", s)
 
 But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
 part. Perhaps I need to find the string and then split it? I had some
 help with a similar problem, but now I don't seem to be able to
 transfer that to this problem!
 
 Thanks in advance for the help,
 
 Martin

No need for regex.

s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
if "FTPHOST" in s:
    return s[9:]
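
A variant of the same idea that avoids the hard-coded 9 is sketched below (Python 3 syntax). The function name after_prefix is hypothetical, and the check uses startswith rather than "in", which is a slightly stricter test than the one above.

```python
def after_prefix(s, prefix="FTPHOST: "):
    """Return whatever follows the prefix, or None if s does not start with it."""
    if s.startswith(prefix):
        # len(prefix) replaces the literal 9, so other labels work unchanged.
        return s[len(prefix):]
    return None

host = after_prefix("FTPHOST: e4ftl01u.ecs.nasa.gov")
directory = after_prefix("FTPDIR: /PullDir/0301872638CySfQB", "FTPDIR: ")
```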

Cheers,

Drea

Re: Extracting patterns after matching a regex

2009-09-08 Thread MRAB


Mart. wrote:

On Sep 8, 3:53 pm, MRAB pyt...@mrabarnett.plus.com wrote:

Mart. wrote:

On Sep 8, 3:14 pm, Andreas Tawn andreas.t...@ubisoft.com wrote:

Hi,
I need to extract a string after a matching a regular expression. For
example I have the string...
s = FTPHOST: e4ftl01u.ecs.nasa.gov
and once I match FTPHOST I would like to extract
e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
problem, I had been trying to match the string using something like
this:
m = re.findall(rFTPHOST, s)
But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
part. Perhaps I need to find the string and then split it? I had some
help with a similar problem, but now I don't seem to be able to
transfer that to this problem!
Thanks in advance for the help,
Martin

No need for regex.
s = FTPHOST: e4ftl01u.ecs.nasa.gov
If FTPHOST in s:
return s[9:]
Cheers,
Drea

Sorry perhaps I didn't make it clear enough, so apologies. I only
presented the example  s = FTPHOST: e4ftl01u.ecs.nasa.gov as I
thought this easily encompassed the problem. The solution presented
works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
when I used this on the actual file I am trying to parse I realised it
is slightly more complicated as this also pulls out other information,
for example it prints
e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
etc. So I need to find a way to stop it before the \r
slicing the string wouldn't work in this scenario as I can envisage a
situation where the string length increases and I would prefer not to
keep having to change the string.

If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, 
then s[0].split(:)[1].strip() will work.

It is an email which contains information before and after the main
section I am interested in, namely...
FINISHED: 09/07/2009 08:42:31
MEDIATYPE: FtpPull
MEDIAFORMAT: FILEFORMAT
FTPHOST: e4ftl01u.ecs.nasa.gov
FTPDIR: /PullDir/0301872638CySfQB
Ftp Pull Download Links:
ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
Down load ZIP file of packaged order:
ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
FTPEXPR: 09/12/2009 08:42:31
MEDIA 1 of 1
MEDIAID:
I have been doing this to turn the email into a string
email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())

To me that seems a strange thing to do. You could just read the entire
file as a string:

 f = open(email, 'r')
 s = f.read()


so FTPHOST isn't the first element, it is just part of a larger
string. When I turn the email into a string it looks like...
'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r
\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down
load ZIP file of packaged order:\r\n',
So not sure splitting it like you suggested works in this case.




Within the file are a list of files, e.g.

TOTAL FILES: 2
FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf
FILESIZE: 11028908

FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
FILESIZE: 18975

and what i want to do is get the ftp address from the file and collect
these files to pull down from the web e.g.

MOD13A2.A2007033.h17v08.005.2007101023605.hdf
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml

Thus far I have

#!/usr/bin/env python

import sys
import re
import urllib

email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())
m = re.findall(rMOD\.\.h..v..\.005\..\
\, s)

ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
ftpdir  = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
url = 'ftp://' + ftphost + ftpdir

for i in xrange(len(m)):

    print i, ':', len(m)
    file1 = m[i][:-4]   # strip the trailing '.xml'
    file2 = m[i]

    urllib.urlretrieve(url, file1)
    urllib.urlretrieve(url, file2)

which works. Clearly my match for the MOD13A2* files isn't ideal, I
guess, but they will always occupy those dimensions, so it should
work. Any suggestions on how to improve this are appreciated.


Suppose the file contains your example text above. Using 'readlines'
returns a list of the lines:

>>> f = open(email, 'r')
>>> lines = f.readlines()
>>> lines
['TOTAL FILES: 2\n', '\t\tFILENAME: 
MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n', '\t\tFILESIZE: 
11028908\n', '\n', '\t\tFILENAME: 
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n', '\t\tFILESIZE: 
18975\n']


Using 'str' on that list then converts it to a string _representation_
of that list:

>>> str(lines)
['TOTAL FILES: 2\\n', '\\t\\tFILENAME: 
MOD13A2.A2007033.h17v08.005.2007101023605.hdf\\n', '\\t\\tFILESIZE: 
11028908\\n', '\\n', '\\t\\tFILENAME:

Re: Extracting patterns after matching a regex

2009-09-08 Thread nn

On Sep 8, 10:27 am, pdpi pdpinhe...@gmail.com wrote:
 On Sep 8, 3:21 pm, nn prueba...@latinmail.com wrote:



  On Sep 8, 9:55 am, Mart. mdeka...@gmail.com wrote:

   On Sep 8, 2:16 pm, Andreas Tawn andreas.t...@ubisoft.com wrote:

 Hi,

 I need to extract a string after a matching a regular expression. For
 example I have the string...

 s = FTPHOST: e4ftl01u.ecs.nasa.gov

 and once I match FTPHOST I would like to extract
 e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
 problem, I had been trying to match the string using something like
 this:

 m = re.findall(rFTPHOST, s)

 But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
 part. Perhaps I need to find the string and then split it? I had some
 help with a similar problem, but now I don't seem to be able to
 transfer that to this problem!

 Thanks in advance for the help,

 Martin

No need for regex.

s = FTPHOST: e4ftl01u.ecs.nasa.gov
If FTPHOST in s:
    return s[9:]

Cheers,

Drea

   Sorry perhaps I didn't make it clear enough, so apologies. I only
   presented the example  s = FTPHOST: e4ftl01u.ecs.nasa.gov as I
   thought this easily encompassed the problem. The solution presented
   works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
   when I used this on the actual file I am trying to parse I realised it
   is slightly more complicated as this also pulls out other information,
   for example it prints

   e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
   'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
   0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',

   etc. So I need to find a way to stop it before the \r

   slicing the string wouldn't work in this scenario as I can envisage a
   situation where the string length increases and I would prefer not to
   keep having to change the string.

   Many thanks

  It is not clear from your post what the input is really like. But just
  guessing this might work:

   print s

  'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n','FTPHOST:
  e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r
  \n','Ftp Pull Download Links: \r\n'

   re.search(r'FTPHOST: (.*?)\\r',s).group(1)

  'e4ftl01u.ecs.nasa.gov'

 Except, I'm assuming, the OP's getting the data from a (windows-
 formatted) file, so \r\n shouldn't be escaped in the regex:

  re.search(r'FTPHOST: (.*?)\r',s).group(1)



I am just playing the guessing game like everybody else here. Since
the OP didn't use re.DOTALL and was getting more than one line for .*
I assumed that the \n was quite literally '\' and 'n'.
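
The reasoning about re.DOTALL can be checked with a short sketch (Python 3 syntax; the sample string uses real line endings and is copied from the email in this thread):

```python
import re

two_lines = "FTPHOST: e4ftl01u.ecs.nasa.gov\r\nFTPDIR: /PullDir/0301872638CySfQB\r\n"

# Without re.DOTALL, '.' matches everything except "\n", so the greedy
# group stops at the end of the first line (it still includes the "\r").
default = re.search(r'FTPHOST: (.*)', two_lines).group(1)

# With re.DOTALL, '.' also matches "\n" and the group runs to the end of the text.
dotall = re.search(r'FTPHOST: (.*)', two_lines, re.DOTALL).group(1)
```

So if .* without the flag returns several "lines", the text most likely contains no real newline characters at all.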

Re: Extracting patterns after matching a regex

2009-09-08 Thread nn

On Sep 8, 10:25 am, Mart. mdeka...@gmail.com wrote:
 On Sep 8, 3:21 pm, nn prueba...@latinmail.com wrote:



  On Sep 8, 9:55 am, Mart. mdeka...@gmail.com wrote:

   On Sep 8, 2:16 pm, Andreas Tawn andreas.t...@ubisoft.com wrote:

 Hi,

 I need to extract a string after a matching a regular expression. For
 example I have the string...

 s = FTPHOST: e4ftl01u.ecs.nasa.gov

 and once I match FTPHOST I would like to extract
 e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
 problem, I had been trying to match the string using something like
 this:

 m = re.findall(rFTPHOST, s)

 But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
 part. Perhaps I need to find the string and then split it? I had some
 help with a similar problem, but now I don't seem to be able to
 transfer that to this problem!

 Thanks in advance for the help,

 Martin

No need for regex.

s = FTPHOST: e4ftl01u.ecs.nasa.gov
If FTPHOST in s:
    return s[9:]

Cheers,

Drea

   Sorry perhaps I didn't make it clear enough, so apologies. I only
   presented the example  s = FTPHOST: e4ftl01u.ecs.nasa.gov as I
   thought this easily encompassed the problem. The solution presented
   works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
   when I used this on the actual file I am trying to parse I realised it
   is slightly more complicated as this also pulls out other information,
   for example it prints

   e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
   'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
   0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',

   etc. So I need to find a way to stop it before the \r

   slicing the string wouldn't work in this scenario as I can envisage a
   situation where the string length increases and I would prefer not to
   keep having to change the string.

   Many thanks

  It is not clear from your post what the input is really like. But just
  guessing this might work:

   print s

  'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n','FTPHOST:
  e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r
  \n','Ftp Pull Download Links: \r\n'

   re.search(r'FTPHOST: (.*?)\\r',s).group(1)

  'e4ftl01u.ecs.nasa.gov'

 Hi,

 That does work. So the \ escapes the \r, does this tell it to stop
 when it reaches the \r?

 Thanks

Indeed.

Re: Extracting patterns after matching a regex

2009-09-08 Thread nn

On Sep 8, 11:19 am, Dave Angel da...@ieee.org wrote:
 Mart. wrote:
  snip
  I have been doing this to turn the email into a string

  email = sys.argv[1]
  f = open(email, 'r')
  s = str(f.readlines())

  so FTPHOST isn't the first element, it is just part of a larger
  string. When I turn the email into a string it looks like...

  'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
  'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
  'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r
  \n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down
  load ZIP file of packaged order:\r\n',
  snip

 The mistake I see is trying to turn a list into a string, just so you
 can try to parse it back again.  Just write a loop that iterates through
 the list that readlines() returns.

 DaveA

No kidding.

Instead of this:
s = str(f.readlines())

ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
ftpdir  = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
url = 'ftp://' + ftphost + ftpdir


I would have possibly done something like this (not tested):
lines = f.readlines()
header={}
for row in lines:
    key,sep,value = row.partition(':')[2].rstrip()
    header[key.lower()]=value
url = 'ftp://' + header['ftphost'] + header['ftpdir']

Re: Extracting patterns after matching a regex

2009-09-08 Thread nn

On Sep 8, 12:16 pm, nn prueba...@latinmail.com wrote:
 On Sep 8, 11:19 am, Dave Angel da...@ieee.org wrote:



  Mart. wrote:
   snip
   I have been doing this to turn the email into a string

   email = sys.argv[1]
   f = open(email, 'r')
   s = str(f.readlines())

   so FTPHOST isn't the first element, it is just part of a larger
   string. When I turn the email into a string it looks like...

   'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
   'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
   'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r
   \n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down
   load ZIP file of packaged order:\r\n',
   snip

  The mistake I see is trying to turn a list into a string, just so you
  can try to parse it back again.  Just write a loop that iterates through
  the list that readlines() returns.

  DaveA

 No kidding.

 Instead of this:
 s = str(f.readlines())

 ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
 ftpdir  = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
 url = 'ftp://' + ftphost + ftpdir

 I would have possibly done something like this (not tested):
 lines = f.readlines()
 header={}
 for row in lines:
     key,sep,value = row.partition(':')[2].rstrip()
     header[key.lower()]=value
 url = 'ftp://' + header['ftphost'] + header['ftpdir']

Well, I did say "not tested". That should of course be:
lines = f.readlines()
header = {}
for row in lines:
    key, sep, value = row.partition(':')
    # strip() also drops the space that follows the colon
    header[key.lower()] = value.strip()
url = 'ftp://' + header['ftphost'] + header['ftpdir']
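
For reference, a self-contained version of the same idea is sketched below (Python 3 syntax). The function name parse_header is hypothetical, the sample lines are taken from the email quoted in this thread, and the check on sep is an addition so that lines without a colon are skipped.

```python
def parse_header(lines):
    """Map each lower-cased label to the text that follows its first colon."""
    header = {}
    for row in lines:
        key, sep, value = row.partition(":")
        if sep:  # partition returns an empty separator when no colon is found
            header[key.strip().lower()] = value.strip()
    return header

lines = [
    "FINISHED: 09/07/2009 08:42:31\r\n",
    "\r\n",
    "MEDIATYPE: FtpPull\r\n",
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n",
    "FTPDIR: /PullDir/0301872638CySfQB\r\n",
    "Ftp Pull Download Links: \r\n",
]

header = parse_header(lines)
url = "ftp://" + header["ftphost"] + header["ftpdir"]
```

Because partition splits only at the first colon, the FINISHED timestamp keeps its own colons intact.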


Re: Extracting patterns after matching a regex

2009-09-08 Thread Mart.

On Sep 8, 4:33 pm, MRAB pyt...@mrabarnett.plus.com wrote:
 Mart. wrote:
  On Sep 8, 3:53 pm, MRAB pyt...@mrabarnett.plus.com wrote:
  Mart. wrote:
  On Sep 8, 3:14 pm, Andreas Tawn andreas.t...@ubisoft.com wrote:
  Hi,
  I need to extract a string after a matching a regular expression. For
  example I have the string...
  s = FTPHOST: e4ftl01u.ecs.nasa.gov
  and once I match FTPHOST I would like to extract
  e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
  problem, I had been trying to match the string using something like
  this:
  m = re.findall(rFTPHOST, s)
  But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
  part. Perhaps I need to find the string and then split it? I had some
  help with a similar problem, but now I don't seem to be able to
  transfer that to this problem!
  Thanks in advance for the help,
  Martin
  No need for regex.
  s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
  if "FTPHOST" in s:
      host = s[9:]  # drop the 9-character "FTPHOST: " prefix
  Cheers,
  Drea
  Sorry perhaps I didn't make it clear enough, so apologies. I only
  presented the example  s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I
  thought this easily encompassed the problem. The solution presented
  works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
  when I used this on the actual file I am trying to parse I realised it
  is slightly more complicated as this also pulls out other information,
  for example it prints
  e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
  'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
  0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
  etc. So I need to find a way to stop it before the \r.
  Slicing the string wouldn't work in this scenario as I can envisage a
  situation where the string length increases and I would prefer not to
  keep having to change the string.
  If, as Terry suggested, you do have a tuple of strings and the first 
  element has FTPHOST, then s[0].split(':')[1].strip() will work.
  It is an email which contains information before and after the main
  section I am interested in, namely...
  FINISHED: 09/07/2009 08:42:31
  MEDIATYPE: FtpPull
  MEDIAFORMAT: FILEFORMAT
  FTPHOST: e4ftl01u.ecs.nasa.gov
  FTPDIR: /PullDir/0301872638CySfQB
  Ftp Pull Download Links:
 ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
  Down load ZIP file of packaged order:
 ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
  FTPEXPR: 09/12/2009 08:42:31
  MEDIA 1 of 1
  MEDIAID:
  I have been doing this to turn the email into a string
  email = sys.argv[1]
  f = open(email, 'r')
  s = str(f.readlines())
  To me that seems a strange thing to do. You could just read the entire
  file as a string:

       f = open(email, 'r')
       s = f.read()

  so FTPHOST isn't the first element, it is just part of a larger
  string. When I turn the email into a string it looks like...
  'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
  'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
  'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r
  \n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down
  load ZIP file of packaged order:\r\n',
  So not sure splitting it like you suggested works in this case.

  Within the file are a list of files, e.g.

  TOTAL FILES: 2
             FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf
             FILESIZE: 11028908

             FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
             FILESIZE: 18975

  and what i want to do is get the ftp address from the file and collect
  these files to pull down from the web e.g.

  MOD13A2.A2007033.h17v08.005.2007101023605.hdf
  MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml

  Thus far I have

  #!/usr/bin/env python

  import sys
  import re
  import urllib

  email = sys.argv[1]
  f = open(email, 'r')
  s = str(f.readlines())
  m = re.findall(rMOD\.\.h..v..\.005\..\
  \, s)

  ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
  ftpdir  = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
  url = 'ftp://' + ftphost + ftpdir

  for i in xrange(len(m)):

     print i, ':', len(m)
     file1 = m[i][:-4]               # remove xml bit.
     file2 = m[i]

     urllib.urlretrieve(url, file1)
     urllib.urlretrieve(url, file2)

  which works. Clearly my match for the MOD13A2* files isn't ideal, I
  guess, but they will always occupy those dimensions, so it should
  work. Any suggestions on how to improve this are appreciated.
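
One possible improvement, sketched here under assumptions rather than taken from the thread: match the "FILENAME:" lines instead of the fixed-width MOD13A2 pattern, and give each file its own URL (the script above passes the directory URL to urlretrieve for both files). The sample text is assembled from the excerpts quoted in this thread, field is a hypothetical helper, and Python 3 is assumed. Nothing is downloaded.

```python
import re

# Sample text assembled from the excerpts quoted in this thread.
text = (
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
    "TOTAL FILES: 2\r\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\r\n"
    "\t\tFILESIZE: 11028908\r\n"
    "\r\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\r\n"
    "\t\tFILESIZE: 18975\r\n"
)


def field(name, text):
    """Hypothetical helper: first value of a 'NAME: value' line."""
    pattern = r'^[ \t]*' + re.escape(name) + r':[ \t]*(\S+)'
    match = re.search(pattern, text, re.MULTILINE)
    if match is None:
        raise ValueError('%s not found' % name)
    return match.group(1)


# \S+ stops at the first whitespace character, which includes the \r.
base = 'ftp://' + field('FTPHOST', text) + field('FTPDIR', text)
names = re.findall(r'^[ \t]*FILENAME:[ \t]*(\S+)', text, re.MULTILINE)
urls = [base + '/' + name for name in names]

for u in urls:
    print(u)
```

Each entry in urls could then be handed to whatever download call is in use, with the matching entry in names as the local file name.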

 Suppose the file contains your example text above. Using 'readlines'
 returns a list of the lines:

   f = open(email, 'r')
   lines = f.readlines()
   lines
 ['TOTAL FILES: 2\n', '\t\tFILENAME:
 MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n', '\t\tFILESIZE:
 11028908\n', '\n', '\t\tFILENAME:
 MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n', '\t\tFILESIZE:
 18975\n']

 Using 'str' on that list then converts

Re: Extracting patterns after matching a regex

2009-09-08 Thread Mart.

On Sep 8, 3:21 pm, nn prueba...@latinmail.com wrote:
 On Sep 8, 9:55 am, Mart. mdeka...@gmail.com wrote:



  On Sep 8, 2:16 pm, Andreas Tawn andreas.t...@ubisoft.com wrote:

Hi,

I need to extract a string after matching a regular expression. For
example I have the string...

s = "FTPHOST: e4ftl01u.ecs.nasa.gov"

and once I match FTPHOST I would like to extract
e4ftl01u.ecs.nasa.gov. I am not sure as to the best approach to the
problem, I had been trying to match the string using something like
this:

m = re.findall(r"FTPHOST", s)

But I couldn't then work out how to return the e4ftl01u.ecs.nasa.gov
part. Perhaps I need to find the string and then split it? I had some
help with a similar problem, but now I don't seem to be able to
transfer that to this problem!

Thanks in advance for the help,

Martin

   No need for regex.

   s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
   if "FTPHOST" in s:
       host = s[9:]  # drop the 9-character "FTPHOST: " prefix

   Cheers,

   Drea

  Sorry perhaps I didn't make it clear enough, so apologies. I only
  presented the example  s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I
  thought this easily encompassed the problem. The solution presented
  works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
  when I used this on the actual file I am trying to parse I realised it
  is slightly more complicated as this also pulls out other information,
  for example it prints

  e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
  'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
  0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',

  etc. So I need to find a way to stop it before the \r.

  Slicing the string wouldn't work in this scenario as I can envisage a
  situation where the string length increases and I would prefer not to
  keep having to change the string.

  Many thanks

 It is not clear from your post what the input is really like. But just
 guessing this might work:

  print s

 'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n','FTPHOST:
 e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r
 \n','Ftp Pull Download Links: \r\n'

  re.search(r'FTPHOST: (.*?)\\r',s).group(1)

 'e4ftl01u.ecs.nasa.gov'

Hi,

That does work. So the \ escapes the \r, does this tell it to stop
when it reaches the \r?

Thanks
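
A short sketch of what seems to be going on here, assuming Python 3 and two sample lines from the email quoted in this thread. Calling str() on the list from readlines() produces the repr of each line, so the carriage return turns into two ordinary characters, a backslash and an "r". In a raw-string pattern, \\r matches exactly that pair, and the non-greedy (.*?) stops at its first occurrence.

```python
import re

lines = ['FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
         'FTPDIR: /PullDir/0301872638CySfQB\r\n']

# str() of a list yields the repr of each element, so the carriage return
# becomes the two characters backslash and 'r'.
s = str(lines)

# In a raw string, \\r is a regex for a literal backslash followed by 'r'.
# The non-greedy (.*?) captures as little as possible before that point.
host = re.search(r'FTPHOST: (.*?)\\r', s).group(1)

# On the unconverted text the carriage return is a single character,
# and the regex escape \r matches it directly.
raw = ''.join(lines)
host_raw = re.search(r'FTPHOST: (.*?)\r', raw).group(1)

print(host)
print(host_raw)
```

So the first backslash escapes the second one; it is the combination with the non-greedy group that makes the match stop there.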
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Extracting patterns after matching a regex

2009-09-08 Thread Terry Reedy


Mart. wrote:


If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, 
then s[0].split(':')[1].strip() will work.


It is an email which contains information before and after the main
section I am interested in, namely...

FINISHED: 09/07/2009 08:42:31

MEDIATYPE: FtpPull
MEDIAFORMAT: FILEFORMAT
FTPHOST: e4ftl01u.ecs.nasa.gov
FTPDIR: /PullDir/0301872638CySfQB
Ftp Pull Download Links:
ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
Down load ZIP file of packaged order:
ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
FTPEXPR: 09/12/2009 08:42:31
MEDIA 1 of 1
MEDIAID:

I have been doing this to turn the email into a string

email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())


So don't do that. Or rather, scan the list of lines returned by 
.readlines *before* dumping it all into one line.


Or, try the email module. When the email parser returns a 
.message.Message instance, msg['FTPHOST'] will give you what you want.
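
A minimal sketch of the email-module route, assuming Python 3. The sample message is made up for illustration and places FTPHOST and FTPDIR in the header block; if those lines sit in the body of the real email, as the excerpt above suggests they may, the lookup returns None and the body would need to be parsed separately.

```python
import email

# Made-up sample: the fields are placed in the header block here,
# which is an assumption about the message layout.
raw = (
    "FTPHOST: e4ftl01u.ecs.nasa.gov\n"
    "FTPDIR: /PullDir/0301872638CySfQB\n"
    "\n"
    "Ftp Pull Download Links:\n"
)

msg = email.message_from_string(raw)

host = msg['FTPHOST']
ftpdir = msg['ftpdir']      # header lookup is case-insensitive
missing = msg['MEDIAID']    # absent headers give None, not an exception

print('ftp://' + host + ftpdir)
```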


tjr

--
http://mail.python.org/mailman/listinfo/python-list
