RE: RegExp and XML

wiggins Fri, 21 Feb 2003 06:35:59 -0800

------------------------------------------------
On Fri, 21 Feb 2003 14:16:11 -0000, "Vincent O' Keeffe" <[EMAIL PROTECTED]> wrote:


> Hi there,
> 
> I'm downloading XML files from a mail server using POP3Client. I'm saving these
> mails as files in a directory. I also parse out certain details from the XML file
> to include in the filename.
> 
> The only problem is that, when I grab the body of the message using POP3Client's
> method, it includes mail-related information above and below the actual XML tags.
> 
> ------=_Part_15_1895070.1044374870502
> Content-Type: text/plain
> Content-Transfer-Encoding: 7bit
> 
> <?xml version="1.0"?>
> ...
> 
> </Order>
> ------=_Part_15_1895070.1044374870502--
> 
> 
> So, I need to remove everything before the opening <?xml string and, again,
> everything after the closing </Order> tag. I thought about stripping out the first
> 4 and last 4 lines of the file, but the messages sometimes arrive clean, and
> sometimes with this extra info.
> 
> I've trawled newsgroups and the web and haven't been able to come up with any
> answers.
> 
> Does anyone have any idea of how to go about this if I assign the body to a
> variable like so?
> 
> $msgbody = $pop->Body($i);   # $pop being the instantiated POP connection object
> 

The extra lines are MIME multipart boundaries and part headers. So you could do the 
typical thing: check the Content-Type of the main message and, if it is multipart, 
pull the boundary string out of it, then for each part keep only what sits between 
the boundaries after the part's first blank line. Non-multipart messages get regular 
parsing... ick. You might instead look into the various MIME modules, consider a POP3 
client module that handles multiparts, or check whether POP3Client will do so itself. 
One set of modules that does is Mail::Box, but it is very complex and thorough, so it 
may be overkill for you.
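
If you do want to roll it by hand, here is a rough sketch of the boundary approach, 
not a real MIME parser. It assumes you pass it the whole raw message, headers 
included (I believe POP3Client has a HeadAndBody method for that, but check its 
docs), that parts are 7bit as in your sample, and that multiparts are not nested. 
The name message_parts is made up for the example.

```perl
use strict;
use warnings;

# Return the content of each part of a raw message (headers plus body).
# A non-multipart message yields a single element: its body.
sub message_parts {
    my ($raw) = @_;
    $raw =~ s/\r\n/\n/g;                  # normalise line endings

    # The message headers end at the first blank line.
    my ($head, $body) = split /\n\n/, $raw, 2;
    $body = '' unless defined $body;

    # Unfold continuation lines so "boundary=" sits on the Content-Type line.
    $head =~ s/\n[ \t]+/ /g;

    # Not multipart: hand back the body for regular parsing.
    return ($body)
        unless $head =~ /^Content-Type:\s*multipart\/[^\n]*?\bboundary="?([^";\n]+)"?/mi;
    my $boundary = $1;

    # Throw away the closing "--boundary--" line and anything after it.
    $body =~ s/^--\Q$boundary\E--.*\z//ms;

    # Split on the "--boundary" lines; the first chunk is the preamble.
    my @chunks = split /^--\Q$boundary\E[ \t]*\n/m, $body;
    shift @chunks;

    my @parts;
    for my $chunk (@chunks) {
        # Part headers (possibly none), a blank line, then the content.
        next unless $chunk =~ /\A(?:[^\n]+\n)*\n(.*)\z/s;
        my $content = $1;
        $content =~ s/\n\z//;             # that newline belongs to the boundary
        push @parts, $content;
    }
    return @parts;
}
```

Anything beyond this (nested multiparts, base64 or quoted-printable parts) is where 
the MIME modules earn their keep.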

The other option is to scrap the mail handling completely and just look for the start 
of the XML, since it is *supposed* to be well-formed. That is, you should be able to 
look for the XML declaration, a doctype, or the first element to know where to start 
processing, and the name of the first element tells you where to stop: at its 
closing tag.
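
Something along these lines might do for the regex route. It is only a sketch: it 
assumes every document starts with an <?xml declaration, as in your sample, and that 
no comment ahead of the root element contains something that looks like a tag. The 
name extract_xml is made up for the example.

```perl
use strict;
use warnings;

# Return the XML document embedded in a message body, or undef if none is found.
sub extract_xml {
    my ($body) = @_;
    return unless defined $body;

    # Drop everything before the XML declaration.
    return unless $body =~ /(<\?xml\b.*)\z/s;
    my $xml = $1;

    # The first tag that opens with a name character (so not "<?" or "<!")
    # is taken to be the root element.
    return unless $xml =~ /<([A-Za-z_][\w:.-]*)/;
    my $root = $1;

    # Keep everything up to the last closing tag of the root element.
    return unless $xml =~ /\A(.*<\/\Q$root\E\s*>)/s;
    return $1;
}
```

You would then call it as `my $xml = extract_xml($msgbody);`. A message that arrives 
clean passes through unchanged, and you get undef back when there is no XML at all.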

http://danconia.org
