[python-win32] Bug When Reading In PSTs

Nick Orr Wed, 17 Feb 2021 13:46:22 -0800

Hi all,

I've been developing a Python tool to ingest and write all emails from a
PST exported from Outlook to individual .html files. The issue is that when
opening the PST in outlook and checking the source information for emails
individually, it includes this specific line:


<meta http-equiv=Content-Type content="text/html; charset=utf-8">

which IS NOT being included when importing the PST with Pywin32 and reading
all the emails in the PST. To see what it looks like in a chunk -

>From Outlook I get:<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="
http://schemas.microsoft.com/office/2004/12/omml"; xmlns="
http://www.w3.org/TR/REC-html40";><head><meta http-equiv=Content-Type
content="text/html; charset=utf-8"><meta name=Generator content="Microsoft
Word 15 (filtered medium)">

What is exported from the tool: <html
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="
http://schemas.microsoft.com/office/2004/12/omml"; xmlns="
http://www.w3.org/TR/REC-html40";><head><meta name=Generator
content="Microsoft Word 15 (filtered medium)">

The contents of the emails are otherwise ENTIRELY identical except for that
one tag.

My code:
-----------------------------------------------------------------------------------------------------

def find_pst_folder(OutlookObj, pst_filepath):
    for Store in OutlookObj.Stores:
        if Store.IsDataFileStore  and Store.FilePath == pst_filepath:
            return Store.GetRootFolder()
    return None

def enumerate_folders(FolderObj):
    for ChildFolder in FolderObj.Folders:
        enumerate_folders(ChildFolder)
    iterate_messages(FolderObj)

def iterate_messages(FolderObj):
    global mycounter2
    global encryptedEmails
    global richPlainEmails
    global totalEmails
    global htmlEmails

    for item in FolderObj.Items:
        totalEmails += 1
        try:
            try:
                body_content = item.HTMLbody
                mysubject = item.Subject
                writeToFile(body_content, exportPath, mysubject)
                mycounter2 = mycounter2 + 1
                htmlEmails += 1
            except AttributeError:
                #print('Non HTML formatted email, passing')
                richPlainEmails += 1
                pass
        except Exception as e:
            encryptedEmails += 1
            pass

def writeToFile(messageHTML, path, mysubject):
    global mycounter2
    filename = '\htmloutput' + str(mycounter2) + '.html'

    #Check if email is rich or plain text first (only HTML emails are
desired)
    if '<!-- Converted from text/plain format -->' in messageHTML or '<!--
Converted from text/rtf format -->' in messageHTML:
        raise AttributeError()

    else:
        file = open(path + filename, "x", encoding='utf-8')
        try:
            messageHTML = regex.sub('\r\n', '\n', messageHTML)
            file.write(messageHTML)

        #Handle any potential unexpected Unicode error
        except Exception as e:
            print('Exception: ' , e)
            try:
                #Prints email subject to more easily find the offending
email
                print('Subject: ', mysubject)
                print(mycounter2)
                file.write(messageHTML)
            except Exception as e:
                print('Tried utf decode: ', e)

        file.close()

htmlEmails = 0
encryptedEmails = 0
totalEmails = 0
richPlainEmails = 0
filenameCount = 1
mycounter2 = 1

#Adjusting name of PST location to be readable
selectedPST = str(selectedPST.replace('/', '\\'))
print('\nRunning:' , selectedPST)
outlook.AddStore(selectedPST)
PSTFolderObj = find_pst_folder(outlook, selectedPST)


-----------------------------------------------------------------------------------------------------

Because the emails otherwise are identical, I can only assume this is being
done by the library. I'm wondering if there's a reason that meta tag is
excluded, or if it's a bug in PyWin32?

Thanks for any input,
-Nick

_______________________________________________
python-win32 mailing list
python-win32@python.org
https://mail.python.org/mailman/listinfo/python-win32

[python-win32] Bug When Reading In PSTs

Reply via email to