I promised a while back to do a generic write-up on selecting an Email
Archival Solution; figured I better finish this set of scribbles before
I shuffle off from the company next week...  If anyone wants to throw
some of this up on a blog somewhere, feel free.  Since I'm finishing
this up in a rush, there are undoubtedly considerations I've forgotten
to include, and I strove to cover only those relevant to the majority
of companies, but this should be a good start for any company
considering archival.


This is written from the perspective of an Exchange Administrator;
Exchange as your core email solution is assumed, but most of the
generalities within could apply to any email solution.  This information
is not definitive nor unbiased; it only represents the empirical
findings of a couple of administrators.

Email Archival has been a hot point in the industry for some time now,
with no real consensus on best-practices or best-in-breed products.
Different parts of industry drive this division in views by having
different requirements.  In general, it's a system of pulling email out
of its native storage system and placing it somewhere else; the
specifics end there, though.  So, when considering Email Archival, the
first thing you need to do is define what it means to your company.  Why
do you want to do archiving?  From there, you should be able to work
into the second big question: What features do you need this tool suite
to have?


There are four primary reasons to want to do archiving: mailbox size
management; legal regulation; litigation response/rules of procedure;
content management.  Deciding upon your primary driver first is key to
understanding your path forward - and how closely entangled your
Archival implementation needs to be with your Legal department.
(Hint: the answer is almost always: VERY.)
Mailbox size management: are you nuts? You want to take all that email
data out of a system that is designed to manage email data and stick it
somewhere else, increasing the complexity of the whole system and the
number of steps your users need to actually do anything?  Generally
speaking, the tools within Exchange are sufficient for simple mailbox
size management.  If you need additional space, it is almost always
cheaper to simply expand your Exchange databases/storage groups/servers
rather than implement a whole new system on the side.  With the advent
of Exchange 2007, you don't even need to sustain the same level of disk
I/O as previously, so larger, cheaper disks become an option within the
email system itself, rather than only via a third-party archiving
solution.
Legal regulation: an easy call, relatively speaking.  The requirements
of the system should be laid out and decided for you.  You still need to
discuss with your Legal department what additional aspects need to be
considered.
Litigation response: the most involved scenario for legal requirements
gathering.  Every industry, every business, will have a slightly
different focus.  Heavy involvement with the legal department will be
required.  You need to be prepared to tell them what they have
forgotten to consider, or assumed you already knew, or you will find
the requirements changing drastically after implementation.
Content Management: It's litigation response, plus.  Plus everything.
This is generally for large organizations trying to get a handle on what
data they have, where it is, and how they want it managed.  Like
litigation response, this generally starts with some very vague ideas
about the requirements and a lack of understanding of just how involved
the decision sets can/need to be.

Usually when initially approached about retention periods for email,
Legal Departments will state that you need to keep everything forever,
or delete everything after 30 days.  In a few cases, one of these
responses might be appropriate, but generally, both are useless.
In the former, you wind up having so much garbage in the archive that it
is impossible to find anything useful (do you really need to keep the
note from your wife from 8 years ago, asking you to pick up a gallon of
milk on the way home?) while in the latter, there is nothing useful
left, and the users are upset because they can't reference the older
items, either.  So, retention periods probably need to be more
selective.  You need to determine how you want that selectivity to
occur, though.  Only certain users (e.g., executives or lawyers)?
Only certain folders in a mailbox?  Only certain content?  Determining
how that selectivity needs to occur will be a driving factor for product
selection.  Do you need to guarantee every item is captured?  Or can you
put some responsibility on the user to classify what must be archived?
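Those selectivity questions compose into a simple eligibility test. As a minimal sketch (the addresses, folder names, and keywords below are hypothetical, and real products would inspect MAPI properties rather than a toy record), the by-user, by-folder, and by-content rules might combine like this:

```python
from dataclasses import dataclass

# Hypothetical message record; real archiving products read MAPI
# properties, but the selection logic composes the same way.
@dataclass
class Message:
    sender: str
    folder: str
    subject: str

# One rule per selectivity question: users, folders, content.
ARCHIVE_USERS = {"ceo@example.com", "counsel@example.com"}
ARCHIVE_FOLDERS = {"Contracts", "Retain"}
ARCHIVE_KEYWORDS = ("agreement", "invoice")

def should_archive(msg: Message) -> bool:
    """Capture a message if ANY configured rule matches; tightening
    this to ALL rules is exactly the policy call Legal must make."""
    return (
        msg.sender in ARCHIVE_USERS
        or msg.folder in ARCHIVE_FOLDERS
        or any(k in msg.subject.lower() for k in ARCHIVE_KEYWORDS)
    )

print(should_archive(Message("ceo@example.com", "Inbox", "lunch")))          # True
print(should_archive(Message("bob@example.com", "Inbox", "Invoice 42")))     # True
print(should_archive(Message("bob@example.com", "Inbox", "gallon of milk"))) # False
```

Note the milk note from eight years ago falls through every rule - which is the point of selectivity.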


There are three basic methods by which data might be moved into the
archive; most vendors offer a choice between two of the three: MAPI,
Event Sink, or journaling.
MAPI will use a standard MAPI login to the mailbox being archived,
typically from a separate application server.  It might be a continuous
logon or a scheduled one; it will have all the overhead of a MAPI
connection, plus whatever code the vendor is using to filter out items
for archiving, plus overhead to remove items (if applicable), plus
overhead to insert stubs (if applicable).  Suffice it to say, this
overhead can be significant in some circumstances.
Event Sink runs on the Exchange server, as an extra step during message
processing.  It is more efficient than MAPI and guaranteed to review
every message (MAPI isn't), but can increase the load significantly and
possibly cause delays in mailflow.
Journaling is a built-in Exchange method of copying all mail sent to
mailboxes in a storage group to a different mailbox.  This can be
combined with a MAPI or Event Sink application, which then runs only
against the journal mailbox instead of every mailbox.  Journaling alone
may meet some organizations' archival needs without any third-party
addition, but it is disk and processor intensive.
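The journaling architecture is worth seeing in miniature. This is only a toy model (the mailbox names and in-memory structures are invented for illustration, not anything Exchange exposes), but it shows why a journal-fed archiver is cheaper than the alternatives: every delivery is copied unconditionally to one place, and the archiver drains only that one mailbox instead of logging on to all of them (MAPI) or running inside message processing (the event sink):

```python
from collections import deque

# Toy model of Exchange journaling: every message delivered to any
# mailbox in the storage group is also copied to one journal mailbox.
mailboxes = {"alice": [], "bob": []}
journal = deque()

def deliver(recipient: str, body: str) -> None:
    mailboxes[recipient].append(body)
    journal.append((recipient, body))   # journaling: unconditional copy

def drain_journal(archive: list) -> None:
    # The archiver touches only the journal, never user mailboxes.
    while journal:
        archive.append(journal.popleft())

archive = []
deliver("alice", "Q3 forecast")
deliver("bob", "Q3 forecast")
drain_journal(archive)
print(len(archive))  # 2 - one archived copy per delivery
```

The unconditional copy is also why journaling is disk and processor intensive: the write happens whether or not the message would ever match an archive policy.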


Retrieval methods also vary greatly from vendor to vendor.  Many vendors
offer multiple methods of data retrieval; what will work in your
environment?
Mailbox retrieval is generally accomplished by leaving a "stub" item in
the user's mailbox.  When the stub is opened, the message is retrieved
and presented to the user.  The method of retrieval, however, can also
vary greatly.  Perhaps the stub is a custom form that must be installed
in your Organizational Forms library and that, when opened, calls a web
server, which retrieves the data from the archive repository and
presents it to the user in the custom form.  Perhaps it posts a request
into an application mailbox that a service continually monitors; the
service processes the request and posts the retrieved item into the
user's mailbox, where it then has to be opened.  Perhaps opening the
item executes
an Outlook add-in which fetches the item from an archive itself.  There
are many ways to implement stub retrieval, all of which have different
implications for supportability, load balancing, and fault tolerance.
A fat client is simply an installed application on the desktop, which
allows users to access, search, and sometimes manage the archive,
directly.
A web interface should be similar to a fat client, but would be hosted
as a web page somewhere, with the application doing the work there.
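Whichever transport a vendor picks, the stub mechanism reduces to the same indirection: a small placeholder carrying an archive ID, resolved against the repository on open. A minimal sketch, with invented IDs and in-memory stand-ins for the mailbox and repository:

```python
# Stub-based retrieval in miniature: the mailbox keeps a small stub
# carrying an archive ID; "opening" it resolves the ID against the
# repository. Real products route this hop through a custom form, a
# web service, or an Outlook add-in - the indirection is the same.
repository = {}   # stand-in for the archive store
mailbox = []      # stand-in for the user's mailbox

def archive_item(item_id: str, body: str) -> None:
    repository[item_id] = body                                    # full item to archive
    mailbox.append({"stub_for": item_id, "preview": body[:20]})   # stub stays behind

def open_item(stub: dict) -> str:
    # The retrieval hop: fetch from the repository on demand.
    return repository[stub["stub_for"]]

archive_item("msg-001", "Full contract text, 4 MB of attachments...")
print(open_item(mailbox[0]))
```

The supportability questions in the paragraph above all live in that one retrieval hop: what happens when the web server, service, or add-in behind `open_item` is down, overloaded, or unreachable over the WAN.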


Security is a strong concern in some places, not so much in others.  How
does the solution prevent users from retrieving each other's data?  Is
there a way to allow a user to access someone else's data,
intentionally?  Does the archive maintain its own security model, or is
it integrated with Active Directory or another security provider?  If
it is integrated, does that mean it synchronizes and maintains its own
copy, or does it make security calls against the directory directly?
How is the integrity of the archive guaranteed (i.e., are users allowed
to delete things from it or not)?
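To make those questions concrete, here is a deliberately tiny access-control sketch (the addresses and the delegation table are hypothetical): each archive entry has an owner, plus explicit delegations for intentional cross-user access. Whether checks like these run against a synchronized copy of the directory or against the directory itself is precisely the integration question to put to the vendor:

```python
# Minimal ACL sketch: a reader may see an owner's archived data only
# if the reader IS the owner, or holds an explicit delegation.
delegations = {"assistant@example.com": {"exec@example.com"}}  # reader -> owners

def can_read(reader: str, owner: str) -> bool:
    return reader == owner or owner in delegations.get(reader, set())

print(can_read("exec@example.com", "exec@example.com"))       # True: own data
print(can_read("assistant@example.com", "exec@example.com"))  # True: delegated
print(can_read("intern@example.com", "exec@example.com"))     # False
```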

Integration with other data sources is chiefly a concern for Content
Management implementations, but it can matter for other drivers as
well - and it never hurts to consider the future (will you ever need
Content Management?).  A Content Management initiative will
often include - either currently or when you turn your back a month
after implementation - other data sources as well, such as file servers,
SharePoint, or some other databases.  If so, does the solution have an
integrated answer for all platforms?  You may sacrifice some
best-in-breed features by going with a single vendor for all sources,
but you will probably gain cost savings and a single method of
retrieval/search/whatever for all data... which is usually sort of the
point (or one of the points) of a Content Management initiative. 

Topology considerations will be insignificant for small companies, but
of the utmost importance for geographically disparate ones.  Where is
data stored - single point or multiple locations?  Does the application
run in multiple places, or just one?  How does the storage function work
over the WAN?  How does the retrieval work over the WAN?  If there are
multiple repositories, how do they communicate with each other and how
do referrals to other repositories occur, if at all?

Every policy/feature consideration probably has a technical one to go
with it - which you can bet the archive vendors won't tell you.
For instance, leaving stub items in the mailbox is a great usability
feature, but one of the tradeoffs is possible performance - it's not the
size of your database that primarily drives performance in Exchange, but
rather the number of items; leaving stubs does nothing to reduce number
of items and will, in fact, swiftly increase it over time.
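A few lines of arithmetic make the stub tradeoff visible. In this toy model (the sizes are invented), stubbing every item in a 1,000-item mailbox shrinks the bytes dramatically, but the item count - the number that actually drives Exchange performance - never drops:

```python
# Stub tradeoff in numbers: bytes shrink, item count does not.
items = [{"bytes": 50_000} for _ in range(1000)]   # 1,000 full items
before_count = len(items)
before_bytes = sum(i["bytes"] for i in items)

for i in items:          # stub every item: body moves to the archive,
    i["bytes"] = 500     # a small stub remains in the mailbox

print(before_count, len(items))                      # 1000 1000 - unchanged
print(before_bytes, sum(i["bytes"] for i in items))  # 50000000 500000
```

And since new mail keeps arriving while old stubs linger, the item count only grows over time.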

Offline access is completely unimportant for some companies, but
considered essential at others.  Does the solution have any sort of
offline cache for traveling users?  If so, how does the cache operate -
how is it populated, synchronized, encrypted?  Is there a size cap?
Does its existence on a laptop violate any of the drivers the Legal
Department is pushing in order to run the project in the first place?
For instance, if the vendor is just using their own PST to provide an
offline archive, you can run into a 2GB space limitation on a file that
is weakly encrypted at best, and if a primary driver is to remove PSTs
from your environment, this may not be a viable offline solution for
you.
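The size-cap question in particular is worth pinning down. As a hedged sketch (the class and cap are invented for illustration; real products may cap by item count, age, or not at all, and encryption and sync are deliberately out of scope here), a capped offline cache behaves something like this - the PST example above is effectively a hard 2GB cap:

```python
from collections import OrderedDict

CAP_BYTES = 2 * 1024**3   # the classic 2GB PST ceiling

class OfflineCache:
    """Toy size-capped offline cache: populated on sync, evicting the
    oldest entries once the cap is exceeded."""
    def __init__(self, cap: int = CAP_BYTES):
        self.cap = cap
        self.items: OrderedDict[str, int] = OrderedDict()  # item id -> size
        self.used = 0

    def add(self, item_id: str, size: int) -> None:
        self.items[item_id] = size
        self.used += size
        while self.used > self.cap:              # evict oldest first
            _, evicted_size = self.items.popitem(last=False)
            self.used -= evicted_size

cache = OfflineCache(cap=100)   # tiny cap to show eviction
cache.add("a", 60)
cache.add("b", 60)              # total 120 > 100, so "a" is evicted
print(list(cache.items))        # ['b']
```

Whatever the real mechanism, the questions stand: what gets evicted, when, and does the user find out before boarding the plane.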


Finally, Pilot The Solution.  Do NOT pick a vendor just from
discussions, data and presentations.  Get Your Hands Dirty.  My old
company issued RFPs to eight companies and brought three in for testing.
Some things came out in testing that - though probably just fine for
other companies' needs - would have left us very unhappy if we'd just
gone with the vendor who seemed to fit the best from the RFPs.



 



 