http://www.md.kth.se/~chr/lyx/examples/prosper/layout_by_Weiss/prosper-by-Weiss.layout
Cheers, Raphael
#LyX 1.3 created this file. For more info see http://www.lyx.org/ \lyxformat 221 \textclass prosper-by-Weiss \options pdf \language english \inputencoding auto \fontscheme default \graphics default \paperfontsize default \spacing single \papersize Default \paperpackage a4 \use_geometry 0 \use_amsmath 0 \use_natbib 0 \use_numerical_citations 0 \paperorientation portrait \secnumdepth 3 \tocdepth 3 \paragraph_separation indent \defskip medskip \quotes_language english \quotes_times 2 \papercolumns 1 \papersides 1 \paperpagestyle default
\layout Title Distributed and Paged Suffix Trees for Large Genetic Databases \layout Author Raphael Clifford and Marek Sergot \layout Institution Imperial College, London, UK \layout Slide Overview \layout Itemize Why do we need distributed and paged suffix trees? \layout Itemize What do we mean by a distributed or paged suffix tree? \layout Itemize How can we construct one efficiently? \layout Itemize What can we do with it? \layout Slide Motivation \layout Standard Why do we need distributed and paged suffix trees? \layout Itemize Suffix trees use a lot of memory. \layout Itemize Their construction and querying algorithms have very poor memory locality so secondary storage can on not be used. \layout Itemize Sequencing projects are producing more and more data. \layout Itemize Smarter bioinformatics algorithms are being developed which require more complex data structures such as the suffix tree. \layout Slide Sparse suffix trees \layout Standard A sparse suffix tree (SST) of a string \begin_inset Formula $t$ \end_inset is the compacted trie of a subset of the suffixes of \begin_inset Formula $t$ \end_inset . \layout Itemize Evenly spaced suffixes handled by Karkkainen and Ukkonen in 1996. \layout Itemize Andersson gave a quite different construction algorithm for \begin_inset Quotes eld \end_inset suffix trees on words \begin_inset Quotes erd \end_inset . The suffixes that are to be inserted into the tree are defined by delimiters that define the end of \begin_inset Quotes eld \end_inset words \begin_inset Quotes erd \end_inset in the string. \layout Itemize Andersson's construction algorithm is complicated but is able to cover the case of evenly spaced suffixes as well as some others. However it is unnecessary as we shall see. \layout Slide Sparse suffix trees for distributed computing \layout Itemize We distribute the suffix tree by splitting the text \emph on lexicographically. \layout Itemize Define the suffixes that are to inserted by referring to a set of \emph on valid \emph default suffixes, \begin_inset Formula $V_{z}$ \end_inset , where \begin_inset Formula $z$ \end_inset is a short fixed substing of the text. \begin_inset Formula $V_{z}$ \end_inset contains the start positions of each suffix that has \begin_inset Formula $z$ \end_inset as a prefix. We say that a word in string \begin_inset Formula $t$ \end_inset is \emph on nodal \emph default wrt to a SST if it corresponds to a node in that SST. \layout Itemize Suffix links will in general point across SSTs defined in this way. To perform the construction efficiently we must redefine them. \layout Slide Sparse suffix links \layout Standard We require that sparse suffix links must point to nodes within the same SST and that they are powerful enough to enable linear time construction. \layout Standard \series bold Definition: \layout Standard Consider an input pair \begin_inset Formula $(t,V_{z})$ \end_inset and sparse suffix tree \begin_inset Formula $T=sst(t,V_{z})$ \end_inset . Let \begin_inset Formula $aw$ \end_inset be nodal in \begin_inset Formula $T$ \end_inset and \begin_inset Formula $v$ \end_inset be the longest repeated suffix of \begin_inset Formula $aw$ \end_inset that occurs in \begin_inset Formula $T$ \end_inset . A \emph on sparse \emph default \emph on suffix \emph default \emph on link \emph default or \emph on ssl \emph default is an unlabelled edge from \begin_inset Formula $\overline{aw}$ \end_inset to the \emph on root \emph default if \begin_inset Formula $|v|<|z|$ \end_inset and from \begin_inset Formula $\overline{aw}$ \end_inset to \begin_inset Formula $\overline{v}$ \end_inset , otherwise. \layout Slide Normal suffix tree \layout Standard \align center \begin_inset Graphics filename link.eps scale 50 subcaptionText "The standard suffix tree of $aacacccacacaccacaaa\$$ with standard suffix links.." \end_inset \layout Standard The standard suffix tree of \begin_inset Formula $aacacccacacaccacaaa\$$ \end_inset with standard suffix links. \layout Slide Merged SSTs \layout Standard \align center \begin_inset Graphics filename link.eps scale 50 \end_inset \layout Standard SSTs for \begin_inset Formula $aacacccacacaccacaaa\$$ \end_inset with their respective root nodes. The sparse suffix links are marked with dashed arrows. \layout Slide SST construction \layout Standard In order to construct the SST online we need to be sure which new suffixes might need to be inserted at each turn. \layout Standard \series bold Definition: \layout Standard Consider input pair \begin_inset Formula $(p,V_{z})$ \end_inset . Denote the set valid repeated suffixes of \begin_inset Formula $p$ \end_inset by \begin_inset Formula $R(p,V_{z})$ \end_inset . We define this set to include the empty string, \begin_inset Formula $\epsilon$ \end_inset . Let \begin_inset Formula $\alpha(p)$ \end_inset be the longest suffix in \begin_inset Formula $R(p,V_{z})$ \end_inset . \layout Standard \series bold Example \series default : \layout Standard Consider input string \begin_inset Formula $t=aabaaa$ \end_inset with prefix \begin_inset Formula $p=aabaa$ \end_inset and \begin_inset Formula $V_{aa}=\{1,4,5\}$ \end_inset . \begin_inset Formula $\alpha(p)=aa$ \end_inset and \begin_inset Formula $R(p,V_{aa})=\{ aa,\epsilon\}$ \end_inset . \layout Theorem Consider the input pair \begin_inset Formula $(t,V_{z})$ \end_inset with \begin_inset Formula $p$ \end_inset , a proper prefix of \begin_inset Formula $t$ \end_inset . Let \begin_inset Formula $a$ \end_inset be the character in \begin_inset Formula $t$ \end_inset that directly follows the prefix \begin_inset Formula $p$ \end_inset . Consider also the set, \begin_inset Formula $S$ \end_inset , of valid suffixes, \begin_inset Formula $sa$ \end_inset , for \begin_inset Formula $I[1,|pa|]$ \end_inset such that \begin_inset Formula $|\alpha(p)\geq|sa|>|alpha(pa)|$ \end_inset and \begin_inset Formula $s\in R(p,V_{z})$ \end_inset . \begin_inset Formula $sst(pa,V_{z})=sst(p,V_{z})$ \end_inset \emph on augmented \emph default \emph on by \emph default \begin_inset Formula $S$ \end_inset . \layout Standard \series bold Proof \series default : \layout Standard Let \begin_inset Formula $sa$ \end_inset be a valid suffix for \begin_inset Formula $I[1,|pa|]$ \end_inset . We need to insert this new suffix into the tree if and only if \begin_inset Formula $s\in R(p,V_{z})$ \end_inset but \begin_inset Formula $sa\notin R(pa,V_{z})$ \end_inset ( \begin_inset Formula $s=\epsilon$ \end_inset is a special case). In this case \begin_inset Formula $sa$ \end_inset corresponds to a leaf in \begin_inset Formula $sst(pa,V_{z})$ \end_inset but \begin_inset Formula $s$ \end_inset does not correspond to a leaf in \begin_inset Formula $sst(p,V_{z})$ \end_inset so a new node is necessary. \layout Slide slide \layout Theorem For a input pair \begin_inset Formula $(t,V_{z})$ \end_inset with \begin_inset Formula $|z|\geq1$ \end_inset , \begin_inset Formula $sst(t,V_{z})$ \end_inset can be constructed in \begin_inset Formula $O(n)$ \end_inset time, where \begin_inset Formula $n=|t|$ \end_inset . \layout Standard The algorithm in this work can construct both types of sparse suffix tree plus one mor \layout Standard e important variant using a single \the_end