[gate-cvs] SF.net SVN: gate:[17384] mimir/branches/5.0/doc

valyt Fri, 21 Feb 2014 06:12:29 -0800

Revision: 17384
          http://sourceforge.net/p/gate/code/17384
Author:   valyt
Date:     2014-02-21 14:11:56 +0000 (Fri, 21 Feb 2014)
Log Message:
-----------
Updated the user guide in preparation for the 5.0 release.
<drumroll />


Modified Paths:
--------------
    mimir/branches/5.0/doc/admin.tex
    mimir/branches/5.0/doc/changes.tex
    mimir/branches/5.0/doc/img/default-front-page.png
    mimir/branches/5.0/doc/img/local-index-created.png
    mimir/branches/5.0/doc/img/local-index-list.png
    mimir/branches/5.0/doc/img/new-local-index.png
    mimir/branches/5.0/doc/indexing.tex
    mimir/branches/5.0/doc/introduction.tex
    mimir/branches/5.0/doc/mimir-guide.pdf
    mimir/branches/5.0/doc/quickstart.tex

Modified: mimir/branches/5.0/doc/admin.tex
===================================================================
--- mimir/branches/5.0/doc/admin.tex    2014-02-21 12:01:34 UTC (rev 17383)
+++ mimir/branches/5.0/doc/admin.tex    2014-02-21 14:11:56 UTC (rev 17384)
@@ -291,33 +291,20 @@
 \Mimir\ applies equally to all three index types.
 
 Each \Mimir\ index has a {\em state}, and the operations that can be performed
-on the index depend on which state it is currently in.  When first created, a
-local index will be in the {\em indexing} state, meaning it is waiting for
-documents to be added to the index.  When all the documents have been added to
-the index an administrator will close the index, putting it into the {\em
-closing} state.  For large indexes the closing process can take several hours,
-and when it is complete the index will enter the {\em searching} state, at
-which point it is available for querying.  The other possible state for a local
-index is {\em failed}, indicating a problem with the index.  Typically a failed
-index will need to be deleted by the administrator.  Thus it is apparent that
-an index cannot be used simultaneously for searching and indexing.  An existing
-set of index files can be imported into a running \Mimir\ instance as a local
-index, which will then immediately be in the {\em searching} state.
+on the index depend on which state it is currently in. Indexes spend most of
+their time in the {\em ready} state, when they can index new documents and
+answer queries. During various operations they may temporarily be in a
+different state, such as {\em closing} while the index is being shut down,
+typically because the \Mimir{} server is itself being shut down. Sometimes a
+local index is {\em failed}, indicating a problem with the index.  Typically a
+failed index will need to be deleted by the administrator.
 
-Remote indexes inherit their state from the remote server, and federated
-indexes inherit their state by combining the states of their component indexes.
+Remote indexes inherit their state from the remote server, and federated indexes
+inherit their state by combining the states of their component indexes.
 A federated index may occasionally appear in the {\em working} state if its
-component indexes are not all in the same state (for example if some of them
-have started closing but others are still in {\em indexing} mode), but the
-working state will usually resolve to a normal state once the component indexes
-have synchronised.
+component indexes are not all in the same state, but the working state will
+usually resolve to a normal state once the component indexes have synchronised.
 
-Note that once a local index has moved from {\em indexing} to {\em closing} to
-{\em searching} it is not possible to add more documents to the same index.
-The suggested way to add to an index is to create a new index to hold the new
-documents, fill it, close it, and then create a federated index consisting of
-the original index plus the new one (or if the original index was itself
-federated, add the new index to the existing federation).
 
 A typical setup for a large-scale indexing task would be to have a number of
 identical ``slave'' servers running \Mimir, each with a single local index.  A
@@ -366,7 +353,7 @@
 %
 The index will be assigned a unique identifier and a new directory will be
 created under the \verb|indexBaseDirectory| you configured earlier to
-hold the index data.  The newly-created index will be in the {\em indexing}
+hold the index data.  The newly-created index will start in the {\em ready}
 state (see Figure~\ref{fig:local-index-created}), ready to receive documents
 for indexing.  For details of how to submit documents to the index, see
 Chapter~\ref{sec:indexing}.
@@ -392,26 +379,12 @@
 \label{fig:local-index-list}
 \end{figure}
 %
-Once all the documents to be indexed have been submitted the index can be
-closed using the {\em Close} link on the index information page.  This will
-change the state of the index to {\em closing} as described above and begin the
-closing process.  The information page will show a live-updating progress bar
-(Figure~\ref{fig:local-index-closing}) giving some indication of the time
-remaining until the index has completely closed.
-%
-\begin{figure}[htb!]
-\begin{center}
-\includegraphics[scale=0.5]{img/local-index-closing}
-\end{center}
-\caption{Closing a local index}
-\label{fig:local-index-closing}
-\end{figure}
-%
+At any time, the index can then be searched using the tools described in
+Chapter~\ref{sec:searching}. Recently added documents only become available for
+searching after a {\em sync-to-disk} has taken place. Sync operations happen
+automatically at regular intervals, or can be triggered by the user by pressing
+the {\em Sync to Disk} button seen at the bottom of the index information page.
 
-When the closing process is complete (the progress bar reaches 100\%) the index
-will switch into {\em searching} mode, and the index can then be searched using
-the tools described in Chapter~\ref{sec:searching}.
-
 \subsection{Working with Remote and Federated Indexes}
 
 The architecture of \Mimir\ is designed to make working with remote and
@@ -482,17 +455,14 @@
 \subsection{Deleting Indexes}
 
 If an index registered with Mimir is no longer required it can be deleted by
-selecting the {\em Delete} button from the index information page (accessible
-by clicking on the name of the relevant index on the \Mimir\ front page).  For
+selecting the {\em Delete} button from the index information page (accessible by
+clicking on the name of the relevant index on the \Mimir\ front page).  For
 remote and federated indexes this simply deletes the ``registration'' of the
 index with \Mimir, which can be easily re-created as above.  For local indexes
 it also offers the option to delete the underlying index files from disk.  If a
-local index in {\em searching} state is deleted without deleting the disk files
-then the index can be re-created later using the {\em import an existing index
-for searching} option from the \Mimir\ front page.  However, if a local index
-in the {\em indexing} or {\em closing} state is deleted without having been
-properly closed then the index files will be unusable and will need to be
-deleted manually.
+local index is deleted without deleting the disk files then the index can be
+re-created later using the {\em import an existing index for searching} option
+from the \Mimir\ front page.
 
 \Mimir\ will not allow the deletion of an index which is currently part of a
 federated index in the same \Mimir\ instance.  To delete such an index, it must

Modified: mimir/branches/5.0/doc/changes.tex
===================================================================
--- mimir/branches/5.0/doc/changes.tex  2014-02-21 12:01:34 UTC (rev 17383)
+++ mimir/branches/5.0/doc/changes.tex  2014-02-21 14:11:56 UTC (rev 17384)
@@ -1,11 +1,13 @@
 This appendix details the main changes in each \Mimir\ release.
 
 
-\section{Version 5.0 (forthcoming)}
+\section{Version 5.0 (February 2014)}
 \begin{itemize}
-  \item \Mimir{} has been upgraded to use MG4J version 5.2.1. Newly created
-  indexes will now be semi-succint, which is the highest performance
-  implementation.
+  \item \Mimir{} indexes are now updateable: new documents can be submitted for
+  indexing at any time.
+  \item \Mimir{} indexes are now live: they can index new documents and serve
+  queries at the same time. Manually {\em closing} indexes before they become
+  searchable is no longer required.
   \item The {\em mimir-demo} example web application has been removed.
   \item The {\em mimir-cloud} has been modified to make it more suitable as a
   generic example web application.
@@ -20,7 +22,12 @@
   mention. The S-A-H implementations included in the main distribution provide
   default implementations for this functionality, which can be replaced by
   plugging in alternative versions.
-  \item \Mimir{} now uses Grails 2.2.3 and GWT 2.5.0 to build the mimir-cloud
+  \item The on-disk format for \Mimir{} indexes has changed. This was required
+  in order to support live indexing and searching.
+  \item \Mimir{} has been upgraded to use MG4J version 5.2.1. Newly created
+  indexes will now be semi-succinct, which is the highest performance
+  implementation.
+  \item \Mimir{} now uses Grails 2.2.3 and GWT 2.6.0 to build the mimir-cloud
   web application.
   \item Bugfix: you can now use a string on the right hand side of a \verb!<!,
   \verb!>!, \verb!<=! and \verb!>=! in annotation queries. This was always

Modified: mimir/branches/5.0/doc/img/default-front-page.png
===================================================================
(Binary files differ)

Modified: mimir/branches/5.0/doc/img/local-index-created.png
===================================================================
(Binary files differ)

Modified: mimir/branches/5.0/doc/img/local-index-list.png
===================================================================
(Binary files differ)

Modified: mimir/branches/5.0/doc/img/new-local-index.png
===================================================================
(Binary files differ)

Modified: mimir/branches/5.0/doc/indexing.tex
===================================================================
--- mimir/branches/5.0/doc/indexing.tex 2014-02-21 12:01:34 UTC (rev 17383)
+++ mimir/branches/5.0/doc/indexing.tex 2014-02-21 14:11:56 UTC (rev 17384)
@@ -206,8 +206,8 @@
 
 \subsection*{Direct Indexes}
 \label{sec:direct-indexes}
-Starting with version 5, \Mimir{} can build direct indexes as well as inverted
-ones. By default only inverted indexes are created, which are used to associate
+Starting with version $5.0$, \Mimir{} can build direct indexes as well as
+inverted ones. By default only inverted indexes are created, which are used to associate
 terms to documents. Direct indexes encode the inverse relation from documents to
 terms, hence a direct index can be used to find out which terms occur in any
 given document.
@@ -238,11 +238,9 @@
 Note that direct indexes can only be enabled at the level of an {\tt index}
 element in the template, and not for individual annotation types.
 
-Direct indexes are built during the closing of the index, so enabling them will
-slightly increase the time taken for the close operation, but will not affect
-the indexing time. They are stored in separate files from the default indirect
-indexes, so they will not affect the functionality that does not require direct
-indexes at all.
+Direct indexes are stored in separate files from the default indirect indexes,
+so they will not affect the functionality that does not require direct indexes
+at all.
 
 Direct indexes can currently only be searched via the Java API provided by the
 {\tt gate.mimir.search.terms} package.

Modified: mimir/branches/5.0/doc/introduction.tex
===================================================================
--- mimir/branches/5.0/doc/introduction.tex     2014-02-21 12:01:34 UTC (rev 17383)
+++ mimir/branches/5.0/doc/introduction.tex     2014-02-21 14:11:56 UTC (rev 17384)
@@ -80,3 +80,86 @@
 implementation used by \Mimir\ is based on ORDI and
 OWLIM\footnote{See
 \url{http://www.ontotext.com/ordi/} and \url{http://www.ontotext.com/owlim/}.}.
+
+\section{Core Concepts}
+
+\Mimir{} provides indexing infrastructure for annotated
+GATE\footnote{\url{http://gate.ac.uk}} documents. Users can start a \Mimir{}
+server, submit documents to it for indexing, and execute queries against the set
+of indexed documents.
+
+A \Mimir{} index is a composite of multiple sub-indexes, which are defined in
+the {\em index template} that needs to be provided by the user when a new
+\Mimir{} index is created (see Section~\ref{sec:indexing:templates} for
+details).
+
+{\bf Token Indexes} are sub-indexes that store the information associated with
+\verb!{Token}! annotations. These provide a way to index the document
+content. \Mimir{} does not directly index the document text. Instead it uses the
+sequence of \verb!{Token}! annotations to construct a representation of the
+document text. This provides more flexibility: if the user chooses to index the
+{\tt string} feature of the tokens, that is equivalent to indexing the document
+text. Alternatively, the user could choose to pre-process their documents with the
+GATE Morphological Analyser, and instead index the morphological root of each
+token. This normalises the representation of words (by eliminating inflections)
+and allows different forms of the same word to be matched (e.g. {\em house} and
+{\em houses}). This is similar to stemming/lemmatising, a process traditionally
+employed in Information Retrieval, but it is more advanced and linguistically
+sophisticated, and allows matching e.g. {\em be}, {\em was}, {\em are} with
+each other, which stemming cannot do.
+
+Besides allowing the user to choose which token feature should be indexed,
+\Mimir{} also allows multiple token features to be indexed in parallel
+sub-indexes. The user can actually choose to index {\bf both} the token string
+and morphological root. In that case, the feature mentioned first in the
+{\em index template} becomes the default token feature. To search on any of the
+other token features, queries need to specify which feature they want to target
+(see Section~\ref{sec:string-query} for details).
+
+{\bf Annotation Indexes} are the other type of \Mimir{} sub-index. They are used
+to index information about annotations on the document. Which annotations should
+be indexed is described in the {\em index template}.
+
+Both token and annotation indexes can be configured to also use {\bf direct
+indexes}. Direct indexes can be used to perform searches for terms starting from
+documents, for example finding the most frequently occurring word (or
+annotation) in a set of documents. This functionality is only available from the
+Java API and cannot be directly accessed by the system users via the web
+interface. More details can be found in Section~\ref{sec:direct-indexes}.
+
+\section{\Mimir{} Lifecycle}
+
+In versions prior to $5.0$, a \Mimir{} index would start its existence in {\em
+indexing} mode, when it would accept new documents for indexing. When all the
+documents had been indexed, the index would need to be {\em closed}, which would
+switch its operation mode to {\em searching}, and the index would then be able
+to answer queries. Once closed, an index could not accept any further documents
+for indexing. Starting with version $5.0$, a \Mimir{} index continually
+accepts documents to be indexed and can answer queries that address the
+currently indexed document set. From being sent to \Mimir{} for indexing to
+becoming available for search, documents go through several stages, which we
+describe next.
+
+Documents submitted for indexing are initially accumulated in RAM, during
+which time they are not available for searching. A {\em sync-to-disk}
+operation writes all the documents currently in RAM to disk, in the form of an
+{\em index batch}, after which the documents can be searched. Sync-to-disk
+operations happen automatically when too much document data has accumulated
+in RAM, or after a given time interval has passed since the last sync.
+Alternatively, the user can trigger a sync operation from the index admin web
+interface.
+
+Every sync-to-disk operation causes a new index {\em batch} to be created.
+All the batches are merged into an index cluster which is then used to serve
+queries. If the number of batches gets too large, it can harm efficiency, or the
+system can run into problems due to too many files being open. To
+avoid this, the index batches can be compacted into a single batch. \Mimir{}
+indexes will automatically do that once the number of batches exceeds a certain
+threshold (which can be modified via API calls).
+
+In order to keep its consistency, a \Mimir{} index {\bf must} be closed in an
+orderly fashion before the \Mimir{} server process is shut down. Shutting down
+the \Mimir{} server (e.g. the {\tt mimir-cloud} web application) will
+automatically close all currently open indexes. Users should never forcefully
+kill the \Mimir{} server process, as that would prevent the close operations
+from being performed, which can lead to data loss or corrupt existing indexes.
\ No newline at end of file
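
Reviewer's note on the lifecycle text added above: the sync-to-disk and compaction behaviour can be sketched with a toy model. This is only an illustration under stated assumptions, not the Mimir implementation or its API. The class name ToyLiveIndex, its methods, and the two thresholds (max_ram_docs, max_batches) are all hypothetical, the time-based sync trigger is left out, and "search" is reduced to whole-word matching.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLiveIndex:
    """Toy model of the documented lifecycle; NOT the real Mimir API."""

    max_ram_docs: int = 3   # hypothetical: sync once this many docs sit in RAM
    max_batches: int = 2    # hypothetical: compact once more batches than this exist
    ram: list = field(default_factory=list)      # submitted, not yet searchable
    batches: list = field(default_factory=list)  # synced, searchable

    def submit(self, doc: str) -> None:
        """Accept a document; it stays in RAM until the next sync."""
        self.ram.append(doc)
        if len(self.ram) >= self.max_ram_docs:
            self.sync_to_disk()

    def sync_to_disk(self) -> None:
        """Write the RAM contents out as one new batch, then compact if needed."""
        if not self.ram:
            return
        self.batches.append(list(self.ram))
        self.ram.clear()
        if len(self.batches) > self.max_batches:
            self.compact()

    def compact(self) -> None:
        """Merge all batches into a single batch."""
        self.batches = [[doc for batch in self.batches for doc in batch]]

    def search(self, term: str) -> list:
        """Return synced documents containing the term as a whole word."""
        return [doc for batch in self.batches
                for doc in batch if term in doc.split()]
```

In this model a document submitted with submit() is invisible to search() until sync_to_disk() runs, which mirrors the delay described in the text; whether the real thresholds are expressed in document counts, bytes, or something else is not stated in the diff.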

Modified: mimir/branches/5.0/doc/mimir-guide.pdf
===================================================================
(Binary files differ)

Modified: mimir/branches/5.0/doc/quickstart.tex
===================================================================
--- mimir/branches/5.0/doc/quickstart.tex       2014-02-21 12:01:34 UTC (rev 17383)
+++ mimir/branches/5.0/doc/quickstart.tex       2014-02-21 14:11:56 UTC (rev 17384)
@@ -119,11 +119,13 @@
     during the previous step. The \Mimir{} Indexing PR instance will make
     sure the annotated documents are sent for indexing to your new Local Index.
   \end{enumerate}
-  \item Go back to the index details page in your browser. Click the {\em Close}
-  button, and wait for the index to finish closing. When done, go back to the
-  main administration page.
-  \item {\bf You can now search the new index} by clicking the {\em search} link
-  next to the name of your new index.
+  \item {\bf Search the new index:} as soon as the index has started indexing
+  documents, you can search it by clicking the {\em search} link next to
+  the name of your new index. There is a time delay between documents being
+  submitted for indexing and their becoming available for searching. You can speed
+  this process up by manually performing a {\em sync-to-disk} operation or by
+  reducing the time interval between batches. Both of these actions are
+  available on the index administration page.
 \end{enumerate}
 
 To shut down the running web application, create a file named {\tt
