Revision: 17536
          http://sourceforge.net/p/gate/code/17536
Author:   adamfunk
Date:     2014-03-04 19:49:45 +0000 (Tue, 04 Mar 2014)
Log Message:
-----------
Working on TR update

Modified Paths:
--------------
    userguide/trunk/misc-creole.tex

Modified: userguide/trunk/misc-creole.tex
===================================================================
--- userguide/trunk/misc-creole.tex     2014-03-04 19:05:23 UTC (rev 17535)
+++ userguide/trunk/misc-creole.tex     2014-03-04 19:49:45 UTC (rev 17536)
@@ -3267,8 +3267,9 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsect{Termbank language resources}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-A \emph{Termbank} is a GATE language resource derived from annotations on one or
-more GATE corpora.  All termbanks have the following init parameters.
+A \emph{Termbank} is a GATE language resource derived from term candidate
+annotations on one or more GATE corpora.  All termbanks have the following init
+parameters.
 %%
 \begin{itemize}
 \item \textbf{corpora}: a \texttt{Set<gate.Corpus>} from which the termbank is
@@ -3293,36 +3294,61 @@
 
 
 The \texttt{Term} class is defined in terms of the term string itself, the
-language code, and the annotation type, so
-\emph{affect}(\emph{english},\emph{Noun}) is distinct from
+language code, and the annotation type, so it is possible to distinguish
+\emph{affect}(\emph{english},\emph{Noun}) from
 \emph{affect}(\emph{english},\emph{Verb}), and
-\emph{gift}(\emph{english},\emph{Noun}) is distinct from
+\emph{gift}(\emph{english},\emph{Noun}) from
 \emph{gift}(\emph{german},\emph{Noun}).
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsubsect{DocumentFrequencyBank}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+This termbank counts the number of documents in which each term is found, and is
+used primarily as input to the TfIdf Termbank.  Document frequency can thus be
+determined from a reference corpus in advance and used in subsequent calculations
+of tf.idf over other corpora.
+
+A document frequency bank can be constructed from one or more corpora, from one
+or more existing document frequency banks, or from a combination of them, so
+that document frequency counts from different sources can be compiled together.
+It therefore has one additional parameter:
+%%
+\begin{itemize}
+\item \textbf{inputBanks}: zero or more other instances of
+  \emph{DocumentFrequencyBank}.
+\end{itemize}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsect{TfIdf Termbank}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 This termbank calculates tf.idf scores over all the term candidates in the set
 of corpora.  It has the following additional init parameters.
 %%
 \begin{itemize}
-\item \textbf{idfCalculation}: an enum (pull-down menu in the GUI) with the following
-  options for inverted document frequency:
+\item \textbf{docFreqSource}
+\item \textbf{idfCalculation}: an enum (pull-down menu in the GUI) with the
+  following options for inverted document frequency:
   \begin{itemize}
+  \item \emph{Logarithmic} $=\log_{2}(n/df)$;
+  \item \emph{Scaled} $= 1/df$;
   \item \emph{Natural} $= 1/df$;
-  \item \emph{Logarithmic} $=\log_{2}(n/df)$;
-  \item \emph{LogarithmicPlus1} $=1+\log_{2} (n/df)$.
   \end{itemize}
+\item \textbf{normalization}
+  \begin{itemize}
+  \item \emph{None}
+  \item \emph{Hundred}
+  \item \emph{Sigmoid}
+  \end{itemize}
 \item \textbf{tfCalculation}: an enum (pull-down) with the following options for
   term frequency:
   \begin{itemize}
   \item \emph{Natural} $=tf$;
+  \item \emph{Sqrt} 
   \item \emph{Logarithmic} $=1+\log_{2} tf$.
   \end{itemize}
 \end{itemize}
 %%
 For these calculations, $tf$ is the term frequency (number of occurrences of the
-term in the corpora), $df$ is the document frequency (number of documents
-containing the term), and $n$ is the total number of documents.
+term in the corpora), $df$ is the document frequency according to the
+DocumentFrequencySource, and $n$ is the total number of documents.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsect{Annotation Termbank}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -3336,6 +3362,12 @@
   \texttt{Number} or interpretable as a number.
 \item \textbf{mergingMode}: an enum (pull-down menu in the GUI) with the options
   \emph{MINIMUM}, \emph{MEAN}, or \emph{MAXIMUM}.
+\item \textbf{normalization}
+  \begin{itemize}
+  \item \emph{None}
+  \item \emph{Hundred}
+  \item \emph{Sigmoid}
+  \end{itemize}
 \end{itemize}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsect{Hyponymy Termbank}
@@ -3346,10 +3378,16 @@
 \begin{itemize}
 \item \textbf{inputHeadFeatures} (\texttt{List<String>}): annotation features on
   term candidates containing the head of the expression.
+\item \textbf{normalization}
+  \begin{itemize}
+  \item \emph{None}
+  \item \emph{Hundred}
+  \item \emph{Sigmoid}
+  \end{itemize}
 \end{itemize}
 %%
 Head information is generated by the multiword JAPE grammar included in the
-application.  We consider $T_1$ a hyponym of $T_2$ if and only if $T_2$ head
+application.  We consider $T_1$ a hyponym of $T_2$ if and only if $T_2$'s head
 feature value ends with $T_1$'s head or string feature value.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsect{Termbank Score Copier}
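
The tf.idf options documented in the diff above can be illustrated with a
minimal Python sketch. It covers only the variants for which the diff gives a
distinct formula: Natural and Logarithmic, for both tfCalculation and
idfCalculation. Sqrt and the normalization modes are left out because no
formula is stated for them, and Scaled is left out because the diff gives it
the same formula as Natural. The function names and the final multiplication
of the two components are assumptions made for illustration, not GATE's actual
implementation.

```python
import math


def term_frequency(tf, mode="Natural"):
    """Term-frequency component, following the tfCalculation options."""
    if mode == "Natural":
        return float(tf)
    if mode == "Logarithmic":
        # 1 + log2(tf), as listed in the user guide text
        return 1.0 + math.log2(tf)
    raise ValueError("unsupported tfCalculation: " + mode)


def inverse_document_frequency(df, n, mode="Logarithmic"):
    """Inverse-document-frequency component, following idfCalculation."""
    if mode == "Natural":
        return 1.0 / df
    if mode == "Logarithmic":
        # log2(n / df), where n is the total number of documents
        return math.log2(n / df)
    raise ValueError("unsupported idfCalculation: " + mode)


def tf_idf(tf, df, n, tf_mode="Natural", idf_mode="Logarithmic"):
    # Assumption: the score is the product of the two components.
    return term_frequency(tf, tf_mode) * inverse_document_frequency(df, n, idf_mode)


if __name__ == "__main__":
    # A term occurring 8 times, found in 4 of 16 documents.
    print(tf_idf(8, 4, 16, "Logarithmic", "Logarithmic"))
    print(tf_idf(8, 4, 16, "Natural", "Natural"))
```

Here $df$ and $n$ would come from whatever document frequency source is
configured, so the same functions apply whether the counts are taken from the
corpora being scored or from a separate reference corpus.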
