Revision: 17536
http://sourceforge.net/p/gate/code/17536
Author: adamfunk
Date: 2014-03-04 19:49:45 +0000 (Tue, 04 Mar 2014)
Log Message:
-----------
Working on TR update
Modified Paths:
--------------
userguide/trunk/misc-creole.tex
Modified: userguide/trunk/misc-creole.tex
===================================================================
--- userguide/trunk/misc-creole.tex 2014-03-04 19:05:23 UTC (rev 17535)
+++ userguide/trunk/misc-creole.tex 2014-03-04 19:49:45 UTC (rev 17536)
@@ -3267,8 +3267,9 @@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsect{Termbank language resources}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-A \emph{Termbank} is a GATE language resource derived from annotations on one or
-more GATE corpora. All termbanks have the following init parameters.
+A \emph{Termbank} is a GATE language resource derived from term candidate
+annotations on one or more GATE corpora. All termbanks have the following init
+parameters.
%%
\begin{itemize}
\item \textbf{corpora}: a \texttt{Set<gate.Corpus>} from which the termbank is
@@ -3293,36 +3294,61 @@
The \texttt{Term} class is defined in terms of the term string itself, the
-language code, and the annotation type, so
-\emph{affect}(\emph{english},\emph{Noun}) is distinct from
+language code, and the annotation type, so it is possible to distinguish
+\emph{affect}(\emph{english},\emph{Noun}) from
\emph{affect}(\emph{english},\emph{Verb}), and
-\emph{gift}(\emph{english},\emph{Noun}) is distinct from
+\emph{gift}(\emph{english},\emph{Noun}) from
\emph{gift}(\emph{german},\emph{Noun}).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsubsect{DocumentFrequencyBank}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+This termbank counts the number of documents in which each term is found, and is
+used primarily as input to the TfIdf Termbank. Document frequency can thus be
+determined from a reference corpus in advance and used in subsequent calculations
+of tf.idf over other corpora.
+
+A document frequency bank can be constructed from one or more corpora, from one
+or more existing document frequency banks, or from a combination of them, so
+that document frequency counts from different sources can be compiled together.
+It therefore has one additional parameter:
+%%
+\begin{itemize}
+\item \textbf{inputBanks}: zero or more other instances of
+ \emph{DocumentFrequencyBank}.
+\end{itemize}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsect{TfIdf Termbank}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
This termbank calculates tf.idf scores over all the term candidates in the set
of corpora. It has the following additional init parameters.
%%
\begin{itemize}
-\item \textbf{idfCalculation}: an enum (pull-down menu in the GUI) with the following
- options for inverted document frequency:
+\item \textbf{docFreqSource}
+\item \textbf{idfCalculation}: an enum (pull-down menu in the GUI) with the
+ following options for inverse document frequency:
\begin{itemize}
+ \item \emph{Logarithmic} $=\log_{2}(n/df)$;
+ \item \emph{Scaled} $= 1/df$;
\item \emph{Natural} $= 1/df$;
- \item \emph{Logarithmic} $=\log_{2}(n/df)$;
- \item \emph{LogarithmicPlus1} $=1+\log_{2} (n/df)$.
\end{itemize}
+\item \textbf{normalization}
+ \begin{itemize}
+ \item \emph{None}
+ \item \emph{Hundred}
+ \item \emph{Sigmoid}
+ \end{itemize}
\item \textbf{tfCalculation}: an enum (pull-down) with the following options for
term frequency:
\begin{itemize}
\item \emph{Natural} $=tf$;
+ \item \emph{Sqrt}
\item \emph{Logarithmic} $=1+\log_{2} tf$.
\end{itemize}
\end{itemize}
%%
For these calculations, $tf$ is the term frequency (number of occurrences of the
-term in the corpora), $df$ is the document frequency (number of documents
-containing the term), and $n$ is the total number of documents.
+term in the corpora), $df$ is the document frequency according to the
+DocumentFrequencySource, and $n$ is the total number of documents.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsect{Annotation Termbank}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -3336,6 +3362,12 @@
\texttt{Number} or interpretable as a number.
\item \textbf{mergingMode}: an enum (pull-down menu in the GUI) with the options
\emph{MINIMUM}, \emph{MEAN}, or \emph{MAXIMUM}.
+\item \textbf{normalization}
+ \begin{itemize}
+ \item \emph{None}
+ \item \emph{Hundred}
+ \item \emph{Sigmoid}
+ \end{itemize}
\end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsect{Hyponymy Termbank}
@@ -3346,10 +3378,16 @@
\begin{itemize}
\item \textbf{inputHeadFeatures} (\texttt{List<String>}): annotation features on
term candidates containing the head of the expression.
+\item \textbf{normalization}
+ \begin{itemize}
+ \item \emph{None}
+ \item \emph{Hundred}
+ \item \emph{Sigmoid}
+ \end{itemize}
\end{itemize}
%%
Head information is generated by the multiword JAPE grammar included in the
-application. We consider $T_1$ a hyponym of $T_2$ if and only if $T_2$ head
+application. We consider $T_1$ a hyponym of $T_2$ if and only if $T_2$'s head
feature value ends with $T_1$'s head or string feature value.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsect{Termbank Score Copier}
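
The tf.idf options documented in the TfIdf Termbank hunk above can be illustrated with a minimal Python sketch. It is not part of this commit and is not GATE's Java implementation: the function names are hypothetical, and it assumes the final score is the product of the chosen tf weight and the chosen idf weight. Only the options for which the text states a formula are covered; \emph{Sqrt}, \emph{Scaled} and the normalization options are left out.

```python
import math


def tf_weight(tf, mode="Natural"):
    """Term-frequency weight, using the formulas given for tfCalculation."""
    if mode == "Natural":
        return float(tf)              # Natural = tf
    if mode == "Logarithmic":
        return 1.0 + math.log2(tf)    # Logarithmic = 1 + log2(tf)
    raise ValueError("no formula documented for tf mode: " + mode)


def idf_weight(df, n, mode="Logarithmic"):
    """Document-frequency weight, using the formulas given for idfCalculation."""
    if mode == "Natural":
        return 1.0 / df               # Natural = 1/df
    if mode == "Logarithmic":
        return math.log2(n / df)      # Logarithmic = log2(n/df)
    raise ValueError("no formula documented for idf mode: " + mode)


def tf_idf(tf, df, n, tf_mode="Natural", idf_mode="Logarithmic"):
    """Assumption: the score is the plain product tf weight * idf weight."""
    return tf_weight(tf, tf_mode) * idf_weight(df, n, idf_mode)


if __name__ == "__main__":
    # A term seen 4 times, in 2 of the 8 documents of the reference corpus.
    print(tf_idf(4, 2, 8, "Logarithmic", "Logarithmic"))
    print(tf_idf(4, 2, 8, "Natural", "Natural"))
```

Here $tf$, $df$ and $n$ have the meanings given in the hunk: occurrences of the term in the corpora, documents containing the term according to the document frequency source, and total number of documents.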