Revision: 17539
http://sourceforge.net/p/gate/code/17539
Author: adamfunk
Date: 2014-03-04 21:57:04 +0000 (Tue, 04 Mar 2014)
Log Message:
-----------
Better documentation of TR calculations.
Modified Paths:
--------------
userguide/trunk/misc-creole.tex
Modified: userguide/trunk/misc-creole.tex
===================================================================
--- userguide/trunk/misc-creole.tex 2014-03-04 21:48:49 UTC (rev 17538)
+++ userguide/trunk/misc-creole.tex 2014-03-04 21:57:04 UTC (rev 17539)
@@ -3327,8 +3327,11 @@
\end{itemize}
-This type of termbank has only the principal score type.
-%% TODO document the flexible language code matching
+This type of termbank has only the principal score type. When a TfIdf Termbank
+queries this kind for the reference document frequency, two terms are considered
+a match if both have the same language code or if either has an empty language
+code (in case some applications have been run without language identification
+PRs).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsect{TfIdf Termbank}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -3342,30 +3345,29 @@
DocumentFrequencyBank will be constructed from this LR's corpora parameter and
used here.
\item \textbf{idfCalculation}: an enum (pull-down menu in the GUI) with the
- following options for adjusting inverted document frequency (all adjusted so
- they must return a positive value, to prevent division by zero), $g(df)$:
+ following options for adjusting inverted document frequency (all adjusted to
+ prevent division by zero):
\begin{itemize}
- % TODO: add unscaled Logarithmic as below
- % change below to LogarithmicScaled
- \item \emph{Logarithmic} $=\log_{2}(1+n/\mathit{df})$;
- \item \emph{Scaled} $=(1+n)/(1+\mathit{df})$;
- \item \emph{Natural} $=1/(1+\mathit{df})$.
+ \item \emph{LogarithmicScaled}: $\mathit{idf}=\log_{2}\frac{n}{1+\mathit{df}}$;
+ \item \emph{Logarithmic}: $\mathit{idf}=\log_{2}\frac{1}{1+\mathit{df}}$;
+ \item \emph{Scaled}: $\mathit{idf}=\frac{1+n}{1+\mathit{df}}$;
+ \item \emph{Natural}: $\mathit{idf}=\frac{1}{1+\mathit{df}}$.
\end{itemize}
\item \textbf{normalization}: an enum (pull-down) with the following options for
- normalizing the raw score $s$, where $s=f(\mathit{tf}){\times}g(idf)$:
+ normalizing the raw score $s$, where $s=\mathit{atf}\times\mathit{idf}$:
\begin{itemize}
- \item \emph{None} $=s$ (this may return numbers in a low range);
- \item \emph{Hundred} $=100s$ (this makes the sliders easier to use);
- \item \emph{Sigmoid} $=\frac{200}{1+e^{-s/k}}-100$ (this maps all raw scores
+ \item \emph{None}: $s'=s$ (this may return numbers in a low range);
+ \item \emph{Hundred}: $s'=100s$ (this makes the sliders easier to use);
+ \item \emph{Sigmoid}: $s'=\frac{200}{1+e^{-s/k}}-100$ (this maps all raw scores
monotonically to values in the 0--100 range, so that $0{\rightarrow}0$ and
${\infty}{\rightarrow}100$).
\end{itemize}
\item \textbf{tfCalculation}: an enum (pull-down) with the following options for
- adjusting term frequency $f(\mathit{tf})$:
+ adjusting term frequency:
\begin{itemize}
- \item \emph{Natural} $=\mathit{tf}$;
- \item \emph{Sqrt} $=\sqrt{\mathit{tf}}$;
- \item \emph{Logarithmic} $=1+\log_{2} \mathit{tf}$.
+ \item \emph{Natural}: $\mathit{atf}=\mathit{tf}$;
+ \item \emph{Sqrt}: $\mathit{atf}=\sqrt{\mathit{tf}}$;
+ \item \emph{Logarithmic}: $\mathit{atf}=1+\log_{2} \mathit{tf}$.
\end{itemize}
\end{itemize}
%%
@@ -3373,15 +3375,15 @@
individual occurrences of the term in the current corpora), whereas
$\mathit{df}$ is the document frequency of the term according to the
DocumentFrequencySource and $n$ is the total number of documents in the
-DocumentFrequencySource. The raw score
-$s=f(\mathit{tm}){\times}g(\mathit{df})$.
+DocumentFrequencySource. The raw (unnormalized) score
+$s=\mathit{atf}\times\mathit{idf}$.
-This type of termbank has five score types: the principal one (normalized), the
-raw score ($s$ above, with the principal name plus the suffix ``.raw''),
-\emph{termFrequency}, \emph{localDocFrequency} (number of documents in the
-current corpora containing the term; not used in the tf.idf calculation), and
-\emph{refDocFrequency} ($\mathit{df}$ above; this will be the same as
-\emph{localDocFrequency} if no external \emph{docFreqSource} was specified).
+This type of termbank has five score types: the principal one (normalized, $s'$
+above), the raw score ($s$ above, with the principal name plus the suffix
+``.raw''), \emph{termFrequency}, \emph{localDocFrequency} (number of documents
+in the current corpora containing the term; not used in the tf.idf calculation),
+and \emph{refDocFrequency} ($\mathit{df}$ above; this will be the same as
+\emph{localDocFrequency} if no other \emph{docFreqSource} was specified).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsect{Annotation Termbank}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -3403,9 +3405,9 @@
\end{itemize}
This type of termbank has four score types: the principal one (normalized), the
-raw score (minimum, maximum, or mean above; with the principal name plus the
-suffix ``.raw''), \emph{termFrequency}, \emph{localDocFrequency} (the last two
-are not used in the calculation).
+raw score (minimum, maximum, or mean, determined as described above; with the
+principal name plus the suffix ``.raw''), \emph{termFrequency}, and
+\emph{localDocFrequency} (the last two are not used in the calculation).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsect{Hyponymy Termbank}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -3420,16 +3422,16 @@
\end{itemize}
%%
Head information is generated by the multiword JAPE grammar included in the
-application. This LR treats $T_1$ a hyponym of $T_2$ if and only if $T_2$'s
+application. This LR treats $T_1$ as a hyponym of $T_0$ if and only if $T_0$'s
head feature's value ends with $T_1$'s head or string feature's value. (This
depends on \emph{head-final} construction of compound nouns, as used in English
-and German.)
+and German.) The raw score $s(T_0)=\mathit{df}\times(1+h)$, where $h$ is the
+number of hyponyms of $T_0$.
This type of termbank has five score types: the principal one (normalized), the
raw score ($s$ above, with the principal name plus the suffix ``.raw''),
-\emph{termFrequency}, \emph{hyponymCount} (number of distinct hyponyms found in
-the current corpora), and \emph{localDocFrequency} (number of documents in the
-current corpora containing the term; not used in other calculations).
+\emph{termFrequency} (not used in the scoring), \emph{hyponymCount} (number of
+distinct hyponyms found in the current corpora), and \emph{localDocFrequency}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsect{Termbank Score Copier}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
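
Illustrative note (not part of the commit): the formulas documented in the
added lines above can be sketched in Python as follows. This is a minimal
sketch, not the TermRaider implementation. All function names are
hypothetical, the option names are taken from the enum labels in the diff,
and the sigmoid constant k is left as a required argument because the diff
does not state its value. The Logarithmic tf option assumes tf >= 1.

```python
import math


def adjusted_tf(tf, mode="Natural"):
    """Adjusted term frequency (atf), per the tfCalculation options."""
    if mode == "Natural":
        return float(tf)
    if mode == "Sqrt":
        return math.sqrt(tf)
    if mode == "Logarithmic":
        return 1.0 + math.log2(tf)  # assumes tf >= 1
    raise ValueError("unknown tfCalculation: %s" % mode)


def adjusted_idf(df, n, mode="LogarithmicScaled"):
    """Adjusted inverse document frequency (idf), per idfCalculation.

    The 1 + df denominator is what prevents division by zero.
    """
    if mode == "LogarithmicScaled":
        return math.log2(n / (1 + df))
    if mode == "Logarithmic":
        return math.log2(1 / (1 + df))
    if mode == "Scaled":
        return (1 + n) / (1 + df)
    if mode == "Natural":
        return 1 / (1 + df)
    raise ValueError("unknown idfCalculation: %s" % mode)


def normalize(s, mode="None", k=None):
    """Map the raw score s to the principal score s'."""
    if mode == "None":
        return s
    if mode == "Hundred":
        return 100 * s
    if mode == "Sigmoid":
        if k is None:
            # k is not specified in the diff; the caller must supply it.
            raise ValueError("Sigmoid normalization needs a value for k")
        return 200 / (1 + math.exp(-s / k)) - 100
    raise ValueError("unknown normalization: %s" % mode)


def language_codes_match(code_a, code_b):
    """Match if the codes are equal or if either one is empty."""
    return code_a == code_b or code_a == "" or code_b == ""


def hyponymy_raw_score(df, h):
    """Raw Hyponymy Termbank score: df * (1 + h), h = number of hyponyms."""
    return df * (1 + h)


if __name__ == "__main__":
    # Raw tf.idf score s = atf * idf, then the normalized score s'.
    s = adjusted_tf(4, "Sqrt") * adjusted_idf(3, 16, "LogarithmicScaled")
    print(s, normalize(s, "Hundred"))
```

Keeping the three adjustments as separate functions mirrors the three
independent pull-down parameters, so any tf option can be combined with any
idf option and any normalization.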