misc-creole.tex

adamfunk Tue, 04 Mar 2014 13:57:22 -0800

Revision: 17539
          http://sourceforge.net/p/gate/code/17539
Author:   adamfunk
Date:     2014-03-04 21:57:04 +0000 (Tue, 04 Mar 2014)
Log Message:
-----------
Better documentation of TR calculations.


Modified Paths:
--------------
    userguide/trunk/misc-creole.tex

Modified: userguide/trunk/misc-creole.tex
===================================================================
--- userguide/trunk/misc-creole.tex     2014-03-04 21:48:49 UTC (rev 17538)
+++ userguide/trunk/misc-creole.tex     2014-03-04 21:57:04 UTC (rev 17539)
@@ -3327,8 +3327,11 @@
 \end{itemize}
 
 
-This type of termbank has only the principal score type.
-%% TODO document the flexible language code matching
+This type of termbank has only the principal score type.  When a TfIdf Termbank
+queries this kind for the reference document frequency, two terms are considered
+a match if both have the same language code or if either has an empty language
+code (in case some applications have been run without language identification
+PRs).
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsect{TfIdf Termbank}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -3342,30 +3345,29 @@
   DocumentFrequencyBank will be constructed from this LR's corpora parameter and
   used here.
 \item \textbf{idfCalculation}: an enum (pull-down menu in the GUI) with the
-  following options for adjusting inverted document frequency (all adjusted so
-  they must return a positive value, to prevent division by zero), $g(df)$:
+  following options for adjusting inverted document frequency (all adjusted to
+  prevent division by zero):
   \begin{itemize}
-    % TODO: add unscaled Logarithmic as below
-    % change below to LogarithmicScaled
-  \item \emph{Logarithmic} $=\log_{2}(1+n/\mathit{df})$;
-  \item \emph{Scaled} $=(1+n)/(1+\mathit{df})$;
-  \item \emph{Natural} $=1/(1+\mathit{df})$.
+  \item \emph{LogarithmicScaled}: $\mathit{idf}=\log_{2}\frac{n}{1+\mathit{df}}$;
+  \item \emph{Logarithmic}: $\mathit{idf}=\log_{2}\frac{1}{1+\mathit{df}}$;
+  \item \emph{Scaled}: $\mathit{idf}=\frac{1+n}{1+\mathit{df}}$;
+  \item \emph{Natural}: $\mathit{idf}=\frac{1}{1+\mathit{df}}$.
   \end{itemize}
 \item \textbf{normalization}: an enum (pull-down) with the following options for
-  normalizing the raw score $s$, where $s=f(\mathit{tf}){\times}g(idf)$:
+  normalizing the raw score $s$, where $s=\mathit{atf}\times\mathit{idf}$:
   \begin{itemize}
-  \item \emph{None} $=s$ (this may return numbers in a low range);
-  \item \emph{Hundred} $=100s$ (this makes the sliders easier to use);
-  \item \emph{Sigmoid} $=\frac{200}{1+e^{-s/k}}-100$ (this maps all raw scores
+  \item \emph{None}: $s'=s$ (this may return numbers in a low range);
+  \item \emph{Hundred}: $s'=100s$ (this makes the sliders easier to use);
+  \item \emph{Sigmoid}: $s'=\frac{200}{1+e^{-s/k}}-100$ (this maps all raw scores
     monotonically to values in the 0--100 range, so that $0{\rightarrow}0$ and
     ${\infty}{\rightarrow}100$).
   \end{itemize}
 \item \textbf{tfCalculation}: an enum (pull-down) with the following options for
-  adjusting term frequency $f(\mathit{tf})$:
+  adjusting term frequency:
   \begin{itemize}
-  \item \emph{Natural} $=\mathit{tf}$;
-  \item \emph{Sqrt} $=\sqrt{\mathit{tf}}$;
-  \item \emph{Logarithmic} $=1+\log_{2} \mathit{tf}$.
+  \item \emph{Natural}: $\mathit{atf}=\mathit{tf}$;
+  \item \emph{Sqrt}: $\mathit{atf}=\sqrt{\mathit{tf}}$;
+  \item \emph{Logarithmic}: $\mathit{atf}=1+\log_{2} \mathit{tf}$.
   \end{itemize}
 \end{itemize}
 %%
@@ -3373,15 +3375,15 @@
 individual occurrences of the term in the current corpora), whereas
 $\mathit{df}$ is the document frequency of the term according to the
 DocumentFrequencySource and $n$ is the total number of documents in the
-DocumentFrequencySource.  The raw score
-$s=f(\mathit{tm}){\times}g(\mathit{df})$.
+DocumentFrequencySource.  The raw (unnormalized) score
+$s=\mathit{atf}\times\mathit{idf}$.
 
-This type of termbank has five score types: the principal one (normalized), the
-raw score ($s$ above, with the principal name plus the suffix ``.raw''),
-\emph{termFrequency}, \emph{localDocFrequency} (number of documents in the
-current corpora containing the term; not used in the tf.idf calculation), and
-\emph{refDocFrequency} ($\mathit{df}$ above; this will be the same as
-\emph{localDocFrequency} if no external \emph{docFreqSource} was specified).
+This type of termbank has five score types: the principal one (normalized, $s'$
+above), the raw score ($s$ above, with the principal name plus the suffix
+``.raw''), \emph{termFrequency}, \emph{localDocFrequency} (number of documents
+in the current corpora containing the term; not used in the tf.idf calculation),
+and \emph{refDocFrequency} ($\mathit{df}$ above; this will be the same as
+\emph{localDocFrequency} if no other \emph{docFreqSource} was specified).
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsect{Annotation Termbank}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -3403,9 +3405,9 @@
 \end{itemize}
 
 This type of termbank has four score types: the principal one (normalized), the
-raw score (minimum, maximum, or mean above; with the principal name plus the
-suffix ``.raw''), \emph{termFrequency}, \emph{localDocFrequency} (the last two
-are not used in the calculation).
+raw score (minimum, maximum, or mean, determined as described above; with the
+principal name plus the suffix ``.raw''), \emph{termFrequency}, and
+\emph{localDocFrequency} (the last two are not used in the calculation).
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsect{Hyponymy Termbank}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -3420,16 +3422,16 @@
 \end{itemize}
 %%
 Head information is generated by the multiword JAPE grammar included in the
-application.  This LR treats $T_1$ a hyponym of $T_2$ if and only if $T_2$'s
+application.  This LR treats $T_1$ as a hyponym of $T_0$ if and only if $T_0$'s
 head feature's value ends with $T_1$'s head or string feature's value.  (This
 depends on \emph{head-final} construction of compound nouns, as used in English
-and German.)
+and German.)  The raw score $s(T_0)=\mathit{df}\times(1+h)$, where $h$ is the
+number of hyponyms of $T_0$.
 
 This type of termbank has five score types: the principal one (normalized), the
 raw score ($s$ above, with the principal name plus the suffix ``.raw''),
-\emph{termFrequency}, \emph{hyponymCount} (number of distinct hyponyms found in
-the current corpora), and \emph{localDocFrequency} (number of documents in the
-current corpora containing the term; not used in other calculations).
+\emph{termFrequency} (not used in the scoring), \emph{hyponymCount} (number of
+distinct hyponyms found in the current corpora), and \emph{localDocFrequency}.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsect{Termbank Score Copier}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
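
As a reading aid for the TfIdf Termbank hunk above, here is a minimal Python sketch of the documented calculation. It is not the GATE implementation: the function and dictionary names are hypothetical, the option names and formulas are taken from the added (+) lines, and the Sigmoid constant k is not defined in this diff, so it appears here as a parameter with an assumed default of 1.0.

```python
import math

# Hypothetical names; formulas follow the "+" lines of the diff above.
# df = reference document frequency, n = number of reference documents.
IDF_CALCULATIONS = {
    "LogarithmicScaled": lambda df, n: math.log2(n / (1 + df)),
    "Logarithmic": lambda df, n: math.log2(1 / (1 + df)),
    "Scaled": lambda df, n: (1 + n) / (1 + df),
    "Natural": lambda df, n: 1 / (1 + df),
}

# tf = term frequency; the result is the adjusted term frequency (atf).
TF_CALCULATIONS = {
    "Natural": lambda tf: tf,
    "Sqrt": lambda tf: math.sqrt(tf),
    "Logarithmic": lambda tf: 1 + math.log2(tf),
}


def normalize(s, mode, k=1.0):
    """Map the raw score s to the principal score s'.

    k is an assumption: the diff uses it in the Sigmoid formula
    without saying how it is chosen.
    """
    if mode == "None":
        return s
    if mode == "Hundred":
        return 100 * s
    if mode == "Sigmoid":
        return 200 / (1 + math.exp(-s / k)) - 100
    raise ValueError("unknown normalization: " + mode)


def tfidf_scores(tf, df, n, tf_calc="Natural", idf_calc="Scaled",
                 norm="None", k=1.0):
    """Return (raw score s, normalized score s') for one term."""
    atf = TF_CALCULATIONS[tf_calc](tf)
    idf = IDF_CALCULATIONS[idf_calc](df, n)
    raw = atf * idf  # s = atf * idf
    return raw, normalize(raw, norm, k)
```

For example, `tfidf_scores(4, 1, 9, "Sqrt", "Scaled", "Hundred")` combines atf = 2 with idf = 5 to give a raw score of 10 and a normalized score of 1000.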

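The language-code matching rule and the Hyponymy Termbank raw score described in the hunks above can be sketched in the same way. Again this is only an illustration under stated assumptions: the function names are hypothetical, terms are reduced to plain strings, and empty or missing feature values are skipped in the hyponym test, which the diff does not specify.

```python
def language_codes_match(code_a, code_b):
    """Two terms match if their language codes are equal or either is empty."""
    return code_a == code_b or not code_a or not code_b


def is_hyponym(t1_head, t1_string, t0_head):
    """Test whether T1 counts as a hyponym of T0 under the documented rule:
    T0's head value ends with T1's head value or T1's string value.

    Skipping empty or None values is an assumption made for this sketch.
    """
    candidates = [value for value in (t1_head, t1_string) if value]
    return any(t0_head.endswith(value) for value in candidates)


def hyponymy_raw_score(df, h):
    """Raw score s(T0) = df * (1 + h), where h is the number of hyponyms of T0."""
    return df * (1 + h)
```

A term found in 3 documents with 2 hyponyms would get a raw score of `hyponymy_raw_score(3, 2)`, that is 3 * (1 + 2) = 9.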