My understanding is that a Term comprises a "token" and a field. With that definition the documentation makes sense to me - returning the count of tokens in a field, for example. But a couple of the references you cited don't match that definition, like the number of tokens in a collection. Then again, maybe a Term doesn't carry a whole token, since token attributes like payloads are not part of a Term. I guess I've convinced myself I'm not entirely clear about it either, but I do feel good about the idea that tokens don't have fields: you can tokenize a string without thinking about fields, and the tokens become terms with fields when you index or query.
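To make that concrete, here is a minimal sketch (plain Java, not Lucene itself; the Term record and tokenize method are hypothetical stand-ins for org.apache.lucene.index.Term and an analyzer) of tokens existing without a field, and of how "number of tokens" and "number of distinct terms" give different counts:

```java
import java.util.*;

public class TokenVsTerm {
    // Hypothetical stand-in for org.apache.lucene.index.Term:
    // a term is the pairing of a field name with token text.
    record Term(String field, String text) {}

    // Naive whitespace "analyzer": produces tokens with no field attached.
    static List<String> tokenize(String raw) {
        return Arrays.asList(raw.toLowerCase().split("\\s+"));
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("The quick fox and the lazy fox");

        // Tokens become terms only when tied to a field at index/query time.
        List<Term> terms = new ArrayList<>();
        for (String t : tokens) terms.add(new Term("body", t));

        // "Number of tokens in the field": counts every occurrence.
        System.out.println(terms.size());                 // 7
        // Number of *distinct* terms in the field.
        System.out.println(new HashSet<>(terms).size());  // 5
    }
}
```

So when the javadoc says "number of tokens," it is counting occurrences (7 here), which is not the same as the distinct-term count (5 here) - which may be part of why the wording feels slippery.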
Ryan

On Wednesday, April 20, 2016, Jack Krupansky <[email protected]> wrote:

> Looking at the Lucene Similarity Javadoc, I see some references to tokens,
> but I am wondering if that is intentional or whether those should really be
> references to terms.
>
> For example:
>
>  * <li><b>lengthNorm</b> - computed
>  * when the document is added to the index in accordance with the number of tokens
>  * of this field in the document, so that shorter fields contribute more to the score.
>
> I think that should be terms, not tokens.
>
> See:
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/TFIDFSimilarity.java#L466
>
> And this:
>
>  * Returns the total number of tokens in the field.
>  * @see Terms#getSumTotalTermFreq()
>  */
> public long getNumberOfFieldTokens() {
>     return numberOfFieldTokens;
>
> I think that should be terms as well.
>
> See:
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/BasicStats.java#L65
>
> And... this:
>
> numberOfFieldTokens = sumTotalTermFreq;
>
> Where it is clearly starting with terms and treating them as tokens, but
> as in the previous example, I think that should be terms as well.
>
> See:
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/SimilarityBase.java#L128
>
> One last example:
>
>  * Compute any collection-level weight (e.g. IDF, average document length, etc)
>  * needed for scoring a query.
>  *
>  * @param collectionStats collection-level statistics, such as the number of
>  *        tokens in the collection.
>  * @param termStats term-level statistics, such as the document frequency of
>  *        a term across the collection.
>  * @return SimWeight object with the information this Similarity needs to
>  *         score a query.
>  */
> public abstract SimWeight computeWeight(CollectionStatistics collectionStats,
>     TermStatistics... termStats);
>
> See:
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java#L161
>
> In fact, CollectionStatistics uses term, not token:
>
> /** returns the total number of tokens for this field
>  * @see Terms#getSumTotalTermFreq() */
> public final long sumTotalTermFreq() {
>     return sumTotalTermFreq;
>
> Oops... it uses both, emphasizing my point about the confusion.
>
> There are other examples as well.
>
> My understanding is that tokens are merely a temporary transition between
> the original raw source text for a field and the final terms to be indexed
> (or query terms from a parsed and analyzed query.) Yes, during and within
> the TokenStream or the analyzer we speak of tokens, and intermediate string
> values are referred to as tokens, but once the final string value is
> retrieved from the TokenStream (analyzer), it's a term.
>
> In any case, is there some distinction in any of these cited examples (or
> other examples in this or related code) where "token" is an important
> distinction to be made and "term" is not the proper... term... to be used?
>
> Unless the Lucene project fully intends that the terms token and term are
> absolutely synonymous, a clear distinction should be drawn... I think. Or
> at least the terms should be used consistently, which my last example
> highlights.
>
> Thanks.
>
> -- Jack Krupansky
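For what it's worth, the assignment `numberOfFieldTokens = sumTotalTermFreq` quoted above does hold arithmetically: summing each distinct term's occurrence count gives back the total number of tokens that survived analysis for that field. A minimal sketch (plain Java, not Lucene; the map is a stand-in for a field's postings):

```java
import java.util.*;

public class SumTotalTermFreq {
    public static void main(String[] args) {
        // Postings-style view of one field: distinct term -> totalTermFreq.
        Map<String, Long> termFreqs = new HashMap<>();
        for (String token : "to be or not to be".split(" ")) {
            termFreqs.merge(token, 1L, Long::sum);
        }

        // sumTotalTermFreq = sum of per-term occurrence counts...
        long sumTotalTermFreq =
            termFreqs.values().stream().mapToLong(Long::longValue).sum();

        // ...which equals the total token count for the field.
        System.out.println(sumTotalTermFreq);  // 6 tokens
        System.out.println(termFreqs.size());  // 4 distinct terms
    }
}
```

So the statistic is well defined either way; the confusion is purely about which word the javadoc should use for it.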
