My understanding is that a Term comprises a "token" and a field. With that definition the documentation makes sense to me - returning the count of tokens in a field, for example. But a couple of the references you cited don't match that definition, like the number of tokens in a collection. Then again, maybe a Term doesn't carry a whole token, since token attributes like payloads are not part of a Term. I guess I've convinced myself I'm not entirely clear about it either, but I do feel good about the idea that tokens don't have fields: you can tokenize a string without thinking about fields, and the tokens become terms with fields when you index or query.
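To make that concrete, here is a minimal sketch (plain Java, not Lucene itself; the Term record and tokenize method are hypothetical stand-ins for org.apache.lucene.index.Term and an analyzer) of tokens existing without a field, and of how "number of tokens" and "number of distinct terms" give different counts:

```java
import java.util.*;

public class TokenVsTerm {
    // Hypothetical stand-in for org.apache.lucene.index.Term:
    // a term is the pairing of a field name with token text.
    record Term(String field, String text) {}

    // Naive whitespace "analyzer": produces tokens with no field attached.
    static List<String> tokenize(String raw) {
        return Arrays.asList(raw.toLowerCase().split("\\s+"));
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("The quick fox and the lazy fox");

        // Tokens become terms only when tied to a field at index/query time.
        List<Term> terms = new ArrayList<>();
        for (String t : tokens) terms.add(new Term("body", t));

        // "Number of tokens in the field": counts every occurrence.
        System.out.println(terms.size());                 // 7
        // Number of *distinct* terms in the field.
        System.out.println(new HashSet<>(terms).size());  // 5
    }
}
```

So when the javadoc says "number of tokens," it is counting occurrences (7 here), which is not the same as the distinct-term count (5 here) - which may be part of why the wording feels slippery.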
Ryan

On Wednesday, April 20, 2016, Jack Krupansky <[email protected]> wrote:

> Looking at the Lucene Similarity Javadoc, I see some references to tokens,
> but I am wondering if that is intentional or whether those should really be
> references to terms.
>
> For example:
>
>  * <li><b>lengthNorm</b> - computed
>  * when the document is added to the index in accordance with the number of tokens
>  * of this field in the document, so that shorter fields contribute more to the score.
>
> I think that should be terms, not tokens.
>
> See:
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/TFIDFSimilarity.java#L466
>
> And this:
>
>  * Returns the total number of tokens in the field.
>  * @see Terms#getSumTotalTermFreq()
>  */
> public long getNumberOfFieldTokens() {
>     return numberOfFieldTokens;
>
> I think that should be terms as well.
>
> See:
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/BasicStats.java#L65
>
> And... this:
>
> numberOfFieldTokens = sumTotalTermFreq;
>
> Where it is clearly starting with terms and treating them as tokens, but
> as in the previous example, I think that should be terms as well.
>
> See:
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/SimilarityBase.java#L128
>
> One last example:
>
>  * Compute any collection-level weight (e.g. IDF, average document length, etc)
>  * needed for scoring a query.
>  *
>  * @param collectionStats collection-level statistics, such as the number of
>  *        tokens in the collection.
>  * @param termStats term-level statistics, such as the document frequency of
>  *        a term across the collection.
>  * @return SimWeight object with the information this Similarity needs to
>  *         score a query.
>  */
> public abstract SimWeight computeWeight(CollectionStatistics collectionStats,
>     TermStatistics... termStats);
>
> See:
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java#L161
>
> In fact, CollectionStatistics uses term, not token:
>
> /** returns the total number of tokens for this field
>  * @see Terms#getSumTotalTermFreq() */
> public final long sumTotalTermFreq() {
>     return sumTotalTermFreq;
>
> Oops... it uses both, emphasizing my point about the confusion.
>
> There are other examples as well.
>
> My understanding is that tokens are merely a temporary transition between
> the original raw source text for a field and the final terms to be indexed
> (or query terms from a parsed and analyzed query.) Yes, during and within
> the TokenStream or the analyzer we speak of tokens, and intermediate string
> values are referred to as tokens, but once the final string value is
> retrieved from the TokenStream (analyzer), it's a term.
>
> In any case, is there some distinction in any of these cited examples (or
> other examples in this or related code) where "token" is an important
> distinction to be made and "term" is not the proper... term... to be used?
>
> Unless the Lucene project fully intends that the terms token and term are
> absolutely synonymous, a clear distinction should be drawn... I think. Or
> at least the terms should be used consistently, which my last example
> highlights.
>
> Thanks.
>
> -- Jack Krupansky
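For what it's worth, the assignment `numberOfFieldTokens = sumTotalTermFreq` quoted above does hold arithmetically: summing each distinct term's occurrence count gives back the total number of tokens that survived analysis for that field. A minimal sketch (plain Java, not Lucene; the map is a stand-in for a field's postings):

```java
import java.util.*;

public class SumTotalTermFreq {
    public static void main(String[] args) {
        // Postings-style view of one field: distinct term -> totalTermFreq.
        Map<String, Long> termFreqs = new HashMap<>();
        for (String token : "to be or not to be".split(" ")) {
            termFreqs.merge(token, 1L, Long::sum);
        }

        // sumTotalTermFreq = sum of per-term occurrence counts...
        long sumTotalTermFreq =
            termFreqs.values().stream().mapToLong(Long::longValue).sum();

        // ...which equals the total token count for the field.
        System.out.println(sumTotalTermFreq);  // 6 tokens
        System.out.println(termFreqs.size());  // 4 distinct terms
    }
}
```

So the statistic is well defined either way; the confusion is purely about which word the javadoc should use for it.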
