On top of what Shawn rightly said, two things:

1. Your best bet is to benchmark your own solution with and without the
shingles. Then you will know for sure and have a story with numbers to tell.
2. If you go with the shingles approach, consider removing duplicates with

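For point 1, here is a minimal sketch of a timing harness, not a definitive
benchmark. The function name time_queries and the run_query callable are my
own placeholders: run_query stands for whatever client call sends one query
to your Solr core (one core indexed with shingles, one without), and the
stub below only fakes that call so the script runs on its own.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence


def time_queries(run_query: Callable[[str], object],
                 queries: Sequence[str],
                 repeats: int = 5) -> Dict[str, float]:
    """Run every query `repeats` times and summarize latency in milliseconds."""
    samples: List[float] = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            run_query(query)  # hypothetical: replace with a real Solr request
            samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "count": len(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


if __name__ == "__main__":
    # Stub standing in for a real search call, so the sketch is self-contained.
    def fake_search(query: str) -> int:
        return len(query.split())

    phrases = ["new york", "new york city", "city"]
    print(time_queries(fake_search, phrases, repeats=3))
```

Run it once against each index with the same query list (ideally phrases
sampled from real traffic) and compare the median and 95th percentile rather
than a single run. Warm the caches first, or the first pass will dominate.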

On Mon, Oct 27, 2014 at 3:11 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 10/27/2014 6:20 AM, Robust Links wrote:
> > 1) we want to index and search all tokens in a document (i.e. we do not
> > rely on external stores)
> >
> > 2) we need search time to be fast and are willing to pay for it with
> > larger indexing time and index size,
> >
> > 3)  be able to search as fast as possible ngrams of 3 tokens or less
> > (i.e., unigrams, bigrams and trigrams).
> >
> >
> > To satisfy (1) we used the default
> > <maxFieldLength>2147483647</maxFieldLength> in
> > solrconfig.xml of 3.6.1 index to specify the total number of tokens to
> > index in an article. In solr 4 we are specifying it via the tokenizer in
> > the analyzer chain
> >
> >
> > <tokenizer class="solr.ClassicTokenizerFactory" maxTokenLength="2147483647"/>
> >
> >
> > To satisfy 2 and 3 in our 3.6.1 index we indexed using the following
> > ShingleFilterFactory in the analyzer chain
> >
> >
> > <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
> > maxShingleSize="3"/>
> >
> >
> > This was based on this thread:
> >
> >
> > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200808.mbox/%3c856ac15f0808161539p54417df2ga5a6fdfa35889...@mail.gmail.com%3E
> >
> >
> > The open questions we are trying to understand now are:
> >
> >
> > 1) whether shingling is still the best strategy for phrase (ngram) search
> > given our requirements above?
> >
> > 2) if not then what would be a better strategy.
>
> The maxFieldLength setting is different from maxTokenLength.  The former
> is the number of tokens that are allowed.  The latter is the number of
> characters allowed in *each* token.  Since the value you were using
> should be the default value for maxFieldLength, you don't need it in
> your config.
>
> As for maxTokenLength, if the older version worked right without that
> setting, you probably don't need it now.  Really long tokens are usually
> useless, unless a later step in the analysis will be breaking them up into
> additional tokens (terms).  It's exceptionally rare that people will use
> or type a "word" that's 256 characters.  I have seen documents that
> exceed the token length on keyword fields where the input is only
> separated by commas -- there are no spaces for the WhitespaceTokenizer
> to split on, so a document with a lot of keywords ends up indexing none
> of them because the tokenizer ignores the input due to length.  If it
> had indexed them, they would have been further tokenized by the
> WordDelimiterFilter.
>
> Shingles may or may not be required to match the way you have described.
> It all depends on the *exact* nature of your queries.  I haven't
> wrapped my head around the possibilities, so I can't give you an
> example.  Since it's been working on your older index, chances are
> excellent that it will continue to work on the newer index.  Shingles
> can indeed increase search performance, if the conditions are right.
> Search performance in general is better in 4.x than it was in 3.x.
>
> It's always a good idea to look at this wiki page (and even dive into
> the Lucene javadocs) from time to time in order to determine whether
> there's a better way of doing your analysis:
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
> It sounds like you've been at this a while, so you probably already know
> this next part, but it would be irresponsible of me to talk about all
> this without mentioning it.  When you change your index analysis, you
> must reindex.
>
> http://wiki.apache.org/solr/HowToReindex
>
> Thanks,
> Shawn

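To make the index-size trade-off concrete, here is a small sketch in plain
Python (not Lucene code) that imitates what a shingle filter configured as
in the quoted chain, with outputUnigrams="true" and maxShingleSize="3",
would emit for a token stream. The function name shingles and its
parameters are my own, and the output order is only an approximation of the
real filter.

```python
from typing import List


def shingles(tokens: List[str],
             max_size: int = 3,
             output_unigrams: bool = True) -> List[str]:
    """Return every run of 1..max_size adjacent tokens, joined by a space."""
    min_size = 1 if output_unigrams else 2
    out: List[str] = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


if __name__ == "__main__":
    print(shingles(["please", "divide", "this"]))
    # ['please', 'please divide', 'please divide this',
    #  'divide', 'divide this', 'this']
```

For n tokens (n >= 2) and a maximum shingle size of 3 this yields
n + (n - 1) + (n - 2) = 3n - 3 terms, so expect close to three times as many
term occurrences as a unigram-only field. That is the price for letting a
two- or three-word phrase match as a single term, provided the query side
is analyzed into the same shingles.
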
Dmitry Kan
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info
