There's an overridable default limit of 10,000 tokens per field; that's the first place I'd look. I forget exactly how to set it to a higher value...
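To make the limit concrete: in Lucene 3.0.x it is configured on the writer, either through the `IndexWriter.MaxFieldLength` constructor argument (`LIMITED` is 10,000 tokens, `UNLIMITED` lifts the cap) or through `IndexWriter.setMaxFieldLength(int)`. Since this index is built by Nutch, the change presumably has to be made on the indexing side and the page reindexed. The sketch below does not call Lucene at all; it is a plain-Java illustration, with a hypothetical class `FieldLimitSketch` and simple whitespace tokenization assumed, of why a stored field can look complete in Luke while terms past the limit are not searchable.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; this is not Lucene code.
class FieldLimitSketch {

    // Returns the distinct terms that would be searchable if only the first
    // maxTokens whitespace-separated tokens of the field were indexed.
    static Set<String> indexedTerms(String fieldText, int maxTokens) {
        Set<String> terms = new HashSet<String>();
        String trimmed = fieldText.trim();
        if (trimmed.isEmpty()) {
            return terms;
        }
        String[] tokens = trimmed.split("\\s+");
        int limit = Math.min(tokens.length, maxTokens);
        for (int i = 0; i < limit; i++) {
            terms.add(tokens[i].toLowerCase());
        }
        return terms;
    }

    public static void main(String[] args) {
        // Build a long "page": one early term, 12,000 filler tokens, one late term.
        StringBuilder page = new StringBuilder("early ");
        for (int i = 0; i < 12000; i++) {
            page.append("filler ");
        }
        page.append("late");

        Set<String> terms = indexedTerms(page.toString(), 10000);
        System.out.println("early searchable: " + terms.contains("early"));
        System.out.println("late searchable:  " + terms.contains("late"));
    }
}
```

With a 10,000-token cap the early term is found and the late one is not, which matches the "only the top ~20% comes back" symptom for a very long page.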
Best
Erick

P.S. Please don't hit reply to a message and change the title, but start an e-mail fresh. See: http://people.apache.org/~hossman/#threadhijack

On Thu, Feb 21, 2013 at 11:59 AM, Mark Wilson <m...@sanger.ac.uk> wrote:

> I am having an issue with an old Search Application we are using.
>
> We have a Search App (using Lucene 3.0.2) that queries an index generated by
> Nutch 1.3. There is a really long page (approx 124kb) that is crawled and
> inserted into the index, but when I search for it, (using a web-based
> application based on Lucene 3.0.2) only the top ~20% of the page content is
> coming back with results.
>
> If I open the index up using Luke-1.0.1, I can see all the contents of the
> field, but if I search for a term that I know is in there, and it's not in
> the top ~20% of the page, it comes back blank.
>
> So my question is, Is there a size limit for a field in Lucene 3.0.2
>
> Regards
> Mark
>
> On 21/02/2013 15:14, "Dyer, James" <james.d...@ingramcontent.com> wrote:
>
> > Samuel,
> >
> > Do you think you could write a failing unit test and open a JIRA issue? Or at
> > the least open a JIRA issue with all the details without a test?
> >
> > James Dyer
> > Ingram Content Group
> > (615) 213-4311
> >
> > -----Original Message-----
> > From: Samuel García Martínez [mailto:samuelgmarti...@gmail.com]
> > Sent: Thursday, February 21, 2013 2:33 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: possible bug on Spellchecker
> > Importance: Low
> >
> > I'm using Solr 3.6 and DirectSpellchecker is available only on v4+.
> > Moreover, in "big" indexes i prefer using sidekick index rather than
> > iterating over term dictionary.
> >
> > On Thu, Feb 21, 2013 at 8:19 AM, Jack Krupansky
> > <j...@basetechnology.com> wrote:
> >
> >> Any reason that you are not using the DirectSpellChecker?
> >>
> >> See:
> >> http://lucene.apache.org/core/4_0_0/suggest/org/apache/lucene/search/spell/DirectSpellChecker.html
> >>
> >> -- Jack Krupansky
> >>
> >> -----Original Message-----
> >> From: Samuel García Martínez
> >> Sent: Wednesday, February 20, 2013 3:34 PM
> >> To: java-user@lucene.apache.org
> >> Subject: possible bug on Spellchecker
> >>
> >> Hi all,
> >>
> >> Debugging Solr spellchecker (IndexBasedSpellchecker, delegating on lucene
> >> Spellchecker) behaviour I think I found a bug when the input is a 6 letter
> >> word:
> >> - george
> >> - anthem
> >> - argued
> >> - fluent
> >>
> >> Due to the getMin() and getMax() the grams indexed for these terms are 3
> >> and 4. So, the fields would be something like this:
> >> - for "*george*"
> >>   - start3: "geo"
> >>   - start4: "geor"
> >>   - end3: "rge"
> >>   - end4: "orge"
> >>   - 3: "geo", "eor", "org", "rge"
> >>   - 4: "geor", "eorg", "orge"
> >> - for "*anthem*"
> >>   - start3: "ant"
> >>   - start4: "anth"
> >>   - end3: "hem"
> >>   - end4: "them"
> >>
> >> The problem shows up when the user swaps the 3rd and 4th characters,
> >> misspelling the word like this:
> >> - geroge
> >> - anhtem
> >>
> >> The queries generated for these terms are (SHOULD boolean queries):
> >> - for "*geroge*"
> >>   - start3: "ger"
> >>   - start4: "gero"
> >>   - end3: "oge"
> >>   - end4: "roge"
> >>   - 3: "ger", "ero", "rog", "oge"
> >>   - 4: "gero", "erog", "roge"
> >> - for "*anhtem*"
> >>   - start3: "anh"
> >>   - start4: "anht"
> >>   - end3: "tem"
> >>   - end4: "htem"
> >>   - 3: "anh", "nht", "hte", "tem"
> >>   - 4: "anht", "nhte", "htem"
> >>
> >> So, as you can see, this kind of misspelling never matches the suitable
> >> suggestions although the edit distance is 0.95555556.
> >>
> >> I think getMin(int l) and getMax(int l) should return 2 and 3,
> >> respectively, for l==6.
> >> Debugging other values I did not find any problem
> >> with any kind of misspelling.
> >>
> >> Any thoughts about this?
> >>
> >> --
> >> Regards,
> >> Samuel García
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>
> --
> The Wellcome Trust Sanger Institute is operated by Genome Research
> Limited, a charity registered in England with number 1021457 and a
> company registered in England with number 2742969, whose registered
> office is 215 Euston Road, London, NW1 2BE.
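
On the quoted spellchecker question, the gram overlap is easy to check by hand or with a few lines of code. The sketch below is only an illustration under stated assumptions: `GramOverlapSketch`, `grams` and `shared` are hypothetical names, and the code mirrors the gram lists in Samuel's message rather than Lucene's actual SpellChecker implementation. It lists the grams that an indexed word and its transposed misspelling have in common for gram sizes 2 to 4.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper; it mirrors the quoted gram lists, not Lucene's code.
class GramOverlapSketch {

    // All contiguous substrings of length n, in order of first appearance.
    static Set<String> grams(String word, int n) {
        Set<String> out = new LinkedHashSet<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // Grams of length n that the indexed word and the misspelled query share.
    static Set<String> shared(String indexed, String query, int n) {
        Set<String> common = grams(indexed, n);
        common.retainAll(grams(query, n));
        return common;
    }

    public static void main(String[] args) {
        String[][] pairs = { {"george", "geroge"}, {"anthem", "anhtem"} };
        for (String[] p : pairs) {
            for (int n = 2; n <= 4; n++) {
                System.out.println(p[0] + " vs " + p[1] + ", n=" + n + ": "
                        + shared(p[0], p[1], n));
            }
        }
    }
}
```

For both pairs the shared sets are empty at n=3 and n=4, so no start, end or inner gram field can match. At n=2 the pairs share "ge" and "an", "em" respectively, which is consistent with the suggestion that 2 and 3 would work better for 6-letter words.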