I am having an issue with an old search application we are using. The application (built on Lucene 3.0.2) queries an index generated by Nutch 1.3. One very long page (approx. 124 KB) is crawled and inserted into the index, but when I search for it through our web-based application, only terms from roughly the top 20% of the page content return results.
If I open the index in Luke 1.0.1, I can see the full contents of the field, but if I search for a term that I know is in there and it is not in the top ~20% of the page, the search comes back empty.

So my question is: is there a size limit for a field in Lucene 3.0.2?

Regards
Mark

On 21/02/2013 15:14, "Dyer, James" <james.d...@ingramcontent.com> wrote:

> Samuel,
>
> Do you think you could write a failing unit test and open a JIRA issue? Or at
> the least open a JIRA issue with all the details without a test?
>
> James Dyer
> Ingram Content Group
> (615) 213-4311
>
>
> -----Original Message-----
> From: Samuel García Martínez [mailto:samuelgmarti...@gmail.com]
> Sent: Thursday, February 21, 2013 2:33 AM
> To: java-user@lucene.apache.org
> Subject: Re: possible bug on Spellchecker
> Importance: Low
>
> I'm using Solr 3.6 and DirectSpellchecker is available only on v4+.
> Moreover, in "big" indexes i prefer using sidekick index rather than
> iterating over term dictionary.
>
>
> On Thu, Feb 21, 2013 at 8:19 AM, Jack Krupansky
> <j...@basetechnology.com> wrote:
>
>> Any reason that you are not using the DirectSpellChecker?
>>
>> See:
>> http://lucene.apache.org/core/4_0_0/suggest/org/apache/lucene/search/spell/DirectSpellChecker.html
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Samuel García Martínez
>> Sent: Wednesday, February 20, 2013 3:34 PM
>> To: java-user@lucene.apache.org
>> Subject: possible bug on Spellchecker
>>
>>
>> Hi all,
>>
>> Debugging Solr spellchecker (IndexBasedSpellchecker, delegating on lucene
>> Spellchecker) behaviour i think i found a bug when the input is a 6 letter
>> word:
>> - george
>> - anthem
>> - argued
>> - fluent
>>
>> Due to the getMin() and getMax() the grams indexed for these terms are 3
>> and 4. So, the fields would be something like this:
>> - for "*george*"
>>
>>   - start3: "geo"
>>   - start4: "geor"
>>   - end3: "rge"
>>   - end4: "orge"
>>   - 3: "geo", "eor", "org", "rge"
>>   - 4: "geor", "eorg", "orge"
>> - for "*anthem*"
>>
>>   - start3: "ant"
>>   - start4: "anth"
>>   - end3: "hem"
>>   - end4: "them"
>>
>> The problem shows up when the user swap 3rd a 4th characters, misspelling
>> the word like this:
>> - geroge
>> - anhtem
>>
>> The queries generated for this terms are: (SHOULD boolean queries)
>> - for "*geroge*"
>>
>>   - start3: "ger"
>>   - start4: "gero"
>>   - end3: "oge"
>>   - end4: "roge"
>>   - 3: "ger", "ero", "rog", "oge"
>>   - 4: "gero", "erog", "roge"
>> - for "*anhtem*"
>>
>>   - start3: "anh"
>>   - start4: "anht"
>>   - end3: "tem"
>>   - end4: "htem"
>>   - 3: "anh", "nht", "hte", "tem"
>>   - 4: "anht", "nhte", "htem"
>>
>> So, as you can see, this kind of misspelling never matches the suitable
>> suggestions although the edit distance is 0.95555556.
>>
>> I think getMin(int l) and getMax(int l) should return 2 and 3,
>> respectively, for l==6. Debugging other values i did not found any problem
>> with any kind of misspelling.
>>
>> Any thoughts about this?
>>
>> --
>> Un saludo,
>> Samuel García
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
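For anyone following the quoted spellchecker thread, the n-gram mismatch it describes can be reproduced with a small standalone sketch. This is not Lucene's SpellChecker code: the class `GramOverlap` and its helpers `grams` and `shared` are hypothetical names made up for illustration, and the field naming (`startN`, `endN`, `N`) simply mirrors the lists in the quoted message. It assumes gram sizes of 3 and 4 for six-letter words, as reported there, and compares them with the proposed sizes 2 and 3.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class GramOverlap {

    // Hypothetical helper: builds "field:gram" pairs for one word, for every
    // gram size from min to max, using the field names from the quoted message.
    static Set<String> grams(String word, int min, int max) {
        Set<String> out = new LinkedHashSet<String>();
        int len = word.length();
        for (int n = min; n <= max; n++) {
            for (int i = 0; i + n <= len; i++) {
                String g = word.substring(i, i + n);
                out.add(n + ":" + g);                            // plain n-gram field
                if (i == 0) out.add("start" + n + ":" + g);      // gram at the start of the word
                if (i + n == len) out.add("end" + n + ":" + g);  // gram at the end of the word
            }
        }
        return out;
    }

    // Hypothetical helper: the field:gram pairs that an indexed word and a
    // typed (misspelled) word have in common for the given gram sizes.
    static Set<String> shared(String indexed, String typed, int min, int max) {
        Set<String> common = grams(indexed, min, max);
        common.retainAll(grams(typed, min, max));
        return common;
    }

    public static void main(String[] args) {
        // Gram sizes 3 and 4, as reported for six-letter words in the thread.
        System.out.println(shared("george", "geroge", 3, 4));
        System.out.println(shared("anthem", "anhtem", 3, 4));
        // Gram sizes 2 and 3, as proposed in the thread.
        System.out.println(shared("george", "geroge", 2, 3));
    }
}
```

With sizes 3 and 4 the two misspellings share no grams with the correct words, so a query built only from those grams has nothing to match. With sizes 2 and 3, "george" and "geroge" share at least the leading and trailing "ge".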