This is probably what you just found, but for others: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
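
If pulling in the Solr analysis package is more than you need, the simpler route Morten describes in the quoted thread (lowercase, strip markup, drop non-word characters, then index the stripped text) can be roughed out in plain Java. This is only a sketch under my own assumptions: the class name HtmlTextNormalizer and its methods are made up for illustration and are not part of Neo4j, Lucene or Solr. The tag regex is crude, and it does not decode entities such as &amp;amp; the way a real HTML stripper would.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not a Neo4j/Lucene/Solr class.
final class HtmlTextNormalizer {
    // Crude tag matcher: everything from '<' up to the next '>'.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    // Any character that is not a letter, a digit or whitespace.
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{Nd}\\s]");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    // Lowercase, strip markup, drop punctuation, collapse whitespace.
    static String normalize(String html) {
        String text = html.toLowerCase(Locale.ROOT);
        text = TAGS.matcher(text).replaceAll(" ");
        text = NON_WORD.matcher(text).replaceAll(" ");
        return SPACES.matcher(text).replaceAll(" ").trim();
    }

    // Split the normalized text on single spaces; empty input gives no tokens.
    static List<String> tokens(String html) {
        String normalized = normalize(html);
        if (normalized.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(normalized.split(" "));
    }

    public static void main(String[] args) {
        // The trailing dot no longer sticks to the last word.
        System.out.println(tokens("<p>Cheese is good.</p>")); // prints [cheese, is, good]
    }
}
```

You would store the result of normalize() on the node and hand that same string to the index, so that de-indexing later uses exactly the text that was indexed. Queries would need the same normalization applied before they are run.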
--
Toby Matejovsky

On Wed, Sep 15, 2010 at 12:49 PM, <rick.bullo...@burningskysoftware.com> wrote:

> Removing HTML markup is not a trivial task, but luckily, the Apache
> Solr team has already created additional analyzers for Lucene that do
> what I need (the analysis package in solr has a lot of really good
> stuff in it);
>
> I will still need some help from the Neo team to understand how use a
> specific analyzer instead of the default one...
>
> Thanks,
>
> Rick
>
> -------- Original Message --------
> Subject: Re: [Neo4j] Bug: LuceneFullTextQueryIndex service ignoring
> last word/term
> From: Morten Barklund <mor...@barklund.dk>
> Date: Wed, September 15, 2010 12:29 pm
> To: Neo4j user discussions <u...@lists.neo4j.org>
>
> Hi
>
> I might be overly simplistic here, but why not lowercase the text,
> remove html markup, then remove all non-word-or-space-characters,
> store this as the stripped version of the text on the node (for
> de-indexing) and index this?
>
> /Barklund
>
> On Wed, Sep 15, 2010 at 18:07, <rick.bullo...@burningskysoftware.com> wrote:
> > Actually, it seems like a deeper bug/design flaw in Lucene's
> > analyzer/tokenizer. The actual text is HTML text, with <p> and </p>
> > wrappers. Lucene somewhat randomly seems to treat the last two words
> > as a single token, and in other cases ignore it altogether. The dot
> > character screws it up even more, because even if it tokenizes with
> > the dot character, you can't query with it (or at least nothing gets
> > returned).
> >
> > Hmmm. I really don't want to have to write a tokenizer/analyzer if I
> > can avoid it. Seems like a LOT of work.
> >
> > Do you have any example code of a custom tokenizer/analyzer we could
> > start from?
> >
> > Thanks,
> >
> > Rick
> >
> > -------- Original Message --------
> > Subject: Re: [Neo4j] Bug: LuceneFullTextQueryIndex service ignoring
> > last word/term
> > From: Mattias Persson <matt...@neotechnology.com>
> > Date: Wed, September 15, 2010 11:47 am
> > To: Neo4j user discussions <u...@lists.neo4j.org>
> >
> > Couldn't it be that sentences ends with a dot... so "Cheese is good."
> > will index the words: ["Cheese", "is", "good."] ? Observe the last
> > word isn't "good", it's "good." with a dot. I know that has messed up
> > some searches for me at least. You could perhaps override the
> > implementation and instantiate an Analyzer/Tokenizer which gets rid
> > of such punctuation characters?
> >
> > 2010/9/15 <rick.bullo...@burningskysoftware.com>
> > > Using neo4j-index-1.1 and lucene-core-2.9.2, by the way.
> > >
> > > -------- Original Message --------
> > > Subject: Re: [Neo4j] Bug: LuceneFullTextQueryIndex service ignoring
> > > last word/term
> > > From: Mattias Persson <matt...@neotechnology.com>
> > > Date: Wed, September 15, 2010 10:37 am
> > > To: Neo4j user discussions <u...@lists.neo4j.org>
> > >
> > > That sounds weird. Look at
> > > TestLuceneFulltextIndexService#testSimpleFulltext
> > > method, it queries for the last word and it seems to work.
> > > Could you provide more info on this?
> > >
> > > 2010/9/15 <rick.bullo...@burningskysoftware.com>
> > > > I've noticed that when indexing full text, the last term/word is
> > > > always ignored. This is a major issue, but I'm not sure if it is
> > > > in the index utils or in Lucene itself.
> > > >
> > > > Any thoughts?
> > > >
> > > > Thanks,
> > > >
> > > > Rick
> > >
> > > --
> > > Mattias Persson, [matt...@neotechnology.com]
> > > Hacker, Neo Technology
> > > www.neotechnology.com
> >
> > --
> > Mattias Persson, [matt...@neotechnology.com]
> > Hacker, Neo Technology
> > www.neotechnology.com
>
> --
> Morten Barklund
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user