Removing HTML markup is not a trivial task, but luckily the Apache Solr team has already created additional analyzers for Lucene that do what I need (the analysis package in Solr has a lot of really good stuff in it). I will still need some help from the Neo team to understand how to use a specific analyzer instead of the default one...

Thanks,

Rick

-------- Original Message --------
Subject: Re: [Neo4j] Bug: LuceneFullTextQueryIndex service ignoring last word/term
From: Morten Barklund <mor...@barklund.dk>
Date: Wed, September 15, 2010 12:29 pm
To: Neo4j user discussions <u...@lists.neo4j.org>

Hi

I might be overly simplistic here, but why not lowercase the text, remove HTML markup, then remove all non-word-or-space characters, store this as the stripped version of the text on the node (for de-indexing), and index this?

/Barklund

On Wed, Sep 15, 2010 at 18:07, <rick.bullo...@burningskysoftware.com> wrote:
> Actually, it seems like a deeper bug/design flaw in Lucene's
> analyzer/tokenizer. The actual text is HTML text, with <p> and </p>
> wrappers. Lucene somewhat randomly seems to treat the last two words
> as a single token, and in other cases ignore it altogether. The dot
> character screws it up even more, because even if it tokenizes with the
> dot character, you can't query with it (or at least nothing gets
> returned).
>
> Hmmm. I really don't want to have to write a tokenizer/analyzer if I
> can avoid it. Seems like a LOT of work.
>
> Do you have any example code of a custom tokenizer/analyzer we could
> start from?
>
> Thanks,
>
> Rick
>
> -------- Original Message --------
> Subject: Re: [Neo4j] Bug: LuceneFullTextQueryIndex service ignoring
> last word/term
> From: Mattias Persson <matt...@neotechnology.com>
> Date: Wed, September 15, 2010 11:47 am
> To: Neo4j user discussions <u...@lists.neo4j.org>
>
> Couldn't it be that sentences end with a dot... so "Cheese is good."
> will index the words: ["Cheese", "is", "good."]? Observe that the last
> word isn't "good", it's "good." with a dot. I know that has messed up
> some searches for me at least. You could perhaps override the
> implementation and instantiate an Analyzer/Tokenizer which gets rid of
> such punctuation characters?
>
> 2010/9/15 <rick.bullo...@burningskysoftware.com>
> > Using neo4j-index-1.1 and lucene-core-2.9.2, by the way.
> >
> > -------- Original Message --------
> > Subject: Re: [Neo4j] Bug: LuceneFullTextQueryIndex service ignoring
> > last word/term
> > From: Mattias Persson <matt...@neotechnology.com>
> > Date: Wed, September 15, 2010 10:37 am
> > To: Neo4j user discussions <u...@lists.neo4j.org>
> >
> > That sounds weird. Look at the
> > TestLuceneFulltextIndexService#testSimpleFulltext
> > method: it queries for the last word and it seems to work.
> > Could you provide more info on this?
> >
> > 2010/9/15 <rick.bullo...@burningskysoftware.com>
> > > I've noticed that when indexing full text, the last term/word is
> > > always ignored. This is a major issue, but I'm not sure if it is
> > > in the index utils or in Lucene itself.
> > >
> > > Any thoughts?
> > >
> > > Thanks,
> > >
> > > Rick
> >
> > --
> > Mattias Persson, [matt...@neotechnology.com]
> > Hacker, Neo Technology
> > www.neotechnology.com
--
Morten Barklund
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
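
For readers who want to try the pre-processing suggested in the thread (lowercase the text, remove HTML markup, then remove all non-word-or-space characters), here is a minimal sketch in plain Java. The class name TextNormalizer and the method normalize are hypothetical and are not part of Neo4j, Lucene, or Solr. The regex-based tag removal is an assumption made for brevity: it is naive, does not decode HTML entities, and does not handle comments or scripts. Punctuation is replaced with a space rather than deleted outright, so that adjacent words are not merged.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Neo4j, Lucene, or Solr. */
final class TextNormalizer {
    // Anything that looks like a tag, e.g. <p> or </p>. Naive: entities,
    // comments, and script content are not handled.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    // Every character that is not a letter, a decimal digit, or whitespace.
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{Nd}\\s]");
    // Runs of whitespace, collapsed to a single space at the end.
    private static final Pattern SPACES = Pattern.compile("\\s+");

    private TextNormalizer() {
    }

    /** Lowercases the input, strips tags and punctuation, and collapses whitespace. */
    static String normalize(String html) {
        String text = html.toLowerCase(Locale.ROOT);
        // Replace with a space (not the empty string) so "a<br>b" stays two words.
        text = TAGS.matcher(text).replaceAll(" ");
        text = NON_WORD.matcher(text).replaceAll(" ");
        return SPACES.matcher(text).replaceAll(" ").trim();
    }
}
```

With this sketch, TextNormalizer.normalize("<p>Cheese is good.</p>") returns "cheese is good", so the last word would be indexed as "good" rather than "good." with a trailing dot. As suggested in the thread, the normalized string would be stored on the node so the same value can be used for de-indexing, and query strings would presumably need the same normalization before searching.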