This is probably what you just found, but for others:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
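
If you would rather not pull in the Solr analysis package, the pre-processing Morten suggests in the quoted thread below (lowercase, strip markup, drop non-word characters) can be roughly sketched in plain Java. This is only a sketch under some assumptions: TextNormalizer and normalize are made-up names, not part of Neo4j, Lucene, or Solr; the tag regex is naive and is not a real HTML parser; and punctuation is replaced with a space rather than deleted, so that "a,b" does not collapse into "ab".

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Neo4j, Lucene, or Solr. */
final class TextNormalizer {
    // Naive tag matcher: everything from '<' to the next '>'.
    // Assumption: good enough for simple <p>...</p> wrappers only.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    // One or more characters that are neither letters, digits, nor whitespace.
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{N}\\s]+");
    // Runs of whitespace, collapsed to a single space at the end.
    private static final Pattern SPACES = Pattern.compile("\\s+");

    private TextNormalizer() {}

    /** Lowercases the input, strips tags, then strips punctuation. */
    static String normalize(String html) {
        String text = html.toLowerCase(Locale.ROOT);
        text = TAGS.matcher(text).replaceAll(" ");
        text = NON_WORD.matcher(text).replaceAll(" ");
        return SPACES.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // The trailing dot from Mattias's example no longer sticks to "good".
        System.out.println(normalize("<p>Cheese is good.</p>"));
    }
}
```

The normalized string is what you would store on the node and hand to the index. Note that this does not decode HTML entities such as &amp;, and it splits contractions ("don't" becomes "don t"), so the Solr char filter linked above is probably still the better choice for anything beyond simple markup.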

--
Toby Matejovsky


On Wed, Sep 15, 2010 at 12:49 PM, <rick.bullo...@burningskysoftware.com> wrote:

>   Removing HTML markup is not a trivial task, but luckily, the Apache
>   Solr team has already created additional analyzers for Lucene that do
>   what I need (the analysis package in Solr has a lot of really good
>   stuff in it).
>
>
>
>   I will still need some help from the Neo team to understand how to use
>   a specific analyzer instead of the default one...
>
>
>
>   Thanks,
>
>
>
>   Rick
>
>
>
>   -------- Original Message --------
>   Subject: Re: [Neo4j] Bug: LuceneFullTextQueryIndex service ignoring
>   last word/term
>    From: Morten Barklund <[1]mor...@barklund.dk>
>   Date: Wed, September 15, 2010 12:29 pm
>   To: Neo4j user discussions <[2]u...@lists.neo4j.org>
>   Hi
>   I might be overly simplistic here, but why not lowercase the text,
>   remove HTML markup, then remove all non-word-or-space characters,
>   store this as the stripped version of the text on the node (for
>   de-indexing), and index this?
>   /Barklund
>   On Wed, Sep 15, 2010 at 18:07,
>    <[3]rick.bullo...@burningskysoftware.com> wrote:
>   > Actually, it seems like a deeper bug/design flaw in Lucene's
>   > analyzer/tokenizer. The actual text is HTML text, with <p> and </p>
>   > wrappers. Lucene seems, somewhat randomly, to treat the last two words
>   > as a single token in some cases and to ignore the last word altogether
>   > in others. The dot character makes it worse: even if the text is
>   > tokenized with the dot character, you can't query with it (or at
>   > least nothing gets returned).
>   >
>   >
>   >
>   > Hmmm. I really don't want to have to write a tokenizer/analyzer if I
>   > can avoid it. Seems like a LOT of work.
>   >
>   >
>   >
>   > Do you have any example code of a custom tokenizer/analyzer we could
>   > start from?
>   >
>   >
>   >
>   > Thanks,
>   >
>   >
>   >
>   > Rick
>   >
>   > -------- Original Message --------
>   > Subject: Re: [Neo4j] Bug: LuceneFullTextQueryIndex service ignoring
>   > last word/term
>   > From: Mattias Persson <[1][4]matt...@neotechnology.com>
>   > Date: Wed, September 15, 2010 11:47 am
>   > To: Neo4j user discussions <[2][5]u...@lists.neo4j.org>
>   > Couldn't it be that sentences end with a dot... so "Cheese is good."
>   > will index the words: ["Cheese", "is", "good."]? Observe that the last
>   > word isn't "good", it's "good." with a dot. I know that has messed up
>   > some searches for me at least. You could perhaps override the
>   > implementation and instantiate an Analyzer/Tokenizer which gets rid of
>   > such punctuation characters?
>    > 2010/9/15 <[3][6]rick.bullo...@burningskysoftware.com>
>    > > Using neo4j-index-1.1 and lucene-core-2.9.2, by the way.
>   > >
>   > >
>   > >
>   > >
>   > >
>   > > -------- Original Message --------
>   > > Subject: Re: [Neo4j] Bug: LuceneFullTextQueryIndex service ignoring
>   > > last word/term
>    > > From: Mattias Persson <[1][4][7]matt...@neotechnology.com>
>   > > Date: Wed, September 15, 2010 10:37 am
>    > > To: Neo4j user discussions <[2][5][8]u...@lists.neo4j.org>
>   > > That sounds weird. Look at the
>   > > TestLuceneFulltextIndexService#testSimpleFulltext method; it queries
>   > > for the last word and it seems to work.
>   > > Could you provide more info on this?
>    > > 2010/9/15 <[3][6][9]rick.bullo...@burningskysoftware.com>
>   > > > I've noticed that when indexing full text, the last term/word is
>   > > > always ignored. This is a major issue, but I'm not sure if it is in
>   > > > the index utils or in Lucene itself.
>   > > >
>   > > >
>   > > >
>   > > > Any thoughts?
>   > > >
>   > > >
>   > > >
>   > > > Thanks,
>   > > >
>   > > >
>   > > >
>   > > > Rick
>   > > > _______________________________________________
>   > > > Neo4j mailing list
>    > > > [4][7][10]u...@lists.neo4j.org
>   > > > [5][8][11]https://lists.neo4j.org/mailman/listinfo/user
>   > > >
>   > > --
>   > > Mattias Persson, [[6][9][12]matt...@neotechnology.com]
>   > > Hacker, Neo Technology
>   > > [7][10][13]www.neotechnology.com
>   > >
>   > > References
>   > >
>   > > 1. [13][16]mailto:matt...@neotechnology.com
>   > > 2. [14][17]mailto:user@lists.neo4j.org
>   > > 3. [15][18]mailto:rick.bullo...@burningskysoftware.com
>   > > 4. [16][19]mailto:User@lists.neo4j.org
>   > > 5. [17][20]https://lists.neo4j.org/mailman/listinfo/user
>   > > 6. [18][21]mailto:matt...@neotechnology.com
>   > > 7. [19][22]http://www.neotechnology.com/
>   > > 8. [20][23]mailto:User@lists.neo4j.org
>   > > 9. [21][24]https://lists.neo4j.org/mailman/listinfo/user
>   > >
>   > --
>   > Mattias Persson, [[24][27]matt...@neotechnology.com]
>   > Hacker, Neo Technology
>   > [25][28]www.neotechnology.com
>   >
>   > References
>   >
>   > 1. [31]mailto:matt...@neotechnology.com
>   > 2. [32]mailto:user@lists.neo4j.org
>   > 3. [33]mailto:rick.bullo...@burningskysoftware.com
>   > 4. [34]mailto:matt...@neotechnology.com
>   > 5. [35]mailto:user@lists.neo4j.org
>   > 6. [36]mailto:rick.bullo...@burningskysoftware.com
>   > 7. [37]mailto:User@lists.neo4j.org
>   > 8. [38]https://lists.neo4j.org/mailman/listinfo/user
>   > 9. [39]mailto:matt...@neotechnology.com
>   > 10. [40]http://www.neotechnology.com/
>   > 11. [41]mailto:User@lists.neo4j.org
>   > 12. [42]https://lists.neo4j.org/mailman/listinfo/user
>   > 13. [43]mailto:matt...@neotechnology.com
>   > 14. [44]mailto:user@lists.neo4j.org
>   > 15. [45]mailto:rick.bullo...@burningskysoftware.com
>   > 16. [46]mailto:User@lists.neo4j.org
>   > 17. [47]https://lists.neo4j.org/mailman/listinfo/user
>   > 18. [48]mailto:matt...@neotechnology.com
>   > 19. [49]http://www.neotechnology.com/
>   > 20. [50]mailto:User@lists.neo4j.org
>   > 21. [51]https://lists.neo4j.org/mailman/listinfo/user
>   > 22. [52]mailto:User@lists.neo4j.org
>   > 23. [53]https://lists.neo4j.org/mailman/listinfo/user
>   > 24. [54]mailto:matt...@neotechnology.com
>   > 25. [55]http://www.neotechnology.com/
>   > 26. [56]mailto:User@lists.neo4j.org
>   > 27. [57]https://lists.neo4j.org/mailman/listinfo/user
>    >
>   --
>   Morten Barklund
>
> References
>
>   1. mailto:mor...@barklund.dk
>    2. mailto:user@lists.neo4j.org
>   3. mailto:rick.bullo...@burningskysoftware.com
>   4. mailto:matt...@neotechnology.com
>   5. mailto:user@lists.neo4j.org
>   6. mailto:rick.bullo...@burningskysoftware.com
>    7. mailto:matt...@neotechnology.com
>   8. mailto:user@lists.neo4j.org
>   9. mailto:rick.bullo...@burningskysoftware.com
>  10. mailto:User@lists.neo4j.org
>  11. https://lists.neo4j.org/mailman/listinfo/user
>  12. mailto:matt...@neotechnology.com
>  13. http://www.neotechnology.com/
>  14. mailto:User@lists.neo4j.org
>  15. https://lists.neo4j.org/mailman/listinfo/user
>  16. mailto:matt...@neotechnology.com
>  17. mailto:user@lists.neo4j.org
>  18. mailto:rick.bullo...@burningskysoftware.com
>  19. mailto:User@lists.neo4j.org
>  20. https://lists.neo4j.org/mailman/listinfo/user
>  21. mailto:matt...@neotechnology.com
>  22. http://www.neotechnology.com/
>  23. mailto:User@lists.neo4j.org
>  24. https://lists.neo4j.org/mailman/listinfo/user
>  25. mailto:User@lists.neo4j.org
>  26. https://lists.neo4j.org/mailman/listinfo/user
>  27. mailto:matt...@neotechnology.com
>  28. http://www.neotechnology.com/
>  29. mailto:User@lists.neo4j.org
>  30. https://lists.neo4j.org/mailman/listinfo/user
>  31. mailto:matt...@neotechnology.com
>  32. mailto:user@lists.neo4j.org
>  33. mailto:rick.bullo...@burningskysoftware.com
>  34. mailto:matt...@neotechnology.com
>  35. mailto:user@lists.neo4j.org
>  36. mailto:rick.bullo...@burningskysoftware.com
>  37. mailto:User@lists.neo4j.org
>  38. https://lists.neo4j.org/mailman/listinfo/user
>  39. mailto:matt...@neotechnology.com
>  40. http://www.neotechnology.com/
>  41. mailto:User@lists.neo4j.org
>  42. https://lists.neo4j.org/mailman/listinfo/user
>  43. mailto:matt...@neotechnology.com
>  44. mailto:user@lists.neo4j.org
>  45. mailto:rick.bullo...@burningskysoftware.com
>  46. mailto:User@lists.neo4j.org
>  47. https://lists.neo4j.org/mailman/listinfo/user
>  48. mailto:matt...@neotechnology.com
>  49. http://www.neotechnology.com/
>  50. mailto:User@lists.neo4j.org
>  51. https://lists.neo4j.org/mailman/listinfo/user
>  52. mailto:User@lists.neo4j.org
>  53. https://lists.neo4j.org/mailman/listinfo/user
>  54. mailto:matt...@neotechnology.com
>  55. http://www.neotechnology.com/
>  56. mailto:User@lists.neo4j.org
>  57. https://lists.neo4j.org/mailman/listinfo/user
>  58. mailto:User@lists.neo4j.org
>  59. https://lists.neo4j.org/mailman/listinfo/user
>  60. mailto:User@lists.neo4j.org
>  61. https://lists.neo4j.org/mailman/listinfo/user
>
