Thanks! Yeah, that's exactly what I need to do. I can really just leave Lucene out of the picture, and just feed it the "content" text that I have parsed out of the html "document". Thanks again for your help!
-----Original Message----- From: Craig Walls [mailto:wallsc@;michaels.com] Sent: Thursday, November 14, 2002 3:18 PM To: Lucene Users List Subject: Re: HTML Analyzer? Oh wait...I just re-read your original post and apparently I misunderstood (I had just found a hammer and at a glance your problem looked like a nail). Sorry about that. But it's not entirely unrelated...You *could* write your own Analyzer that uses the HTMLEditorKit stuff to only index the text of your HTML documents and not the tags. I'm not sure if that's exactly what you want, but it's an option. Regarding the HTMLEditorKit and related stuff, it's not that hard, actually. It's not entirely unlike using a SAX parser to parse XML documents. The gist of it is as follows: First you create a class that extends HTMLEditorKit.ParserCallback. By default all of the "handleXXX()" methods are empty implementations, so override the ones you care about. In my case, I only needed to read comments and text, so I overrode the handleComment() and handleText() methods. Now, in the code that will use all of this, I started by opening a java.io.FileReader on my HTML document. Next, I created an instance of ParserDelegator (I've not delved into the API code yet to know why I had to do this). Then I called the parse() method of my ParserDelegator, passing it the reader, an instnace of my ParserCallback class, and a boolean true. At that time, the parser took over and started making calls back to the handleXXX() methods that I wrote. "Lichty, Kent" wrote: Well, let me know if you figure it out and I will do the same. I don't quite understand how those classes would help out. Would you somehow use them to create the Reader object that is passed to create the TokenStream object? "Lichty, Kent" wrote: > Well, let me know if you figure it out and I will do the same. I don't > quite understand how those classes would help out. Would you somehow use > them to create the Reader object that is passed to create the TokenStream > object? > > -----Original Message----- > From: Craig Walls [mailto:wallsc@;michaels.com] > Sent: Thursday, November 14, 2002 2:39 PM > To: Lucene Users List > Subject: Re: HTML Analyzer? > > Ironically, I just had to solve this exact problem just 10 minutes ago... > > Check into javax.swing.text.html.HTMLEditorKit and > javax.swing.text.html.HTMLDocument. Here's a URL that I found helpful (the > site > is Japanese, but the source code is still Java): > > http://java-house.jp/ml/archive/j-h-b/037727.html?#_body > > "Lichty, Kent" wrote: > > > We have a web application that builds pages "on the fly" by reading > directly > > from a database. The database contains both normal content and HTML. We > use > > Lucene as our search engine, but I need to figure out how to cause it to > NOT > > include content that is within HTML tags. I assume that this entails the > > creation of a custom Analyzer. Are there any existing Analyzers already > out > > there that work like this? Thanks! > > > > ---------- Internet E-mail Confidentiality Disclaimer ---------- > > > > PRIVILEGED / CONFIDENTIAL INFORMATION may be contained in this message. > If > > you are not the addressee indicated in this message or the employee or > agent > > responsible for delivering it to the addressee, you are hereby on notice > > that you are in possession of confidential and privileged information. > Any > > dissemination, distribution, or copying of this e-mail is strictly > > prohibited. In such case, you should destroy this message and kindly > notify > > the sender by reply e-mail. Please advise immediately if you or your > > employer do not consent to Internet email for messages of this kind. > > > > Opinions, conclusions, and other information in this message that do not > > relate to the official business of my firm shall be understood as neither > > given nor endorsed by it. > > > > -- > > To unsubscribe, e-mail: > <mailto:lucene-user-unsubscribe@;jakarta.apache.org> > > For additional commands, e-mail: > <mailto:lucene-user-help@;jakarta.apache.org> > > -- > To unsubscribe, e-mail: > <mailto:lucene-user-unsubscribe@;jakarta.apache.org> > For additional commands, e-mail: > <mailto:lucene-user-help@;jakarta.apache.org> > > ---------- Internet E-mail Confidentiality Disclaimer ---------- > > PRIVILEGED / CONFIDENTIAL INFORMATION may be contained in this message. If > you are not the addressee indicated in this message or the employee or agent > responsible for delivering it to the addressee, you are hereby on notice > that you are in possession of confidential and privileged information. Any > dissemination, distribution, or copying of this e-mail is strictly > prohibited. In such case, you should destroy this message and kindly notify > the sender by reply e-mail. Please advise immediately if you or your > employer do not consent to Internet email for messages of this kind. > > Opinions, conclusions, and other information in this message that do not > relate to the official business of my firm shall be understood as neither > given nor endorsed by it. > > -- > To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@;jakarta.apache.org> > For additional commands, e-mail: <mailto:lucene-user-help@;jakarta.apache.org> -- To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@;jakarta.apache.org> For additional commands, e-mail: <mailto:lucene-user-help@;jakarta.apache.org> ---------- Internet E-mail Confidentiality Disclaimer ---------- PRIVILEGED / CONFIDENTIAL INFORMATION may be contained in this message. If you are not the addressee indicated in this message or the employee or agent responsible for delivering it to the addressee, you are hereby on notice that you are in possession of confidential and privileged information. Any dissemination, distribution, or copying of this e-mail is strictly prohibited. In such case, you should destroy this message and kindly notify the sender by reply e-mail. Please advise immediately if you or your employer do not consent to Internet email for messages of this kind. Opinions, conclusions, and other information in this message that do not relate to the official business of my firm shall be understood as neither given nor endorsed by it. -- To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@;jakarta.apache.org> For additional commands, e-mail: <mailto:lucene-user-help@;jakarta.apache.org>