Hi Otis,

The idea behind stripHTML is pretty simple.  It's just a hand-built 
lexer that looks like this:

while more char
        if comment start, scarf til end comment
        if char is < then
                if SCRIPT tag scarf til end SCRIPT;
                        [same with A, STYLE, HEAD, FORM etc...]
        endif
        if &blort; and not stuff lile LT, GT, AMP, QUOT, NBSP, scarf
        if whitespace scarf leaving just one bit of whitespace
        otherwise just add char to stripped text.
endwhile

Nothing tricky in the code, but I should have used ANTLR itself to build 
this lexer.  it got to be a big method with lots of book keeping code as 
in all lexers.

Ter

On Thursday, April 18, 2002, at 11:05  PM, Otis Gospodnetic wrote:

> Hello Terrence,
>
> Ah, you got me.
> I guess I need a bit of both.
> I need to just strip HTML and get raw body text so that I can stick it
> in Lucene's index.
> I would also like something that can extract at least the
> <title>...</title> stuff, so that I can stick that in a separate field
> in Lucene index.
> While doing that I, like you, need to be able to handle poorly
> formatted web pages.
>
> In a future I may need something that has the ability to extract HREFs,
> but I'll stick to one of the XP principles and just look for something
> that meets current needs :)
>
> I looked for ANTLR-based HTML parser a few days ago, but must have
> missed the one you pointed out.  I'll take a look at it now.
> Can you share or describe your stripHTML method?  Simple java that
> looks for <s and >s or something smarter?
>
> Thanks,
> Otis
> P.S.
> This type of thing makes me wish I can use Perl or Python :)
>
>
> --- Terence Parr <[EMAIL PROTECTED]> wrote:
>>
>> On Thursday, April 18, 2002, at 10:28  PM, Otis Gospodnetic wrote:
>>
>>> Hello,
>>>
>>> I need to select an HTML parser for the application that I'm
>> writing
>>> and I'm not sure what to choose.
>>> The HTML parser included with Lucene looks flimsy, JTidy looks like
>> a
>>> hack and an overkill, using classes written for Swing
>>> (javax.swing.text.html.parser) seems wrong, and I haven't tried
>> David
>>> McNicol's parser (included with Spindle).
>>>
>>> Somebody on this list must have done some research on this subject.
>>> Can anyone share some experiences?
>>> Have you found a better HTML parser than any of those I listed
>> above?
>>> If your application deals with HTML, what do you use for parsing
>> it?
>>
>> Hi Otis,
>>
>> I have an HTML parser built for ANTLR, but it's pretty strict in what
>> it
>> accepts.  Not sure how useful it will be for you, but here it is:
>>
>> http://www.antlr.org/grammars/HTML
>>
>> I am not sure what your goal is, but I personally have to scarf all
>> sorts of HTML from various websites to such them into the jGuru
>> search
>> engine.  I use a simple stripHTML() method I wrote to handle it.
>> Works
>> great.  Kills everything but the text.  is that the kind of thing you
>>
>> are looking for or do you really want to parse not filter?
>>
>> Terence
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
>>
>>
>> --
>> To unsubscribe, e-mail:
>> <mailto:[EMAIL PROTECTED]>
>> For additional commands, e-mail:
>> <mailto:[EMAIL PROTECTED]>
>>
>
>
> __________________________________________________
> Do You Yahoo!?
> Yahoo! Tax Center - online filing with TurboTax
> http://taxes.yahoo.com/
>
> --
> To unsubscribe, e-mail:   <mailto:lucene-user-
> [EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:lucene-user-
> [EMAIL PROTECTED]>
>


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

Reply via email to