Re: SV: Indexing HTML

2002-12-07 Thread Leo Galambos
> I'm not sure this is a solution to your problem. However, it seems that the
> HTMLParser used by the IndexHTML class has problems parsing the document
> (there is a test class included in the jar):
> 
> 
> >java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar
> org.apache.lucene.demo.html.Test f01529.txt
> Title: Webcz.cz - Power of search
> Parse Aborted: Encountered "\'" at line 106, column 27.
> Was expecting one of:
>  ...
>  ...
> /Ronnie

Hi Ronnie!

I know about it, and the exception is handled well (see the log file below).
I have found a better example than 1529; try this one:
http://com-os2.ms.mff.cuni.cz/bugs/f00034.txt This file cannot get through
the Lucene HTML parser (I have tried 1.2 and IBM JDK 1.3.1r3). The file is
unusual, i.e. it has two titles, two base tags, etc.

I do not have a debugger here, so I cannot find the line where the bug is.
If you try your magic, please let me know about the patch. :) THX

-g-



adding save/d00320/f01516.html
Parse Aborted: Lexical error at line 68, column 11.  Encountered: "\u0178" (376), after : ""
:
adding save/d00320/f01527.html
Parse Aborted: Encountered "=" at line 83, column 48.
Was expecting one of:
 ...
 ...

adding save/d00320/f01528.html
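
The log above shows the indexer reporting a parse failure for one file and
then moving on to the next. A minimal sketch of that skip-and-continue
pattern follows. It is only an illustration under stated assumptions: the
DocParser interface, the SkipBadFiles class and the indexAll method are
hypothetical names made up for this example and are not part of Lucene or
its demo classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipBadFiles {
    /** Hypothetical stand-in for an HTML parser; NOT a Lucene API. */
    public interface DocParser {
        String parseTitle(String path) throws Exception;
    }

    /**
     * Tries every path in order. A parse failure is written to the log and
     * the file is skipped, so one broken document does not stop the run.
     * Returns the paths that parsed without an exception.
     */
    public static List<String> indexAll(List<String> paths, DocParser parser,
                                        StringBuilder log) {
        List<String> indexed = new ArrayList<>();
        for (String path : paths) {
            log.append("adding ").append(path).append('\n');
            try {
                parser.parseTitle(path);   // may throw on malformed HTML
                indexed.add(path);         // reached only if parsing succeeded
            } catch (Exception e) {
                log.append("Parse Aborted: ").append(e.getMessage()).append('\n');
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        // Fake parser for the demo: rejects any path containing "bad".
        DocParser fake = path -> {
            if (path.contains("bad")) {
                throw new Exception("simulated parse error");
            }
            return "some title";
        };
        StringBuilder log = new StringBuilder();
        List<String> ok = indexAll(
                Arrays.asList("a.html", "bad.html", "c.html"), fake, log);
        System.out.print(log);
        System.out.println("indexed: " + ok);
    }
}
```

Catching the exception per file keeps the run alive, but the skipped files
are still missing from the index, so the parser problem itself remains.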



--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: SV: Indexing HTML

2002-12-07 Thread Otis Gospodnetic
I have had good experiences with the NekoHTML parser.

Otis



