Dear Leo,

I'm not sure this is a solution to your problem. However, it seems that the
HTMLParser used by the IndexHTML class has problems parsing the document
(there is a test class included in the jar):


>java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar org.apache.lucene.demo.html.Test f01529.txt
Title: Webcz.cz - Power of search
Parse Aborted: Encountered "\'" at line 106, column 27.
Was expecting one of:
    <ArgName> ...
    <TagEnd> ...


If you look at the source of that document, you can see there is a JavaScript
block containing this problematic line:


        document.write('<s' + 'cript src="http://ad.webcz.cz/adwebcz/adscript.asp?a=10&t=0&b=0&x=468&y=60&nocache=' + nIndex + '">');
                          ^


It looks to me as if the HTMLParser does _not_ handle <script> tags
correctly, i.e. it does not ignore everything up to </script>. If you check
stdout, there should be error messages from the ParserThread class like the
one above.
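
As a stopgap, you could remove the script blocks from the text before it ever
reaches the parser. The following is only a rough sketch, not part of Lucene:
the class name ScriptStripper and the method stripScripts are made up for
illustration, and it assumes Java 1.4 or later (for java.util.regex) and that
you read the file into a String yourself. A regex like this is not a real
HTML parser, so treat it as a workaround rather than a fix.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Lucene): drops <script>...</script>
// blocks so a strict HTML parser never sees the JavaScript inside them.
public class ScriptStripper {

    // Matches an opening <script ...> tag, then everything (lazily,
    // across line breaks) up to the first closing </script> tag.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the input with every script block removed.
    public static String stripScripts(String html) {
        return SCRIPT_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Same shape as the problematic line: a tag split across two
        // string literals inside document.write().
        String html = "<html><body><script language=\"JavaScript\">\n"
                + "document.write('<s' + 'cript src=\"x.js\">');\n"
                + "</script><p>Hello</p></body></html>";
        System.out.println(stripScripts(html));
    }
}
```

The cleaned string can then be handed to whatever parser you use. Note that
the pattern stops at the first </script> it finds, so a script that itself
contains that literal text would be cut short.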

I tried parsing the same document with another HTML parser class without any
problems. Maybe try replacing the HTMLParser class used by HTMLDocument with
your own? Or edit the HTMLParser.jj file if you know JavaCC.


/Ronnie



> -----Original Message-----
> From: Leo Galambos [mailto:[EMAIL PROTECTED]]
> Sent: 3 December 2002 20:32
> To: [EMAIL PROTECTED]
> Subject: Indexing HTML
>
>
> I tried to use IndexHTML (demo) and Lucene 1.2 for indexing *.CZ, but
> Lucene often falls into a never-ending loop. I've analyzed my data, so I
> know which file(s) bring Lucene down. I don't see anything special in the
> file(s), so I think they get through the parser to the main Lucene
> routines (and then the problem could be in Merger).
>
> Could you help me, please?
>
> One of the problematic files:
> http://com-os2.ms.mff.cuni.cz/bugs/f01529.txt
> My program (based on Lucene demo):
> http://com-os2.ms.mff.cuni.cz/bugs/IndexHTML.java
>
> Thank you very much.
>
> -g-
>
>
> --
> To unsubscribe, e-mail:
> <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
> <mailto:[EMAIL PROTECTED]>
>
>


