Re: Indexing HTML

2002-12-04 Thread Ronnie Kolehmainen
Dear Leo,

I'm not sure this is a solution to your problem. However, it seems that the
HTMLParser used by the IndexHTML class has problems parsing the document
(there is a test class included in the jar):


>java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar
org.apache.lucene.demo.html.Test f01529.txt
Title: Webcz.cz - Power of search
Parse Aborted: Encountered "\'" at line 106, column 27.
Was expecting one of:
 ...
 ...


If you look at the source of that document you can see there is a JavaScript
block with this problematic line:


document.write('http://ad.webcz.cz/adwebcz/adscript.asp?a=10&t=0&b=0&x=468&y=60&nocache=' + nIndex + '">');
^


It looks to me as if the HTMLParser does _not_ handle the <script> tags
correctly, i.e. ignore everything until </script>. If you check stdout there
should be error messages from the ParserThread class like the one above.

I tried parsing the same document with another HTML parser class without any
problems. Maybe try replacing the HTMLParser class used by HTMLDocument with
your own? Or edit the HTMLParser.jj file if you know JavaCC.
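
A cruder stopgap than either suggestion above (and not something taken from
this thread) would be to strip the script elements from the HTML before it
reaches the parser. The sketch below uses only java.util.regex; the class name
ScriptStripper and the method stripScripts are made up for this example, and a
regular expression is no substitute for a real HTML parser, so treat it as an
illustration under those assumptions.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Lucene): removes script elements from HTML text. */
final class ScriptStripper {
    // Matches <script ...> through the next </script>, ignoring case and spanning lines.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the input with every script element replaced by a single space. */
    static String stripScripts(String html) {
        return SCRIPT_BLOCK.matcher(html).replaceAll(" ");
    }

    public static void main(String[] args) {
        String html = "<html><body><script>document.write('<a href=\"x\">');</script>"
                + "<p>Power of search</p></body></html>";
        // prints: <html><body> <p>Power of search</p></body></html>
        System.out.println(stripScripts(html));
    }
}
```

The cleaned string could then be passed to whatever parser is in use, for
example through a java.io.StringReader.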


/Ronnie



> -----Original Message-----
> From: Leo Galambos [mailto:[EMAIL PROTECTED]]
> Sent: 3 December 2002 20:32
> To: [EMAIL PROTECTED]
> Subject: Indexing HTML
>
>
> I tried to use IndexHTML (demo) and Lucene 1.2 for indexing *.CZ, but
> Lucene often falls into a never-ending loop. I've analyzed my data, so I know
> which file(s) bring Lucene down. I don't see anything special in the
> file(s), so I think they may pass through the parser to the main Lucene
> routines (and then the problem could be in the Merger).
>
> Could you help me, please?
>
> One of the problematic files:
> http://com-os2.ms.mff.cuni.cz/bugs/f01529.txt
> My program (based on Lucene demo):
> http://com-os2.ms.mff.cuni.cz/bugs/IndexHTML.java
>
> Thank you very much.
>
> -g-
>
>
> --
> To unsubscribe, e-mail:
> 
> For additional commands, e-mail:
> 
>
>






Incremental indexing

2002-12-04 Thread Eric Jain
Currently, I use the following procedure to update an index incrementally:

1. Build document
2. Open index reader
3. Delete any previous version of the document using a key field
4. Close index reader
5. Open index writer
6. Add document to index
7. Close index writer

Repeat
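
One way the per-document open/close cycles might be reduced is to batch the
work: delete the old versions of many documents in a single reader pass, then
add all the new versions in a single writer pass. The sketch below only models
that idea with a toy in-memory index. ToyIndex, IncrementalUpdate, and their
methods are hypothetical names invented for this example, not the Lucene API,
and whether batching suits a given application is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Toy in-memory index; a hypothetical stand-in, NOT the Lucene API. */
final class ToyIndex {
    final List<Map<String, String>> docs = new ArrayList<>();
    int sessionsOpened = 0; // counts simulated reader/writer open-close cycles

    /** Models steps 2-4 once for many keys: open reader, delete by key field, close. */
    int deleteAll(String keyField, List<String> keys) {
        sessionsOpened++;
        int deleted = 0;
        for (Iterator<Map<String, String>> it = docs.iterator(); it.hasNext();) {
            if (keys.contains(it.next().get(keyField))) {
                it.remove();
                deleted++;
            }
        }
        return deleted;
    }

    /** Models steps 5-7 once for many documents: open writer, add, close. */
    void addAll(List<Map<String, String>> newDocs) {
        sessionsOpened++;
        docs.addAll(newDocs);
    }
}

final class IncrementalUpdate {
    /** Batched update: one delete pass and one add pass per batch. */
    static void updateBatch(ToyIndex index, String keyField,
                            List<Map<String, String>> batch) {
        List<String> keys = new ArrayList<>();
        for (Map<String, String> doc : batch) {
            keys.add(doc.get(keyField));
        }
        index.deleteAll(keyField, keys); // remove previous versions first
        index.addAll(batch);             // then add the new versions
    }

    /** Builds a two-field document keyed by "id". */
    static Map<String, String> doc(String id, String body) {
        Map<String, String> d = new HashMap<>();
        d.put("id", id);
        d.put("body", body);
        return d;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        List<Map<String, String>> first = new ArrayList<>();
        first.add(doc("a", "v1"));
        first.add(doc("b", "v1"));
        updateBatch(index, "id", first);

        List<Map<String, String>> second = new ArrayList<>();
        second.add(doc("a", "v2")); // replaces the earlier "a"
        updateBatch(index, "id", second);

        // prints: 2 documents, 4 open/close cycles
        System.out.println(index.docs.size() + " documents, "
                + index.sessionsOpened + " open/close cycles");
    }
}
```

With this shape the number of open/close cycles grows with the number of
batches rather than the number of documents. The sketch assumes each key
appears at most once per batch.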


Any ideas how this could be accomplished more efficiently AND more easily?


--
Eric Jain






Re: PDFBox 0.5.6

2002-12-04 Thread Borkenhagen, Michael (ofd-ko zdfin)
Thank you very, very much!
This version is really great - it fixes most of the problems I had with
earlier versions!

-----Original Message-----
From: Ben Litchfield [mailto:[EMAIL PROTECTED]]
Sent: Friday, 29 November 2002 04:42
To: [EMAIL PROTECTED]
Subject: PDFBox 0.5.6



PDFBox version 0.5.6 is now available at http://www.pdfbox.org

PDFBox makes it easy to add PDF documents to a Lucene index.

Fixes over the last version:

-Fixed bug in LucenePDFDocument where the stream was not being closed and
small documents were not being indexed.
-Fixed a spacing issue for some PDF documents.
-Fixed error while parsing the version number.
-Fixed NullPointer in persistence example.
-Created an example Lucene IndexFiles class, modeled on the demo from Lucene.
-Fixed bug where garbage at the end of a file caused an infinite loop.
-Fixed bug in parsing boolean values with stuff at the end like "true>>".


Ben Litchfield


