Re: Indexing HTML
Hi, these are the classes I use. I only use them to extract the text, so they don't have methods for getting the document title and such. However, text extraction has worked fine for me. The HtmlParser main method takes a file path as its argument and writes the contents to a file named html.txt, which is useful when testing.

/Ronnie

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: 7 December 2002 17:12
To: Lucene Users List
Subject: Re: SV: Indexing HTML

I have had good experiences with the nekoHTML parser.

Otis

--- Leo Galambos [EMAIL PROTECTED] wrote:

I'm not sure this is a solution to your problem. However, it seems that the HTMLParser used by the IndexHTML class has problems parsing the document (there is a test class included in the jar):

    java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar org.apache.lucene.demo.html.Test f01529.txt
    Title: Webcz.cz - Power of search
    Parse Aborted: Encountered \' at line 106, column 27.
    Was expecting one of: ArgName ... TagEnd ...

/Ronnie

Hi Ronnie! I know about it, and the exception is handled well (see the log file below). I have found a better example than 1529; try this: http://com-os2.ms.mff.cuni.cz/bugs/f00034.txt

This file cannot get through the Lucene HTML parser (I have tried 1.2 and IBM JDK 1.3.1r3). The file is peculiar, i.e. it has two titles, two base tags, etc. I have no debugger here, so I cannot find the line where the bug is. If you try your magic, please let me know about the patch. :) THX

-g-

    adding save/d00320/f01516.html
    Parse Aborted: Lexical error at line 68, column 11. Encountered: \u0178 (376), after : :
    adding save/d00320/f01527.html
    Parse Aborted: Encountered = at line 83, column 48.
    Was expecting one of: ArgName ... TagEnd ...
    adding save/d00320/f01528.html

Attachments: HtmlDocument.java, HtmlParser.java

-- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: Incremental indexing
1. Open reader;
2. Delete all old documents;
3. Close reader;
4. Open writer;
5. Add all new documents;
6. Close writer.

If, before step one, you open another IndexReader, then you can continue to use it for searches while the update is in progress. If you then, after step six, open a new IndexReader to use for searches, then no searches will ever see the intermediate state when documents have been deleted but not yet re-added.

Thanks! Now all that's missing is rollback :-)

-- Eric Jain
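The reader swap described above can be sketched without Lucene at all. The following is only an in-memory stand-in, not the Lucene API: `SnapshotIndex`, `deleteAll`, `addAll`, `reopenSearchView` and `search` are hypothetical names invented for this illustration. The working map plays the role of the index being updated, and the published snapshot plays the role of the IndexReader that searches keep using until a new one is opened after step six. It assumes a single updater thread.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/** In-memory stand-in for an index (hypothetical; NOT the Lucene API). */
class SnapshotIndex {
    // Working copy: plays the role of the index that is being updated.
    private final Map<String, String> working = new HashMap<>();

    // Published snapshot: plays the role of the IndexReader used for searches.
    private final AtomicReference<Map<String, String>> searchView =
            new AtomicReference<>(Collections.<String, String>emptyMap());

    /** Steps 1-3: delete the old versions of the given documents. */
    void deleteAll(List<String> ids) {
        for (String id : ids) {
            working.remove(id);
        }
    }

    /** Steps 4-6: add the new versions of the documents. */
    void addAll(Map<String, String> docs) {
        working.putAll(docs);
    }

    /** After step 6: "open a new reader" by publishing a fresh, immutable copy. */
    void reopenSearchView() {
        searchView.set(Collections.unmodifiableMap(new HashMap<>(working)));
    }

    /** Searches only ever read the last published snapshot. */
    String search(String id) {
        return searchView.get().get(id);
    }

    public static void main(String[] args) {
        SnapshotIndex index = new SnapshotIndex();
        index.addAll(Collections.singletonMap("doc1", "old text"));
        index.reopenSearchView();

        index.deleteAll(Arrays.asList("doc1"));
        System.out.println(index.search("doc1")); // still the old version

        index.addAll(Collections.singletonMap("doc1", "new text"));
        index.reopenSearchView();
        System.out.println(index.search("doc1")); // now the new version
    }
}
```

Because the snapshot is replaced only after both the deletes and the adds are done, a search between the two phases still sees the old document, which mirrors the behaviour described for the old IndexReader.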
Re: Keyword fields which don't contribute to a document's score?
Thanks. I'll take a look.

From: Doug Cutting [EMAIL PROTECTED]
Reply-To: Lucene Users List [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Subject: Re: Keyword fields which don't contribute to a document's score?
Date: Fri, 06 Dec 2002 15:27:42 -0800

In the pre-release version available in the nightly builds you can boost document fields at index time. Check out the CHANGES.txt file for details.

Doug

Ashley Collins wrote:

Is it possible to stop keyword fields contributing to a document's score? Leaving only text fields? Is the best way to boost the terms I know are keyword fields by small numbers? e.g. sender:[EMAIL PROTECTED]^0.001

Thanks.

Ashley
larm and lucene?
Has anyone out there successfully implemented LARM with Lucene? I have been poring over the LARM source (since there's no external documentation) with little success getting it to behave properly (controlling its spidering behavior and the paths traversed), and even less luck in determining where I should put my Lucene hooks into the LARM source. Any suggestions or pointers appreciated.

Dominic
madison.com
Re: larm and lucene?
I believe the place to hook Lucene into LARM is in FetcherMain, where LuceneStorage should be created. I have used it and it created the index successfully. I never wrote any code to search that index.

Otis

--- host unknown [EMAIL PROTECTED] wrote:

Has anyone out there successfully implemented LARM with Lucene? I have been poring over the LARM source (since there's no external documentation) with little success getting it to behave properly (controlling its spidering behavior and the paths traversed), and even less luck in determining where I should put my Lucene hooks into the LARM source. Any suggestions or pointers appreciated.

Dominic
madison.com
Re: larm and lucene?
I've had some problems getting the webcrawler working. I think Clemens was experimenting recently with some parts of the code for LARM the Next Generation, but I believe he'll stabilise the code in CVS.

From Clemens: "I will have a look at the code during the next days. I must admit I made some changes (esp. the hostResolver) that I did not test thoroughly inside LARM (I changed the HostManager behavior outside LARM). I will try to fix this."

Cheers,
Stephane

Otis Gospodnetic wrote:

I believe the place to hook Lucene into LARM is in FetcherMain, where LuceneStorage should be created. I have used it and it created the index successfully. I never wrote any code to search that index.

Otis

--- host unknown [EMAIL PROTECTED] wrote:

Has anyone out there successfully implemented LARM with Lucene? I have been poring over the LARM source (since there's no external documentation) with little success getting it to behave properly (controlling its spidering behavior and the paths traversed), and even less luck in determining where I should put my Lucene hooks into the LARM source. Any suggestions or pointers appreciated.

Dominic
madison.com
prevent re-indexing
Hi all,

I have a rather large file system that I'm indexing (php/html files, actually). I'm reindexing on a daily basis; however, I don't want or need to reindex 95+% of my files, since they're not going to change. Is there currently the capability to look at the last modified date and check it against the file that has already been indexed before re-indexing the file? Or is this something that needs to be implemented?

Thanks again,
Dominic
madison.com

PS. Thanks for the quick responses last time... the spider is starting to behave :-)
Re: prevent re-indexing
That's an application-specific behaviour that you need to add to your indexing app.

Otis

--- host unknown [EMAIL PROTECTED] wrote:

Hi all,

I have a rather large file system that I'm indexing (php/html files, actually). I'm reindexing on a daily basis; however, I don't want or need to reindex 95+% of my files, since they're not going to change. Is there currently the capability to look at the last modified date and check it against the file that has already been indexed before re-indexing the file? Or is this something that needs to be implemented?

Thanks again,
Dominic
madison.com

PS. Thanks for the quick responses last time... the spider is starting to behave :-)
A newbie Question
Hi all,

I was running the demo

    java org.apache.lucene.demo.IndexFiles {full-path-to-lucene}/src

and it says it will produce a subdirectory called index, but I can't find it. Does anyone know where it is kept?

Thanks,
alan
RE: prevent re-indexing
I agree with Otis on this. In your indexing application, save the time at which you started indexing. The next time you index, read the previous time back in and index only the files modified since then. This doesn't deal with deletes; that would require a bit more work.

Jonathan

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Monday, December 09, 2002 1:20 PM
To: Lucene Users List
Subject: Re: prevent re-indexing

That's an application-specific behaviour that you need to add to your indexing app.

Otis

--- host unknown [EMAIL PROTECTED] wrote:

Hi all,

I have a rather large file system that I'm indexing (php/html files, actually). I'm reindexing on a daily basis; however, I don't want or need to reindex 95+% of my files, since they're not going to change. Is there currently the capability to look at the last modified date and check it against the file that has already been indexed before re-indexing the file? Or is this something that needs to be implemented?

Thanks again,
Dominic
madison.com

PS. Thanks for the quick responses last time... the spider is starting to behave :-)
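A minimal sketch of the timestamp approach described above, using only the Java standard library. The class name `ModifiedSince`, its method names, and the idea of keeping the timestamp in a small text file are assumptions made for this illustration; none of them come from Lucene. The actual indexing call is left out: the sketch only selects the files that would need re-indexing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: selects files changed since the previous indexing run. */
class ModifiedSince {

    /** Reads the previous run's start time (epoch millis); 0 if none was recorded. */
    static long readLastRun(Path stampFile) throws IOException {
        if (!Files.exists(stampFile)) {
            return 0L;
        }
        String text = new String(Files.readAllBytes(stampFile), StandardCharsets.UTF_8);
        return Long.parseLong(text.trim());
    }

    /** Records the start time of the current run for use by the next run. */
    static void writeLastRun(Path stampFile, long startedAt) throws IOException {
        Files.write(stampFile, Long.toString(startedAt).getBytes(StandardCharsets.UTF_8));
    }

    /** Regular files under root whose last-modified time is later than sinceMillis. */
    static List<Path> changedSince(Path root, long sinceMillis) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.toFile().lastModified() > sinceMillis)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args[0]);   // directory to index
        Path stamp = Paths.get(args[1]);  // timestamp file, kept outside root
        long startedAt = System.currentTimeMillis();
        for (Path p : changedSince(root, readLastRun(stamp))) {
            System.out.println("would re-index " + p);
        }
        // Store the start time, so files modified during this run are picked up next time.
        writeLastRun(stamp, startedAt);
    }
}
```

As noted above, this does not handle deleted files; detecting those would need a comparison between the paths already in the index and the paths still on disk.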
Re: A newbie Question
If you look at the API documentation that comes with Lucene, you will find details of how to run the sample programs. Before running the search program you have to index the files you need to search. To do that, first run the indexing program, which creates a folder named 'index' in the current directory.

Regards,
srinivas

--- alex [EMAIL PROTECTED] wrote:

Hi all,

I was running the demo

    java org.apache.lucene.demo.IndexFiles {full-path-to-lucene}/src

and it says it will produce a subdirectory called index, but I can't find it. Does anyone know where it is kept?

Thanks,
alan