SV: SV: Indexing HTML

2002-12-09 Thread Ronnie Kolehmainen
Hi,

These are the classes I use. I only use them to extract text, so they
have no methods for getting the document title and such. Text extraction,
however, has worked fine for me.

The HtmlParser main method takes a file path as its argument and writes the
extracted contents to a file named html.txt, which is useful when testing.
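
For readers without the attachments, here is a minimal sketch of text-only extraction. It is not the attached HtmlParser: it assumes the HTML parser that ships with the JDK (javax.swing.text.html) is acceptable, and the class name SimpleHtmlText and its extract method are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper (not one of the attached classes): text extraction with the JDK's HTML parser. */
public class SimpleHtmlText {

    /** Returns the text chunks reported by the parser, separated by single spaces. */
    public static String extract(Reader in) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback collector = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text; tags themselves are never passed here.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        // true = ignore any charset declared in the document; the Reader has already decoded it.
        new ParserDelegator().parse(in, collector, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Lucene</h1><p>Indexing <b>HTML</b> files</p></body></html>";
        System.out.println(extract(new StringReader(html)));
    }
}
```

Like the classes described above, this sketch exposes no separate accessor for the title or other metadata; the returned string would simply be handed to whatever code builds the Lucene document.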

/Ronnie


 -Original Message-
 From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
 Sent: 7 December 2002 17:12
 To: Lucene Users List
 Subject: Re: SV: Indexing HTML


 I have had good experiences with nekoHTML parser.

 Otis

 --- Leo Galambos [EMAIL PROTECTED] wrote:
   I'm not sure this is a solution to your problem. However, it seems that the
   HTMLParser used by the IndexHTML class has problems parsing the document
   (there is a test class included in the jar):
  
  
   java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar
   org.apache.lucene.demo.html.Test f01529.txt
   Title: Webcz.cz - Power of search
   Parse Aborted: Encountered \' at line 106, column 27.
   Was expecting one of:
   ArgName ...
   TagEnd ...
   /Ronnie
 
  Hi Ronnie!
 
  I know about it and the exception is handled well (see log file below). I
  have found a better example than 1529, try this:
  http://com-os2.ms.mff.cuni.cz/bugs/f00034.txt This file cannot get through
  the Lucene HTML parser (I have tried 1.2 and IBM JDK 1.3.1r3). The file is
  specific, i.e. it has two titles, two base tags etc.
 
  I have no debugger here, so I cannot find the line where the bug is. If
  you try your magic, please let me know about the patch. :) THX
 
  -g-
 
 
 
  adding save/d00320/f01516.html
  Parse Aborted: Lexical error at line 68, column 11.  Encountered: \u0178 (376), after :
  :
  adding save/d00320/f01527.html
  Parse Aborted: Encountered = at line 83, column 48.
  Was expecting one of:
  ArgName ...
  TagEnd ...
 
  adding save/d00320/f01528.html
 
 
 
  --
  To unsubscribe, e-mail:
  mailto:[EMAIL PROTECTED]
  For additional commands, e-mail:
  mailto:[EMAIL PROTECTED]
 


 __
 Do you Yahoo!?
 Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
 http://mailplus.yahoo.com


HtmlDocument.java
Description: Binary data


HtmlParser.java
Description: Binary data


Re: Incremental indexing

2002-12-09 Thread Eric Jain
1. Open reader;
2. Delete all old documents;
3. Close reader;
4. Open writer;
5. Add all new documents;
6. Close writer.
 
 If, before step one, you open another IndexReader, then you can continue 
 to use it for searches while the update is in progress.  If you then, 
 after step six, open a new IndexReader to use for searches, then no 
 searches will ever see the intermediate state when documents have been 
 deleted but not yet re-added.

Thanks! Now all that's missing is rollback :-)
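
The reader-swap idea quoted above can be modeled without Lucene. The following is only a toy sketch under stated assumptions: an immutable list stands in for the index, an AtomicReference stands in for "the IndexReader currently used for searches", and the class name SnapshotSwap and its methods are hypothetical. It illustrates why searches never see the deleted-but-not-yet-re-added state; it is not Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of the reader-swap pattern; plain collections stand in for a Lucene index. */
public class SnapshotSwap {

    // Plays the role of the IndexReader that searches currently use.
    private final AtomicReference<List<String>> searchView =
            new AtomicReference<>(Collections.<String>emptyList());

    /** Searches only ever see a complete, published snapshot. */
    public List<String> search() {
        return searchView.get();
    }

    /** Models steps 1-6, then publishes the result in a single step. */
    public void update(List<String> oldDocs, List<String> newDocs) {
        List<String> working = new ArrayList<>(searchView.get());
        working.removeAll(oldDocs); // steps 1-3: the intermediate state exists only in this copy
        working.addAll(newDocs);    // steps 4-6: add the new documents
        // Counterpart of opening a new IndexReader after step six.
        searchView.set(Collections.unmodifiableList(working));
    }
}
```

A caller that fetched search() before update() keeps its old, still consistent view, much like a searcher that holds on to the previously opened reader.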


--
Eric Jain






Re: Keyword fields which don't contribute to a document's score?

2002-12-09 Thread Ashley Collins


Thanks. I'll take a look.


From: Doug Cutting [EMAIL PROTECTED]
Reply-To: Lucene Users List [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Subject: Re: Keyword fields which don't contribute to a document's score?
Date: Fri, 06 Dec 2002 15:27:42 -0800

In the pre-release version available in the nightly builds you can boost 
document fields at index time.  Check out the CHANGES.txt file for details.

Doug

Ashley Collins wrote:

Is it possible to stop keyword fields contributing to a document's score? 
Leaving only text fields?

Is the best way to boost the terms I know are keyword fields by small 
numbers?

e.g. sender:[EMAIL PROTECTED]^0.001

Thanks.
Ashley




_
MSN 8 with e-mail virus protection service: 2 months FREE* 
http://join.msn.com/?page=features/virus





larm and lucene?

2002-12-09 Thread host unknown
Has anyone out there successfully used LARM with Lucene?

I have been poring over the LARM source (since there's no external 
documentation) with little success getting it to behave properly 
(controlling its spidering behavior and the paths it traverses), and even 
less luck determining where I should put my Lucene hooks in the LARM source.

Any suggestions or pointers appreciated.
Dominic
madison.com




Re: larm and lucene?

2002-12-09 Thread Otis Gospodnetic
I believe the place to hook Lucene into LARM is in FetcherMain, where
LuceneStorage should be created.  I have used it and it created the
index successfully.  I never wrote any code to search that index.

Otis


 






Re: larm and lucene?

2002-12-09 Thread stephane vaucher
I've had some problems getting the webcrawler working. I think Clemens was experimenting recently with some parts of the code for LARM the Next Generation, but I believe he'll stabilise the code in CVS.

from Clemens
I will have a look at the code during the next days. I must admit I made some changes (esp. the hostResolver) that I did
not test thoroughly inside LARM (I changed the HostManager behavior outside LARM). I will try to fix this.
/from Clemens

Cheers,
Stephane


Otis Gospodnetic wrote:


I believe the place to hook Lucene into LARM is in FetcherMain, where
LuceneStorage should be created.  I have used it and it created the
index successfully.  I never wrote any code to search that index.

Otis







prevent re-indexing

2002-12-09 Thread host unknown
Hi all,

I have a rather large file system that I'm indexing (php/html files 
actually).  I'm reindexing on a daily basis; however, I don't want or need 
to reindex 95+% of my files since they're not going to change.

Is there currently a capability to look at the last modified date and 
check it against the file that has already been indexed before re-indexing 
the file?  Or is this something that needs to be implemented?

Thanks again,
Dominic
madison.com

PS.  Thanks for the quick responses last time...the spider is starting to 
behave :-)








Re: prevent re-indexing

2002-12-09 Thread Otis Gospodnetic
That's application-specific behaviour that you need to add to your
indexing app.

Otis

 






A newbie Question

2002-12-09 Thread alex
hi all

I was running the demo
java org.apache.lucene.demo.IndexFiles {full-path-to-lucene}/src
and it says it will produce a subdirectory called index, but I can't find
it. Does anyone know where it is kept?

thanks

alan






RE: prevent re-indexing

2002-12-09 Thread Jonathan Reichhold
I agree with Otis on this.  In your indexing application, save the time
at which you started indexing.  On the next run, read that time back in
and index only the files modified since then.  This doesn't deal with
deletes, but that would require a bit more work.
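
A minimal sketch of that bookkeeping, using only the JDK. The class name ModifiedSince, its method names, and the idea of keeping the timestamp in a small text file are assumptions made for illustration; nothing here is part of Lucene.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: remembers when the last run started and finds files changed since then. */
public class ModifiedSince {

    /** Reads the saved start time of the previous run, or 0 if no run has been recorded. */
    public static long readLastRun(Path stampFile) throws IOException {
        if (!Files.exists(stampFile)) {
            return 0L;
        }
        String saved = new String(Files.readAllBytes(stampFile), StandardCharsets.UTF_8);
        return Long.parseLong(saved.trim());
    }

    /** Saves the start time (milliseconds since the epoch) of the current run. */
    public static void writeLastRun(Path stampFile, long startMillis) throws IOException {
        Files.write(stampFile, Long.toString(startMillis).getBytes(StandardCharsets.UTF_8));
    }

    /** Recursively collects regular files under dir modified at or after sinceMillis. */
    public static List<File> modifiedSince(File dir, long sinceMillis) {
        List<File> result = new ArrayList<>();
        File[] children = dir.listFiles();
        if (children == null) {
            return result; // not a directory, or not readable
        }
        for (File f : children) {
            if (f.isDirectory()) {
                result.addAll(modifiedSince(f, sinceMillis));
            } else if (f.lastModified() >= sinceMillis) {
                result.add(f);
            }
        }
        return result;
    }
}
```

A run would note System.currentTimeMillis() before scanning, pass each returned file to the existing indexing code, and call writeLastRun only after the index has been written successfully. As noted above, files that were deleted are not detected this way, and a changed file's old entry would still have to be removed from the index before the new one is added.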

Jonathan

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]] 
Sent: Monday, December 09, 2002 1:20 PM
To: Lucene Users List
Subject: Re: prevent re-indexing


That's an application specific behaviour that you need to add to your
indexing app.

Otis

 






Re: A newbie Question

2002-12-09 Thread M Srinivas Rao
If you look at the API documentation that comes with Lucene, you will
find details of how to run the sample programs.
Before running the search program you have to index the files you
want to search. To do that, first run the indexing program, which
creates a folder named 'index' in the current directory.
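
To see which directory that is, the following small sketch prints the JVM's working directory and the absolute location that a relative name such as "index" resolves to. The class name WhereIsIndex is hypothetical; run it from the same directory and in the same way as the demo.

```java
import java.io.File;

/** Hypothetical helper: shows where a relative directory name ends up. */
public class WhereIsIndex {

    /** Resolves a relative name the way the JVM does, against the working directory. */
    public static File resolve(String relativeName) {
        return new File(relativeName).getAbsoluteFile();
    }

    public static void main(String[] args) {
        System.out.println("Working directory: " + System.getProperty("user.dir"));
        System.out.println("Index expected at: " + resolve("index"));
    }
}
```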

rgds
srinivas

 

