Re: Incrementally updating and monitoring the index

2004-02-16 Thread lucene
On Friday 13 February 2004 19:10, Stephane James Vaucher wrote: > Very possible, before adding a document, you can check (with the judicious > use of an id) if it has already been added. If it hasn't, do your > notification, but this requires programming. So you mean adding the new documents to a

Re: Did you mean...

2004-02-16 Thread lucene
On Thursday 12 February 2004 18:35, Viparthi, Kiran (AFIS) wrote: > As mentioned the only way I can see is to get the output of the analyzer > directly as a TokenStream, iterate through it and insert it into a Map. Could you provide or point me to some example code on how to get and use TokenStreams?
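The pattern being asked about — walking a stream of tokens and collecting them into a Map — can be sketched in plain Java. This is a stand-in, not Lucene code: in Lucene 1.x one would obtain the stream via analyzer.tokenStream(reader) and read each Token, whereas here a simple whitespace split plays the analyzer's role, and the class name TokenMap is made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TokenMap {
    // Collect token counts into a Map, analogous to iterating a Lucene
    // TokenStream and inserting each Token's text. A plain whitespace
    // split stands in for the analyzer here.
    static Map<String, Integer> tokensToMap(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum); // increment, starting at 1
        }
        return counts;
    }
}
```

With the Map in hand, the keys give the unique tokens and the values their frequencies, which is all a "Did you mean..." candidate list needs.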

RE: Did you mean...

2004-02-16 Thread Viparthi, Kiran (AFIS)
Hi Timo, I was referring to your previous code: you can collect all the text from the terms. IndexReader reader = IndexReader.open(ram); TermEnum te = reader.terms(); StringBuffer sb = new StringBuffer(); while(te.next()) { Term t = te.term(); sb.append(t.text()); } And you can get the tokens using the analyzer.
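Once the index's terms have been collected as above, a common way to turn them into "Did you mean..." suggestions — not shown in this thread, so this is an assumed next step — is to compare the user's word against the term list by edit distance. The class and method names below are hypothetical; only the Levenshtein algorithm itself is standard.

```java
import java.util.List;

public class DidYouMean {
    // Classic Levenshtein edit distance, two-row dynamic programming.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Suggest the index term closest to a (possibly misspelled) query word.
    static String suggest(String word, List<String> indexTerms) {
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        for (String term : indexTerms) {
            int d = editDistance(word, term);
            if (d < bestDist) { bestDist = d; best = term; }
        }
        return best;
    }
}
```

For a real index this linear scan over all terms is the naive approach; it is fine for small term lists but would want n-gram prefiltering at scale.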

thanks for your mail

2004-02-16 Thread [EMAIL PROTECTED]
Received your mail we will get back to you shortly - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]

Re: Did you mean...

2004-02-16 Thread lucene
On Monday 16 February 2004 12:02, Viparthi, Kiran (AFIS) wrote: > As mentioned I didn't use any information from the index, so I didn't use any > TokenStream, but let me check it out. deprecated: String description = doc.getField("contents").stringValue(); final java.io.Reader r = new StringReader(description);

Re: Did you mean...

2004-02-16 Thread Erik Hatcher
On Feb 16, 2004, at 6:12 AM, [EMAIL PROTECTED] wrote: On Monday 16 February 2004 12:02, Viparthi, Kiran (AFIS) wrote: As mentioned I didn't use any information from the index, so I didn't use any TokenStream, but let me check it out. deprecated: String description = doc.getField("contents").stringValue();

SearchBlox J2EE Search Component Version 1.2 released

2004-02-16 Thread Robert Selvaraj
SearchBlox is a J2EE Search Component that delivers out-of-the-box search functionality for quick integration with your websites, applications, intranets and portals. SearchBlox uses the Lucene Search API and incorporates integrated HTTP and File System crawlers, support for various document fo

Re: SearchBlox J2EE Search Component Version 1.2 released

2004-02-16 Thread Eric Jain
> - Support for PowerPoint documents May I ask how you extract text from PowerPoint documents? Any open source tool, or your own code?

Re: SearchBlox J2EE Search Component Version 1.2 released

2004-02-16 Thread Robert Selvaraj
This is our own code. Eric Jain wrote: > - Support for PowerPoint documents May I ask how you extract text from PowerPoint documents? Any open source tool, or your own code?

Re: Did you mean...

2004-02-16 Thread lucene
On Monday 16 February 2004 12:40, Erik Hatcher wrote: > On Feb 16, 2004, at 6:12 AM, [EMAIL PROTECTED] wrote: > > String description = doc.getField("contents").stringValue(); > > What is the value of description here? The value of the field "contents" :-) Long, plain text. > > final java.io.Reader r = new StringReader(description);

Re: Did you mean...

2004-02-16 Thread Erik Hatcher
On Feb 16, 2004, at 7:59 AM, [EMAIL PROTECTED] wrote: On Monday 16 February 2004 12:40, Erik Hatcher wrote: On Feb 16, 2004, at 6:12 AM, [EMAIL PROTECTED] wrote: String description = doc.getField("contents").stringValue(); What is the value of description here? ? The value of the field "contents" :

Re: Did you mean...

2004-02-16 Thread lucene
On Monday 16 February 2004 15:16, Erik Hatcher wrote: > And thus the nature of the problem. Try using the WhitespaceAnalyzer > instead to see what you get. Much better! :-) But sometimes it still returns multiple words as a single term...:-\ And it does not care for punctuation, but that's probably to be expected from a whitespace tokenizer.

Re: Did you mean...

2004-02-16 Thread lucene
On Monday 16 February 2004 15:27, [EMAIL PROTECTED] wrote: > But sometimes it still returns multiple words as a single term...:-\ Sorry, silly mistake of mine.

Re: Did you mean...

2004-02-16 Thread lucene
On Monday 16 February 2004 12:12, [EMAIL PROTECTED] wrote: > deprecated: > > String description = doc.getField("contents").stringValue(); > final java.io.Reader r = new StringReader(description); > final TokenStream in = analyzer.tokenStream(r); > for (Token token; (token = in.next()) != null; ) >

Re: Did you mean...

2004-02-16 Thread lucene
On Monday 16 February 2004 15:16, Erik Hatcher wrote: > And thus the nature of the problem. Try using the WhitespaceAnalyzer > instead to see what you get. Can I chain multiple analyzers in order to filter common stop words? Timo --

Word not in index

2004-02-16 Thread lucene
Hi! I build a list of all unique words in all my docs from WhitespaceAnalyzer.tokenStream(). I also index all my docs using a GermanAnalyzer in another index. There are plenty of words in the word list that don't return any hits when searching the doc index built using the GermanAnalyzer

Re: Word not in index

2004-02-16 Thread lucene
On Monday 16 February 2004 19:20, [EMAIL PROTECTED] wrote: > Why is this? Another curiosity is that apparently the case does matter: "albert" (Einstein :) does return hits, but "Albert" does not, even though the docs contain "Albert" and not "albert". Can somebody explain? Thanks! Timo

Re: Word not in index

2004-02-16 Thread Markus Spath
[EMAIL PROTECTED] wrote: Hi! I do build a list of all unique words in all my docs from WhitespaceAnalyzer.tokenStream(). I also do index all my docs using a GermanAnalyzer in another index. There are plenty of word in the word list that don't return any hits when searching the doc index built u

Re: Word not in index

2004-02-16 Thread Otis Gospodnetic
Searches ARE case sensitive, it is just that some Analyzers lowercase all tokens. If you are using WhitespaceAnalyzer, then tokens will not be lowercased, so a search for albert and Albert may yield different results. Otis --- [EMAIL PROTECTED] wrote: > On Monday 16 February 2004 19:20, [EMAIL P
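Otis's point can be made concrete with a toy exact-match "index" in plain Java — a hypothetical CaseDemo class, not Lucene API. Lookups against the term dictionary are always case sensitive; whether "Albert" and "albert" both hit depends entirely on whether tokens were lowercased at index time, and the query must be normalized the same way.

```java
import java.util.HashMap;
import java.util.Map;

public class CaseDemo {
    // A toy "index": term -> count. Like Lucene's term dictionary,
    // lookups are exact and case sensitive; normalization happens (or
    // doesn't) during analysis, before terms reach the index.
    static Map<String, Integer> index(String text, boolean lowercase) {
        Map<String, Integer> idx = new HashMap<>();
        for (String tok : text.split("\\s+")) {
            if (lowercase) tok = tok.toLowerCase(); // index-time normalization
            idx.merge(tok, 1, Integer::sum);
        }
        return idx;
    }

    static boolean hit(Map<String, Integer> idx, String query, boolean lowercase) {
        if (lowercase) query = query.toLowerCase(); // must match index-time choice
        return idx.containsKey(query);
    }
}
```

With lowercase=false (the WhitespaceAnalyzer situation) only the exact-case query matches; with lowercase=true on both sides, either case of the query finds the document.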

Re: Word not in index

2004-02-16 Thread lucene
On Monday 16 February 2004 19:57, Otis Gospodnetic wrote: > Searches ARE case sensitive, it is just that some Analyzers lowercase > all tokens. If you are using WhitespaceAnalyzer, then tokens will not be lowercased. GermanAnalyzer apparently is one of them. Too bad :-( Is there a case-sensitive alternative out there?

Re: Word not in index

2004-02-16 Thread lucene
On Monday 16 February 2004 19:45, Markus Spath wrote: > Analyzers preprocess the text to be indexed; different Analyzers will > generate different text-tokens that are indexed. only you can know which > Analyzer fits your needs, but you need to apply this one consistently for > indexing, searching

Re: Word not in index

2004-02-16 Thread Otis Gospodnetic
Custom? :) Otis --- [EMAIL PROTECTED] wrote: > On Monday 16 February 2004 19:57, Otis Gospodnetic wrote: > > Searches ARE case sensitive, it is just that some Analyzers lowercase > > all tokens. If you are using WhitespaceAnalyzer, then tokens will not > GermanAnalyzer apparently is one of them.

Re: Word not in index

2004-02-16 Thread Otis Gospodnetic
Timo, by the nature of your questions it seems like you didn't see the Articles section of Lucene's site. There are links to several articles there. A few of them explain indexing (intro + more advanced), at least one explains QueryParser and maybe Analyzer, and a few explain vanilla searching.

Re: Did you mean...

2004-02-16 Thread Erik Hatcher
On Feb 16, 2004, at 9:50 AM, [EMAIL PROTECTED] wrote: Can somebody explain tokenStream() to me? You are now venturing under the covers of Lucene's API. This is where I give the sage advice to get the Lucene source code and surf around it a bit. (It helps to have a nice IDE where you can click a

Re: Did you mean...

2004-02-16 Thread Erik Hatcher
On Feb 16, 2004, at 10:34 AM, [EMAIL PROTECTED] wrote: On Monday 16 February 2004 15:16, Erik Hatcher wrote: And thus the nature of the problem. Try using the WhitespaceAnalyzer instead to see what you get. Can I chain multiple analyzers in order to filter common stop words? You cannot chain Analyzers.
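What Lucene offers instead of chained Analyzers is a pipeline inside one Analyzer: a Tokenizer followed by TokenFilters (e.g. LowerCaseFilter, StopFilter). The sketch below mimics that pipeline in plain Java — the FilterChain class and its hard-coded stop list are inventions for illustration, with each step commented with the Lucene class it stands in for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class FilterChain {
    // A tiny stop list for the example; Lucene ships real ones per language.
    static final Set<String> STOP = Set.of("the", "a", "an", "and", "or");

    // Whitespace tokenize -> lowercase -> drop stop words, in one pass.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.split("\\s+")) {           // ~ WhitespaceTokenizer
            tok = tok.toLowerCase();                      // ~ LowerCaseFilter
            if (!tok.isEmpty() && !STOP.contains(tok)) {  // ~ StopFilter
                out.add(tok);
            }
        }
        return out;
    }
}
```

In real Lucene the same composition lives in a custom Analyzer's tokenStream() method, wrapping each filter around the previous stream rather than looping explicitly.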

Re: Word not in index

2004-02-16 Thread Erik Hatcher
Timo, You are asking a lot of good questions, but also questions for which answers already exist. Just dig a little deeper and you will see. Have a look at my java.net article (titled "Lucene Intro") and you will find utility code that highlights how analyzers work. Tinker with that a bit.

Re: thanks for your mail

2004-02-16 Thread Leo Galambos
Could an admin filter out hema's e-mails, please? THX Leo [EMAIL PROTECTED] wrote: Received your mail we will get back to you shortly - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]

Re: Word not in index

2004-02-16 Thread lucene
On Monday 16 February 2004 20:56, Otis Gospodnetic wrote: > Timo, by the nature of your questions it seems like you didn't see the > Articles section of Lucene's site. There are links to several articles > > --- [EMAIL PROTECTED] wrote: > > Well, not sure whether I understood. Well, was actually

Re: 'Sponsored' links

2004-02-16 Thread Daniel B. Davis
The index contains documents, an unknown number of which are sponsored. The number of sponsors is small, though not necessarily the number of sponsored documents. In all of #1, #2, and #3, the sponsorship information must be accessed to determine sponsorship, and that information is indeed outside of the primary

Re: 'Sponsored' links

2004-02-16 Thread Doug Cutting
Daniel B. Davis wrote: Are there other strategies not considered? Why not store sponsored documents in a separate index, separately searched, whose results are placed above those from the non-sponsored documents? Doug
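Doug's two-index approach reduces the presentation step to a simple merge: run each search independently, then place the sponsored hits ahead of the organic ones. A minimal sketch, assuming each list is already ranked by its own search (the SponsoredMerge name and the string-typed hits are placeholders; real code would merge Lucene Hits from two searchers):

```java
import java.util.ArrayList;
import java.util.List;

public class SponsoredMerge {
    // Concatenate two independently ranked result lists, sponsored first.
    // Each list would come from a separate search over its own index.
    static List<String> merge(List<String> sponsoredHits, List<String> organicHits) {
        List<String> results = new ArrayList<>(sponsoredHits);
        results.addAll(organicHits);
        return results;
    }
}
```

The appeal of the design is that sponsorship never has to be stored or checked per document at query time; membership in the sponsored index is the flag.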