On Friday 13 February 2004 19:10, Stephane James Vaucher wrote:
> Very possible, before adding a document, you can check (with the judicious
> use of an id) if it has already been added. If it hasn't, do your
> notification, but this requires programming.
So you mean adding the new documents to a
On Thursday 12 February 2004 18:35, Viparthi, Kiran (AFIS) wrote:
> As mentioned the only way I can see is to get the output of the analyzer
> directly as a TokenStream
> iterate through it and insert it into a Map.
Could you provide or point me to some example code on how to get and use a
TokenStream?
Hi Timo,
Referring to your previous code: you can collect all the text from the
terms like this.
IndexReader reader = IndexReader.open(ram);
TermEnum te = reader.terms();
StringBuffer sb = new StringBuffer();
while (te.next())
{
    Term t = te.term();
    sb.append(t.text()).append(' ');  // separate the terms
}
te.close();
reader.close();
And you can get the tokens using the analyzer's tokenStream().
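Since the thread never shows a complete version of the Map idea Kiran
describes, here is a rough JDK-only sketch of it. This is plain string
splitting standing in for a real Analyzer's TokenStream; class and method
names are illustrative only:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch (not Lucene API): collect tokens into a Map, as suggested above.
// A real WhitespaceAnalyzer/TokenStream would do the actual tokenizing.
public class TokenCollector {
    // Split on whitespace and count each token, keeping first-seen order.
    public static Map<String, Integer> collect(String text) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String tok : text.trim().split("\\s+")) {
            if (tok.isEmpty()) continue;   // guard against empty input
            Integer c = counts.get(tok);
            counts.put(tok, c == null ? 1 : c + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(collect("to be or not to be"));
    }
}
```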
Received your mail we will get back to you shortly
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
On Monday 16 February 2004 12:02, Viparthi, Kiran (AFIS) wrote:
> As mentioned, I didn't use any information from the index, so I didn't use
> any TokenStream, but let me check it out.
deprecated:
String description = doc.getField("contents").stringValue();
final java.io.Reader r = new StringReader(description);
On Feb 16, 2004, at 6:12 AM, [EMAIL PROTECTED] wrote:
On Monday 16 February 2004 12:02, Viparthi, Kiran (AFIS) wrote:
As mentioned, I didn't use any information from the index, so I didn't use
any TokenStream, but let me check it out.
deprecated:
String description = doc.getField("contents").stringValue();
SearchBlox is a J2EE Search Component that delivers out-of-the-box
search functionality for quick integration with your websites,
applications, intranets and portals. SearchBlox uses the Lucene Search
API and incorporates integrated HTTP and File System crawlers, support
for various document formats.
> - Support for PowerPoint documents
May I ask how you extract text from PowerPoint documents? Any open
source tool, or your own code?
This is our own code.
Eric Jain wrote:
- Support for PowerPoint documents
May I ask how you extract text from PowerPoint documents? Any open
source tool, or your own code?
On Monday 16 February 2004 12:40, Erik Hatcher wrote:
> On Feb 16, 2004, at 6:12 AM, [EMAIL PROTECTED] wrote:
> > String description = doc.getField("contents").stringValue();
>
> What is the value of description here?
? The value of the field "contents" :-) Long, plain text..
> > final java.io.Reader r = new StringReader(description);
On Feb 16, 2004, at 7:59 AM, [EMAIL PROTECTED] wrote:
On Monday 16 February 2004 12:40, Erik Hatcher wrote:
On Feb 16, 2004, at 6:12 AM, [EMAIL PROTECTED] wrote:
String description = doc.getField("contents").stringValue();
What is the value of description here?
? The value of the field "contents" :-) Long, plain text..
On Monday 16 February 2004 15:16, Erik Hatcher wrote:
> And thus the nature of the problem. Try using the WhitespaceAnalyzer
> instead to see what you get.
Much better! :-) But sometimes it still returns multiple words as a single
term... :-\
And it does not care for punctuation, but that's probably expected.
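The punctuation behaviour Timo observes falls directly out of
whitespace-only tokenization; a minimal plain-Java illustration (not the
Lucene WhitespaceAnalyzer itself, just the same splitting rule):

```java
import java.util.Arrays;
import java.util.List;

// Sketch: why a whitespace-only tokenizer "does not care for punctuation".
// Splitting on whitespace alone leaves punctuation glued to the words.
public class PunctuationDemo {
    public static List<String> whitespaceSplit(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // "Hello," stays one token, comma and all.
        System.out.println(whitespaceSplit("Hello, world."));
    }
}
```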
On Monday 16 February 2004 15:27, [EMAIL PROTECTED] wrote:
> But sometimes it still returns multiple words as a single term...:-\
Sorry, silly mistake of mine.
On Monday 16 February 2004 12:12, [EMAIL PROTECTED] wrote:
> deprecated:
>
> String description = doc.getField("contents").stringValue();
> final java.io.Reader r = new StringReader(description);
> final TokenStream in = analyzer.tokenStream(r);
> for (Token token; (token = in.next()) != null; ) {
>     // process each token here
> }
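As a runnable stand-in for the quoted loop (which needs a Lucene Analyzer),
the same Reader-to-token pattern can be sketched with only the JDK, using
java.io.StreamTokenizer in place of analyzer.tokenStream(r):

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// JDK-only sketch of the Lucene pattern: wrap text in a Reader, then pull
// tokens in a loop until the stream is exhausted.
public class ReaderTokens {
    public static List<String> tokens(String text) {
        StreamTokenizer in = new StreamTokenizer(new StringReader(text));
        in.resetSyntax();               // clear the default syntax table
        in.wordChars('a', 'z');         // letters form word tokens
        in.wordChars('A', 'Z');
        in.whitespaceChars(0, ' ');     // controls and space are separators
        List<String> out = new ArrayList<String>();
        try {
            // Same shape as the Lucene loop: pull tokens until EOF.
            while (in.nextToken() != StreamTokenizer.TT_EOF) {
                if (in.ttype == StreamTokenizer.TT_WORD) out.add(in.sval);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);  // cannot happen for StringReader
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Hello, Lucene world"));
    }
}
```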
On Monday 16 February 2004 15:16, Erik Hatcher wrote:
> And thus the nature of the problem. Try using the WhitespaceAnalyzer
> instead to see what you get.
Can I chain multiple analyzers in order to filter common stop words?
Timo
--
Hi!
I build a list of all unique words in all my docs from
WhitespaceAnalyzer.tokenStream(). I also index all my docs using a
GermanAnalyzer in another index. There are plenty of words in the word list
that don't return any hits when searching the doc index built using the
GermanAnalyzer. Why is this?
On Monday 16 February 2004 19:20, [EMAIL PROTECTED] wrote:
> Why is this?
Another curiosity is that apparently the case does matter:
"albert" (Einstein :) does return hits, but "Albert" does not, even though
the docs contain "Albert" and not "albert".
Can somebody explain?
Thanks!
Timo
[EMAIL PROTECTED] wrote:
Hi!
I build a list of all unique words in all my docs from
WhitespaceAnalyzer.tokenStream(). I also index all my docs using a
GermanAnalyzer in another index. There are plenty of words in the word list
that don't return any hits when searching the doc index built using the
GermanAnalyzer.
Searches ARE case sensitive, it is just that some Analyzers lowercase
all tokens. If you are using WhitespaceAnalyzer, then tokens will not
be lowercased, so a search for albert and Albert may yield different
results.
Otis
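Otis's point can be illustrated without Lucene: if the analyzer lowercases
tokens at index time, the stored terms only match a lowercased query. A
small sketch, where the hypothetical indexLowercased() stands in for a
lowercasing analyzer such as GermanAnalyzer:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Sketch: the index matches tokens verbatim; whether "Albert" finds
// anything depends entirely on what the analyzer stored.
public class CaseDemo {
    // Index-time: a lowercasing analyzer stores only lowercase tokens.
    public static Set<String> indexLowercased(String doc) {
        Set<String> terms = new HashSet<String>();
        for (String tok : doc.split("\\s+")) {
            terms.add(tok.toLowerCase(Locale.GERMAN));
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> index = indexLowercased("Albert Einstein");
        System.out.println(index.contains("albert")); // true
        System.out.println(index.contains("Albert")); // false: index holds "albert"
    }
}
```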
--- [EMAIL PROTECTED] wrote:
> On Monday 16 February 2004 19:20, [EMAIL PROTECTED] wrote:
On Monday 16 February 2004 19:57, Otis Gospodnetic wrote:
> Searches ARE case sensitive, it is just that some Analyzers lowercase
> all tokens. If you are using WhitespaceAnalyzer, then tokens will not
GermanAnalyzer apparently is one of them. Too bad :-( Is there a
case-sensitive alternative out there?
On Monday 16 February 2004 19:45, Markus Spath wrote:
> Analyzers preprocess the text to be indexed; different Analyzers will
> generate different text-tokens that are indexed. only you can know which
> Analyzer fits your needs, but you need to apply this one consistently for
> indexing and searching.
Custom? :)
Otis
--- [EMAIL PROTECTED] wrote:
> On Monday 16 February 2004 19:57, Otis Gospodnetic wrote:
> > Searches ARE case sensitive, it is just that some Analyzers lowercase
> > all tokens. If you are using WhitespaceAnalyzer, then tokens will not
>
> GermanAnalyzer apparently is one of them.
Timo, by the nature of your questions it seems like you didn't see the
Articles section of Lucene's site. There are links to several articles
there. A few of them explain indexing (intro + more advanced), at
least one explains QueryParser and maybe Analyzer, and a few explain
vanilla searching.
On Feb 16, 2004, at 9:50 AM, [EMAIL PROTECTED] wrote:
Can somebody explain tokenStream() to me?
You are now venturing under the covers of Lucene's API. This is where
I give the sage advice to get the Lucene source code and surf around it
a bit. (It helps to have a nice IDE where you can click around.)
On Feb 16, 2004, at 10:34 AM, [EMAIL PROTECTED] wrote:
On Monday 16 February 2004 15:16, Erik Hatcher wrote:
And thus the nature of the problem. Try using the WhitespaceAnalyzer
instead to see what you get.
Can I chain multiple analyzer in order to filter common stop words?
You cannot chain Analyzers.
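Lucene's mechanism for this is composing a Tokenizer with TokenFilters
(e.g. LowerCaseFilter, StopFilter) inside a single Analyzer, rather than
chaining whole Analyzers. A plain-Java approximation of that
tokenize-then-filter pipeline, with an illustrative stop list and method
names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: the Tokenizer + TokenFilter pipeline that a custom Analyzer
// would set up, approximated with plain string operations.
public class StopFilterDemo {
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("the", "a", "and", "of"));

    public static List<String> analyze(String text) {
        List<String> out = new ArrayList<String>();
        for (String tok : text.split("\\s+")) {        // tokenizer step
            String t = tok.toLowerCase();               // lowercase filter
            if (!STOP_WORDS.contains(t)) out.add(t);    // stop-word filter
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The nature of the problem"));
    }
}
```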
Timo,
You are asking a lot of good questions, but also questions for which
answers already exist. Just dig a little deeper and you will see.
Have a look at my java.net article (titled "Lucene Intro") and you will
find utility code that highlights how analyzers work. Tinker with that a
bit.
Could an admin filter out hema's e-mails, please?
THX
Leo
[EMAIL PROTECTED] wrote:
Received your mail we will get back to you shortly
On Monday 16 February 2004 20:56, Otis Gospodnetic wrote:
> Timo, by the nature of your questions it seems like you didn't see the
> Articles section of Lucene's site. There are links to several articles
>
> --- [EMAIL PROTECTED] wrote:
> > Well, not sure whether I understood.
Well, I was actually
The index contains documents, an unknown number of which are
sponsored. The number of sponsors is small, though not necessarily the
document count. In all of #1, #2, and #3, the sponsorship information must
be accessed to determine sponsorship, and that information is indeed outside
of the primary index.
Daniel B. Davis wrote:
Are there other strategies not considered?
Why not store sponsored documents in a separate index, separately
searched, whose results are placed above those from the non-sponsored
documents?
Doug
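Doug's two-index suggestion amounts to a simple merge at result time; a
sketch with hypothetical stand-in types (plain strings here, not Lucene
Hits objects):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: search a sponsored index and a regular index separately, then
// present sponsored hits first. Strings stand in for real search results.
public class MergedResults {
    public static List<String> merge(List<String> sponsoredHits,
                                     List<String> regularHits) {
        List<String> out = new ArrayList<String>(sponsoredHits);
        out.addAll(regularHits);   // sponsored results rank above the rest
        return out;
    }

    public static void main(String[] args) {
        System.out.println(merge(Arrays.asList("ad1"),
                                 Arrays.asList("doc1", "doc2")));
    }
}
```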