RE: Probabilistic Model in Lucene - possible?

2003-12-05 Thread Adam Saltiel
Herb,
Any one game ... ?
No takers? I would be very interested, but maybe this is beyond what can be
posted on a mailing list. I'd be equally interested in any references you
may have.
While we are on this subject, how do LSI and the similar CNG (context
network graph) fit into the model used by Lucene? Could Lucene be
massaged to implement different mathematical models of search and
retrieval, and if so, how modular are the core functions?

Adam Saltiel


 -Original Message-
 From: Chong, Herb [mailto:[EMAIL PROTECTED]
 Sent: Thursday, December 04, 2003 1:53 PM
 To: Lucene Users List
 Subject: RE: Probabilistic Model in Lucene - possible?

 not all tf/idf variants are probabilistic models, but a great many are if
 the term weights are probabilities. if we just take straight, unmodified
 Term Frequency in a document, Inverse Document Frequency in the corpus,
 and the Term Frequency in the query as 1, you are in fact comparing the
 statistical properties of the query against the statistical properties of
 the document. they are probabilities you are comparing. i can't think of
 many papers that come right out and say it, but if you look at an
 individual term weight and can interpret it as a genuine probability, the
 vector space model based on the weights is a probabilistic model. the
 derivation is relatively straightforward to show it, if you have the right
 general model to start with. once you start throwing in ad hoc
 normalizations, then things get out of whack and it's no longer a
 probabilistic model.
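
For reference (this is the standard textbook form, not the specific derivation Herb mentions), the tf-idf weight under discussion is usually written as

    w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t}

where tf_{t,d} is the frequency of term t in document d, N is the number of documents in the collection, and df_t is the number of documents containing t; reading df_t / N as the probability that a random document contains t is what gives the idf factor its probabilistic reading.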

 the implementations that i have done are with a former company and that
 means secret and protected by various intellectual property rights.
 however, i can sketch here the general approach one has to take and an
 outline of the derivation that unifies probabilistic models with vector
 space models and at the same time incorporates pairwise interterm
 correlation. in fact, the pairwise interterm correlations are a
 fundamental assumption. once you do all this, you can show that the
 traditional vector space model is a special case of a pairwise interterm
 correlation model. for those that are interested in advanced matrix
 algebra and some basic statistics, it should be very interesting. if only
 i had a published paper, i would post it. unfortunately, what i have is
 very obtuse because it's protected. the only paper that started out was
 submitted to SIGIR but rejected by all but one referee. that one thought
 this was a tremendous unification of the two methods, but academic
 journals being what they are, when 4 out of 5 referees can't understand
 the paper, it doesn't get published. i may brush it off and enlarge it
 into a much longer paper for the Journal of IR, but once again, unless you
 are comfortable with probability theory and matrix theory, you are not
 going to follow it.

 so, who is game for a tutorial on the derivation?

 Herb...

 -Original Message-
 From: Karsten Konrad [mailto:[EMAIL PROTECTED]
 Sent: Thursday, December 04, 2003 5:09 AM
 To: Lucene Users List
 Subject: AW: Probabilistic Model in Lucene - possible?



 Hi Herb,

 thank you for your insights.

 
 > but by most accepted definitions, the tf/idf model in Lucene is a
 > probabilistic model.

 Can you send some pointers to help me understand that? Are all TF/IDF
 variants probabilistic models? If so, what makes any model a
 non-probabilistic one? If you claim that TF/IDF is probabilistic, then the
 plain cosine (an extreme form of TF/IDF, with IDF for all terms being
 considered constant) of VSM would also be a probabilistic model.

 > it's got strange normalizations though that doesn't allow comparisons of
 > rank values across queries.

 Lucene's internal ranking sometimes returns values > 1.0; these are then
 normalized to 1.0, adjusting other rankings accordingly. While I have
 nothing to say against this - it's a hack, but useful - it makes comparing
 the rank values across queries really difficult. It's like using different
 scales whenever you measure something different, and then not telling
 anyone about it.

 > it isn't terribly hard to make a normalized probabilistic model that
 > allows comparing of document scores across queries and assign a meaning to
 > the score. i've done it.

 Stop bragging, send us your Similarity implementation :)

 Regards,

 Karsten


 -Original Message-
 From: Chong, Herb [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, December 03, 2003 11:01 PM
 To: Lucene Users List
 Subject: RE: Probabilistic Model in Lucene - possible?


 i think i am missing the original question, but by most accepted
 definitions, the tf/idf model in Lucene is a probabilistic model. it's got
 strange normalizations though that don't allow comparisons of rank values
 across queries.

 it isn't terribly hard to make a normalized probabilistic model that
 allows comparing document scores across queries and assigning a meaning to
 the score. i've done it. however, that means abandoning idf and keeping

Re: Testing for Optimization

2003-12-05 Thread jt oob
 --- Dror Matalon [EMAIL PROTECTED] wrote:
 I believe that indexes that are optimized have only one segment. So in
 theory you could check and see that you only have one file with a
 .fdt, .fdx, etc. 

If I run `cat /index_dir/segments` on an optimized index there is only
one string in there. It matches the prefix of the files in the index
directory.

If I run the same on an un-optimized index dir then I get back several
strings.

There are more files in the optimized index dir than just the ones with
the prefix listed in the segments file.
Have I corrupted my indexes?
Can I safely delete those files which do not have the prefix listed in
the segments file?

Thanks,
jt


Download Yahoo! Messenger now for a chance to win Live At Knebworth DVDs
http://www.yahoo.co.uk/robbiewilliams

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Range Query

2003-12-05 Thread Ramrakhiani, Vikas
Hi,
When I do a range query like id:[0* to 9*], the result set excludes documents
having id 0, 90, ... i.e. the boundary values are excluded.
Is this expected, or am I going wrong somewhere?
thanks,
vikas. 
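
One way to sidestep the parser here is to build the range query in code with an explicit inclusive flag; a minimal sketch (the field name and bounds are invented, and note that terms compare lexicographically, so numeric ids should be zero-padded):

    // Sketch: an id range query with explicitly inclusive endpoints.
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.RangeQuery;

    public class InclusiveRange {
      public static RangeQuery idRange(String low, String high) {
        // the third argument = true makes both endpoints inclusive
        return new RangeQuery(new Term("id", low), new Term("id", high), true);
      }
    }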



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Probabilistic Model in Lucene - possible?

2003-12-05 Thread Shengli.Wu
Dear all,

I am interested in implementing a probabilistic model in Lucene as well.
I checked the book titled Modern Information Retrieval by Ricardo 
Baeza-Yates and Berthier Ribeiro-Neto, and it seems to me that the 
implementation is not very complicated when we use Lucene's IndexReader 
class; almost all the parameters needed are there: the total number of 
documents in the index (collection) and the number of documents containing 
a particular term. We probably still need to find a satisfactory method
of defining the weights of terms in the documents as well as in the query.
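
The two counts mentioned are available directly from IndexReader; a minimal sketch (the field and term names are invented):

    // Sketch: read the collection size and the document frequency of a term,
    // the raw ingredients of an idf-style probability estimate.
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class CollectionStats {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(args[0]);   // path to the index
        int numDocs = reader.numDocs();                   // total number of documents
        int docFreq = reader.docFreq(new Term("contents", "lucene"));
        System.out.println("P(term in a random doc) ~= " + ((double) docFreq / numDocs));
        reader.close();
      }
    }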

Cheers,

Shengli



Adam Saltiel [EMAIL PROTECTED] said:

 Herb,
 Any one game ... ?
 No takers? I would be very interested, but maybe beyond what can be
 posted in a mail list. I'd be equally interested in any references you
 may have.
 As we are on this subject how does LSI and the similar CNG (context
 network graph) fit into the model used by lucene. Could lucene be
 massaged to implement different mathematical models of search and
 retrieval, if so how modular are the core functions?
 
 Adam Saltiel
 
 

RE: Probabilistic Model in Lucene - possible?

2003-12-05 Thread Chong, Herb
anyone interested, contact me offline. whoever contacts me by the end of next week, 
i'll email an outline of the derivation and we can discuss it in private emails. i 
guarantee, you will learn something interesting about search engines.

Herb

-Original Message-
From: Adam Saltiel [mailto:[EMAIL PROTECTED]
Sent: Friday, December 05, 2003 3:46 AM
To: 'Lucene Users List'
Subject: RE: Probabilistic Model in Lucene - possible?


Herb,
Any one game ... ?
No takers? I would be very interested, but maybe beyond what can be
posted in a mail list. I'd be equally interested in any references you
may have.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Index and Field.Text

2003-12-05 Thread Grant Ingersoll
Hi,

I have seen the example SAX based XML processing in the Lucene sandbox (thanks to the 
authors for contributing!) and have successfully adapted this approach for my 
application.  The one thing that does not sit well with me is the fact that I am using 
the method Field.Text(String, String) instead of the Field.Text(String, Reader) 
version, which means I am storing the contents in the index.

Some questions:

1. Should I care?  What is the cost of storing the contents of these files versus 
using the Reader-based method?  Presumably, the index size is going to be larger, but 
will it adversely affect search time?  If yes, how much so (relatively speaking)?

2. If storing the content is going to adversely affect searching, has anyone written 
an XMLReader that extends java.io.Reader?  I guess it would need to take in the name 
of the tag(s) that you want the reader to retrieve and then override the 
java.io.Reader methods to return values based on just the tag values that I am 
interested in.  Has anyone taken this approach?  If not, does it at least seem like a 
valid approach?
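
One possible shape for question 2 (a sketch only, not the sandbox code: it buffers the selected tag text first, so it does not stream, and the tag and field names are invented):

    // Sketch: collect the text of chosen XML tags with SAX, then hand the
    // result to Lucene through the Reader-based Field.Text variant.
    import java.io.StringReader;
    import java.util.HashSet;
    import java.util.Set;
    import javax.xml.parsers.SAXParserFactory;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class TagTextExtractor extends DefaultHandler {
      private final Set wantedTags = new HashSet();     // tags whose text we keep
      private final StringBuffer text = new StringBuffer();
      private boolean inWantedTag = false;

      public TagTextExtractor(String[] tags) {
        for (int i = 0; i < tags.length; i++) wantedTags.add(tags[i]);
      }

      public void startElement(String uri, String local, String qName, Attributes atts) {
        if (wantedTags.contains(qName)) inWantedTag = true;
      }

      public void endElement(String uri, String local, String qName) {
        if (wantedTags.contains(qName)) { inWantedTag = false; text.append(' '); }
      }

      public void characters(char[] ch, int start, int length) {
        if (inWantedTag) text.append(ch, start, length);
      }

      public static void main(String[] args) throws Exception {
        TagTextExtractor handler = new TagTextExtractor(new String[] { "body" });
        SAXParserFactory.newInstance().newSAXParser().parse(args[0], handler);

        Document doc = new Document();
        // tokenized and indexed but not stored: the Reader form never stores
        doc.add(Field.Text("contents", new StringReader(handler.text.toString())));
      }
    }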

Thanks for your help!

-Grant Ingersoll



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Index and Field.Text

2003-12-05 Thread Chong, Herb
you are storing the same information both ways. the string gets analyzed and 
discarded, just like with the Reader.

Herb...

-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Friday, December 05, 2003 9:49 AM
To: [EMAIL PROTECTED]
Subject: Index and Field.Text


Hi,

I have seen the example SAX based XML processing in the Lucene sandbox (thanks to the 
authors for contributing!) and have successfully adapted this approach for my 
application.  The one thing that does not sit well with me is the fact that I am using 
the method Field.Text(String, String) instead of the Field.Text(String, Reader) 
version, which means I am storing the contents in the index.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



class definition used in Lucene

2003-12-05 Thread Shengli.Wu

hi, 

I have problems understanding some class definitions in Lucene
(see the end of this e-mail for the source code).

A class FilterIndexReader is defined at 1. 
Then FilterTermDocs is defined as a nested static class at 2.
 
At 3, 

public FilterTermDocs(TermDocs in) 

is a constructor. What I do not understand is the following: 

1. Given that FilterTermDocs is a static class, why does it have a constructor 
at 3?

2. Why can we use (TermDocs in) for the constructor at 3? Here TermDocs is 
an interface, so does that mean in is an object of a class that implements TermDocs?
Thanks in advance for your help!

Best,

Shengli


1  public class FilterIndexReader extends IndexReader {

     /** Base class for filtering {@link TermDocs} implementations. */
2    public static class FilterTermDocs implements TermDocs {
       protected TermDocs in;

3      public FilterTermDocs(TermDocs in) { this.in = in; }
     ...
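
For what it's worth, a static nested class can declare constructors like any other class, and a parameter declared with an interface type accepts any object whose class implements that interface. A small standalone illustration (all names invented):

    // Standalone sketch: a static nested class whose constructor takes an
    // interface type; any implementation of Greeter can be passed in.
    public class Outer {

      public interface Greeter {
        String greet();
      }

      // "static" only means the nested class needs no enclosing Outer instance;
      // it can still declare constructors.
      public static class GreeterWrapper implements Greeter {
        protected Greeter in;

        public GreeterWrapper(Greeter in) { this.in = in; }

        public String greet() { return "[wrapped] " + in.greet(); }
      }

      public static void main(String[] args) {
        Greeter plain = new Greeter() {       // anonymous implementation
          public String greet() { return "hello"; }
        };
        System.out.println(new GreeterWrapper(plain).greet());
      }
    }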




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Returning one result

2003-12-05 Thread Pleasant, Tracy
Ok thanks, but still I can't use the SimpleAnalyzer since it won't even
index that whole thing. I'll give TermQuery a try. Thanks.

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 04, 2003 6:18 PM
To: Lucene Users List
Subject: Re: Returning one result


You really should use a TermQuery in this case anyway, rather than 
using QueryParser.  You wouldn't have to worry about the analyzer at 
that point anyway (and I assume you're using Field.Keyword during 
indexing).

Erik


On Thursday, December 4, 2003, at 05:01  PM, Pleasant, Tracy wrote:

 Ok, I realized the SimpleAnalyzer does not index numbers, so I switched
 back to the StandardAnalyzer.

 -Original Message-
 From: Pleasant, Tracy
 Sent: Thursday, December 04, 2003 4:53 PM
 To: Lucene Users List
 Subject: Returning one result


  I am indexing a group of items and one field, id, is unique.  When the
 user clicks on a result I want just that one result to show.

  I index and search using SimpleAnalyzer.


  Query query_es = QueryParser.parse(query, "id", new SimpleAnalyzer());

  It should return only one result but returns 200.





 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Returning one result

2003-12-05 Thread Pleasant, Tracy
Maybe I should have been more clear.

static Field Keyword(String name, String value) 
  Constructs a String-valued Field that is not tokenized, but is
indexed and stored. 

I need to have it tokenized because people will search for that also and
it needs to be searchable. 

Should I have two fields - one as a keyword and one as text? 


How would I do that when I want to return search results..

Right now, in the results page it will have something like
<a href="display_record.jsp?id=AR334">Record AR334</a> 

Then in display_record.jsp:
 Searcher searcher = new IndexSearcher(index);
 String term = request.getParameter("id");

 Query query = QueryParser.parse(term, "id", new StandardAnalyzer());

 Hits hits = searcher.search(query);

Would it have to be something like:
 TermQuery query = ???

or 
 Query query = QueryParser.Term(id);

? ? ? 

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 04, 2003 6:18 PM
To: Lucene Users List
Subject: Re: Returning one result


You really should use a TermQuery in this case anyway, rather than 
using QueryParser.  You wouldn't have to worry about the analyzer at 
that point anyway (and I assume you're using Field.Keyword during 
indexing).

Erik


On Thursday, December 4, 2003, at 05:01  PM, Pleasant, Tracy wrote:

 Ok I realized teh Simple Analyzer does not index numbers, so I
switched
 back to Standard.

 -Original Message-
 From: Pleasant, Tracy
 Sent: Thursday, December 04, 2003 4:53 PM
 To: Lucene Users List
 Subject: Returning one result


  I am indexing a group of items and one field , id, is unique.  When 
 the
 user clicks on a results I want just that one result to show.

  I index and search using SimpleAnalyzer.


  Query query_es = QueryParser.parse(query, id, new
SimpleAnalyzer());

  It should return only one result but returns 200.





 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Returning one result

2003-12-05 Thread Erik Hatcher
On Friday, December 5, 2003, at 10:31  AM, Pleasant, Tracy wrote:
Ok thanks, but still I can't use the Simple analyzer since it won't 
even
index that whole thing. I 'll give TermQuery a try. Thanks.


Yes, certainly the analyzer is important for analyzed fields, but it 
is not used for Field.Keyword.  Please provide more details on the 
issue you encountered using Field.Keyword.



-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 04, 2003 6:18 PM
To: Lucene Users List
Subject: Re: Returning one result
You really should use a TermQuery in this case anyway, rather than
using QueryParser.  You wouldn't have to worry about the analyzer at
that point anyway (and I assume you're using Field.Keyword during
indexing).
	Erik

On Thursday, December 4, 2003, at 05:01  PM, Pleasant, Tracy wrote:

Ok I realized teh Simple Analyzer does not index numbers, so I
switched
back to Standard.

-Original Message-
From: Pleasant, Tracy
Sent: Thursday, December 04, 2003 4:53 PM
To: Lucene Users List
Subject: Returning one result
 I am indexing a group of items and one field , id, is unique.  When
the
user clicks on a results I want just that one result to show.
 I index and search using SimpleAnalyzer.

 Query query_es = QueryParser.parse(query, id, new
SimpleAnalyzer());
 It should return only one result but returns 200.





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: implementing a TokenFilter for aliases

2003-12-05 Thread Allen Atamer
Erik,

Below are the results of a debug run on the piece of text that I want
aliased. The token "spitline" must be recognized as "splitline", i.e. when I
do a search for "splitline", this record will come up.

1: [173] , start:1, end:2
1: [missing] , start:1, end:6
2: [hardware] , start:9, end:7
3: [for] , start:18, end:2
4: [bypass] , start:22, end:5
5: [spitline] , start:29, end:37

I also added extra debug info after the token text: the startOffset and the
endOffset. Lucene has the first token, "173", only stored; it is not indexed.
The remaining terms are tokenized, indexed and stored. Does this make a
difference?

Allen


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: implementing a TokenFilter for aliases

2003-12-05 Thread Erik Hatcher
On Friday, December 5, 2003, at 11:59  AM, Allen Atamer wrote:
Below are the results of a debug run on the piece of text that I want
aliased. The token spitline must be recognized as splitline i.e. 
when I
do a search for splitline, this record will come up.

1: [173] , start:1, end:2
1: [missing] , start:1, end:6
2: [hardware] , start:9, end:7
3: [for] , start:18, end:2
4: [bypass] , start:22, end:5
5: [spitline] , start:29, end:37
I also added extra debug info after the token text, which are the
startOffset, and the endOffset. Lucene has the first token 173 only
stored, it is not indexed. The remaining terms are tokenized, indexed 
and
stored. Does this make a difference?
I don't understand what you mean by "173" - is that output from a 
different string being analyzed?

Well, it's obvious from this output that you cannot find "spitline" 
when "splitline" is used in a search.  Your analyzer isn't working as 
you expect, I'm guessing.

	Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Index and Field.Text

2003-12-05 Thread Doug Cutting
Tatu Saloranta wrote:
Also, shouldn't there be at least 3 methods that take Readers; one for 
Text-like handling, another for UnStored, and last for UnIndexed.
How do you store the contents of a Reader?  You'd have to double-buffer 
it, first reading it into a String to store, and then tokenizing the 
StringReader.  A key feature of Reader values is that they're streamed: 
the entire value is never in RAM.  Storing a Reader value would remove 
that advantage.  The current API makes this explicit: when you want 
something streamed, you pass in a Reader, when you're willing to have 
the entire value in memory, pass in a String.
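
A minimal sketch of the double-buffering Doug describes (the helper and field names are invented):

    // Sketch: buffer a Reader into a String so the value can be stored as well
    // as tokenized; this gives up the streaming advantage of the Reader form.
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class BufferedFieldHelper {
      public static void addStoredText(Document doc, String name, Reader reader)
          throws IOException {
        StringBuffer buf = new StringBuffer();
        BufferedReader in = new BufferedReader(reader);
        char[] chunk = new char[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {   // pull the whole stream into memory
          buf.append(chunk, 0, n);
        }
        // Field.Text(String, String) both stores and tokenizes the value.
        doc.add(Field.Text(name, buf.toString()));
      }
    }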

Yes, it is a bit confusing that Text(String, String) stores its value, 
while Text(String, Reader) does not, but it is at least well documented. 
 And we cannot change it: that would break too many applications.  But 
we can put this on the list for Lucene 2.0 cleanups.

When I first wrote these static methods I meant for them to be 
constructor-like.  I wanted to have multiple Field(String, String) 
constructors, but that's not possible, so I used capitalized static 
methods instead.  I've never seen anyone else do this (capitalize any 
method but a real constructor) so I guess I didn't start a fad!  This 
should someday too be cleaned up.  Lucene was the first Java program 
that I ever wrote, and thus its style is in places non-standard.  Sorry.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: Returning one result

2003-12-05 Thread Pleasant, Tracy
What I meant is.

Say ID is Ar3453 .. well the user may want to search for Ar3453, so in
order for it to be searchable then it would have to be indexed and not a
keyword.

So after using
TermQuery query = new TermQuery(new Term("id", term));

How would I return the other fields in the document?

For instance to display a record it would get the record with the id #
and then display the title, contents, etc.




-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, December 05, 2003 11:32 AM
To: Lucene Users List
Subject: Re: Returning one result


On Friday, December 5, 2003, at 10:41  AM, Pleasant, Tracy wrote:
 Maybe I should have been more clear.

 static Field Keyword(String name, String value)
   Constructs a String-valued Field that is not tokenized, but 
 is
 indexed and stored.

 I need to have it tokenized because people will search for that also 
 and
 it needs to be searchable.

Search for *what* also?  Tokenized means that it is broken into pieces 
which will be separate terms.  For example: "see spot" is tokenized 
into "see" and "spot", and searching for either of those terms will 
match.

Just try it and see, please!  :)

 Should I have two fields - one as a keyword and one as text?

Depends on what you're doing... but an id field to me indicates 
Field.Keyword, only.

 How would I do that when I want to return search results..

  Searcher searcher = new IndexSearcher(index);
  String term = request.getParameter(id);

  Query query = QueryParser.parse(term, id, new
 StandardAnalyzer());

  Hits hits  = searcher.search(query);

 Would it have to be something like:
  TermQuery query = ???

Yes.  TermQuery query = new TermQuery(new Term("id", term));

Use searcher.search exactly as you did before.  Just don't use 
QueryParser to construct a query.
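
To make the round trip concrete, a small self-contained sketch (field names and values invented) that indexes an id with Field.Keyword and looks it up with a TermQuery:

    // Sketch: index an id with Field.Keyword and fetch it back with a TermQuery.
    // Uses a RAMDirectory so the example is self-contained.
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.RAMDirectory;

    public class KeywordLookup {
      public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        Document doc = new Document();
        doc.add(Field.Keyword("id", "AR334"));        // not analyzed; indexed and stored
        doc.add(Field.Text("title", "Record AR334")); // analyzed, indexed, stored
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.search(new TermQuery(new Term("id", "AR334")));
        System.out.println("hits: " + hits.length());            // expect 1
        System.out.println("title: " + hits.doc(0).get("title"));
        searcher.close();
      }
    }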

Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: implementing a TokenFilter for aliases

2003-12-05 Thread Allen Atamer
"173" is the ID field from a database (which we use as a primary key). For
Lucene's purposes, it only stores that field and does not index it.

The place where I put the print statements is before the actual filtering.
The goal of the AliasFilter is to replace "spitline". The debug line is in the
Tokenizer, and the filters are run afterwards, so I am not sure what is
happening inside Lucene.

I can't put the util line into the analyzer after the AliasFilter is run
because it will call recursively into tokenStream() and cause a stack
overflow. I will try to work on seeing what is happening after the AliasFilter
is run.

Allen


 -Original Message-
 From: Erik Hatcher [mailto:[EMAIL PROTECTED]
 Sent: December 5, 2003 12:23 PM
 To: Lucene Users List
 Subject: Re: implementing a TokenFilter for aliases
 
 On Friday, December 5, 2003, at 11:59  AM, Allen Atamer wrote:
  Below are the results of a debug run on the piece of text that I want
  aliased. The token spitline must be recognized as splitline i.e.
  when I
  do a search for splitline, this record will come up.
 
  1: [173] , start:1, end:2
  1: [missing] , start:1, end:6
  2: [hardware] , start:9, end:7
  3: [for] , start:18, end:2
  4: [bypass] , start:22, end:5
  5: [spitline] , start:29, end:37
 
  I also added extra debug info after the token text, which are the
  startOffset, and the endOffset. Lucene has the first token 173 only
  stored, it is not indexed. The remaining terms are tokenized, indexed
  and
  stored. Does this make a difference?
 
 I don't understand what you mean by 173 - is that output from a
 different string being analyzed?
 
 Well, it's obvious from this output that you cannot find spitline
 when splitline is used in a search.  Your analyzer isn't working as
 you expect, I'm guessing.
 
   Erik
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Returning one result

2003-12-05 Thread Pleasant, Tracy
Also what I am indexing is not a bunch of separate documents - or then
it would be easy to simply have a field called url and then the link
would go directly to that document. 

However, there is a text URL with many records
During indexing, a function parses each record and puts each into a
document with appropriate fields. 

When I go to display a particular Document (Lucene Document) I just
query the index for that unique ID rather than go through and parse
through the URL with all the records. 

Wouldn't querying the index for that unique ID be better than going
through that entire page and parsing through it - there is more room for
error that way.  

It's a long story why there isn't a database but it can't be done (don't
ask ... long story). 

-Original Message-
From: Pleasant, Tracy 
Sent: Friday, December 05, 2003 1:25 PM
To: Lucene Users List
Subject: RE: Returning one result


What I meant is.

Say ID is Ar3453 .. well the user may want to search for Ar3453, so in
order for it to be searchable then it would have to be indexed and not a
keyword.

So after using
TermQuery query = new TermQuery(new Term(id, term));

How would I return the other fields in the document?

For instance to display a record it would get the record with the id #
and then display the title, contents, etc.




-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, December 05, 2003 11:32 AM
To: Lucene Users List
Subject: Re: Returning one result


On Friday, December 5, 2003, at 10:41  AM, Pleasant, Tracy wrote:
 Maybe I should have been more clear.

 static Field Keyword(String name, String value)
   Constructs a String-valued Field that is not tokenized, but 
 is
 indexed and stored.

 I need to have it tokenized because people will search for that also 
 and
 it needs to be searchable.

Search for *what* also?  Tokenized means that it is broken into pieces 
which will be separate terms.  For example: see spot is tokenized 
into see and spot, and searching for either of those terms will 
match.

Just try it and see, please!  :)

 Should I have two fields - one as a keyword and one as text?

Depends on what you're doing... but an id field to me indicates 
Field.Keyword to me, only.

 How would I do that when I want to return search results..

  Searcher searcher = new IndexSearcher(index);
  String term = request.getParameter(id);

  Query query = QueryParser.parse(term, id, new
 StandardAnalyzer());

  Hits hits  = searcher.search(query);

 Would it have to be something like:
  TermQuery query = ???

Yes.  TermQuery query = new TermQuery(new Term(id, term));

Use searcher.search exactly as you did before.  Just don't use 
QueryParser to construct a query.

Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: implementing a TokenFilter for aliases

2003-12-05 Thread Doug Cutting
Position increments are for relative token positions.  A position 
increment of zero means that a token is logically at the same position 
as the previous token.  A position increment of one means that a token 
immediately follows the preceding token in the stream, it's the next 
token to the right (in a left-to-right language).  A position increment 
of two means that it is two tokens past the previous token, that there's 
a phantom token between them, inhibiting exact phrase matches.

You're setting the position increment to things based on the number of 
characters in the token's text.  That makes no sense.  Token positions 
are not character positions.  I think what you want to do is use a 
positionIncrement of zero, so the tokens lie at the same position.
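
A minimal sketch of that suggestion (this is not the poster's AliasFilter; the alias map is an ordinary java.util.Map supplied by the caller, and the Lucene 1.3-era Token/TokenFilter API is assumed): the alias is emitted as an extra token whose position increment is zero, so it sits at the same position as the original.

    // Sketch: inject an alias token at the same position as the original token.
    import java.io.IOException;
    import java.util.Map;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class SimpleAliasFilter extends TokenFilter {
      private final Map aliasMap;   // e.g. "spitline" -> "splitline"
      private Token pendingAlias;   // alias waiting to be returned

      public SimpleAliasFilter(TokenStream in, Map aliasMap) {
        super(in);
        this.aliasMap = aliasMap;
      }

      public Token next() throws IOException {
        if (pendingAlias != null) {             // emit a queued alias first
          Token alias = pendingAlias;
          pendingAlias = null;
          return alias;
        }
        Token token = input.next();
        if (token == null) return null;
        String alias = (String) aliasMap.get(token.termText());
        if (alias != null) {
          Token t = new Token(alias, token.startOffset(), token.endOffset());
          t.setPositionIncrement(0);            // same position as the original
          pendingAlias = t;
        }
        return token;
      }
    }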

Doug

Allen Atamer wrote:
The FAQ describes implementing a TokenFilter for applying aliases. I have
trouble accomplishing this.
 
This is the code that I have so far for the next Method within AliasFilter.
After reading some posts, I also got the idea to call
setPositionIncrement(). Neither way works, because when I search for the
alias, no search results come back.
 
Thank you for your help,
 
Allen Atamer
 

 
  public Token next() throws java.io.IOException {
    Token token = tokenStream.next();

    if (aliasMap == null || token == null) {
      return token;
    }

    TermData t = (TermData) aliasMap.get(token.termText());

    if (t == null) {
      return token;
    }

    String tokenText = AliasManager.replaceIgnoreCase(
        token.termText(), t.getTerm(), t.getTeach());

    int increment = tokenText.length() - token.termText().length();
    if (increment > 0) {
      token.setPositionIncrement(increment);
    }

    return new Token(tokenText, token.startOffset(), token.endOffset());
  }
 
 
 
 
 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: How would you delete an entry that was indexed like this

2003-12-05 Thread Aviran
This is kind of a problem: in order to delete documents using terms you need
to have a keyword field which contains a unique value, otherwise you might
end up deleting more than you want.
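
A minimal sketch of deleting by term, assuming the documents were indexed with a unique keyword field (the "id" field name here is invented):

    // Sketch: delete every document whose keyword field "id" equals the given
    // value. IndexReader.delete(Term) returns how many documents were deleted.
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class DeleteById {
      public static int delete(String indexPath, String id) throws java.io.IOException {
        IndexReader reader = IndexReader.open(indexPath);
        try {
          return reader.delete(new Term("id", id));
        } finally {
          reader.close();
        }
      }
    }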

-Original Message-
From: Mike Hogan [mailto:[EMAIL PROTECTED] 
Sent: Friday, December 05, 2003 1:06 PM
To: [EMAIL PROTECTED]
Subject: How would you delete an entry that was indexed like this


Hi,

If I index a document like this:

IndexWriter writer = createWriter();
Document document = new Document();
document.add(Field.Text(ID_FIELD_NAME, componentId));
document.add(Field.Text(CONTENTS_FIELD_NAME, componentDescription));
writer.addDocument(document);
writer.optimize();
writer.close();

What code must I execute to later delete the document? (I tried following the
docs and what's done in the code and test cases.  I saw Terms being used to
identify the document to delete.  But I am not clear what value to put in the
Term, as I do not know how Terms relate to Fields.)

Many thanks,
Mike.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Returning one result

2003-12-05 Thread Erik Hatcher
On Friday, December 5, 2003, at 01:25  PM, Pleasant, Tracy wrote:
Say ID is Ar3453 .. well the user may want to search for Ar3453, so in
order for it to be searchable then it would have to be indexed and not 
a
keyword.
*arg* - we're having a serious communication issue here.  My advice to 
you is to actually write some simple tests (test-driven learning using 
JUnit is a wonderful way to experiment with Lucene, especially thanks 
to the RAMDirectory).  Please refer to my articles at java.net as well 
as the other great Lucene articles out there.

Let me try again: a Field.Keyword *IS* indexed!  Even Lucene's 
javadocs say this for this method:

  /** Constructs a String-valued Field that is not tokenized, but is 
*indexed* and stored.  Useful for non-text fields, e.g. date or url.  */

[I added the emphasis there]


So after using
TermQuery query = new TermQuery(new Term(id, term));
How would I return the other fields in the document?

For instance to display a record it would get the record with the id #
and then display the title, contents, etc.
Umm... you'd use it *exactly* the same way as if you had used 
QueryParser.  QueryParser would create a TermQuery for you, in fact, 
except it would analyze your text first, which is what you want to 
avoid, right?

Hits.doc(n) gives you back a Document.  And then 
Document.get(fieldName) gives you back the fields (as long as you 
*stored* them in the index too).
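
In code that looks something like this (a fragment-sized sketch; the stored field names are invented and the Hits object comes from a previous search):

    // Sketch: read stored fields back from search results.
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.Hits;

    public class HitPrinter {
      // Only fields that were *stored* at indexing time come back from hits.doc(i).
      public static void print(Hits hits) throws java.io.IOException {
        for (int i = 0; i < hits.length(); i++) {
          Document d = hits.doc(i);
          System.out.println(d.get("id") + " : " + d.get("title"));
        }
      }
    }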

Again, please attempt some of these things in code.  It is a trivial 
matter to index and search using RAMDirectory and experiment with 
TermQuery, QueryParser, Analyzers, etc.

	Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Returning one result

2003-12-05 Thread Dror Matalon
On Fri, Dec 05, 2003 at 01:25:23PM -0500, Pleasant, Tracy wrote:
 What I meant is.
 
 Say ID is Ar3453 .. well the user may want to search for Ar3453, so in
 order for it to be searchable then it would have to be indexed and not a
 keyword.

No. You should store it as a keyword. 

From the javadocs:
Keyword(String name, String value)
  Constructs a String-valued Field that is not tokenized, but is
indexed and stored.


 
 So after using
 TermQuery query = new TermQuery(new Term(id, term));
 
 How would I return the other fields in the document?
 
 For instance to display a record it would get the record with the id #
 and then display the title, contents, etc.
 
 
 
 
 -Original Message-
 From: Erik Hatcher [mailto:[EMAIL PROTECTED]
 Sent: Friday, December 05, 2003 11:32 AM
 To: Lucene Users List
 Subject: Re: Returning one result
 
 
 On Friday, December 5, 2003, at 10:41  AM, Pleasant, Tracy wrote:
  Maybe I should have been more clear.
 
  static Field Keyword(String name, String value)
Constructs a String-valued Field that is not tokenized, but 
  is
  indexed and stored.
 
  I need to have it tokenized because people will search for that also 
  and
  it needs to be searchable.
 
 Search for *what* also?  Tokenized means that it is broken into pieces 
 which will be separate terms.  For example: see spot is tokenized 
 into see and spot, and searching for either of those terms will 
 match.
 
 Just try it and see, please!  :)
 
  Should I have two fields - one as a keyword and one as text?
 
 Depends on what you're doing... but an id field to me indicates 
 Field.Keyword to me, only.
 
  How would I do that when I want to return search results..
 
   Searcher searcher = new IndexSearcher(index);
   String term = request.getParameter(id);
 
   Query query = QueryParser.parse(term, id, new
  StandardAnalyzer());
 
   Hits hits  = searcher.search(query);
 
  Would it have to be something like:
   TermQuery query = ???
 
 Yes.  TermQuery query = new TermQuery(new Term(id, term));
 
 Use searcher.search exactly as you did before.  Just don't use 
 QueryParser to construct a query.
 
   Erik
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Returning one result

2003-12-05 Thread Pleasant, Tracy
Maybe we are having some communication issues. 

At any rate, I did index it as a KEYWORD and when displaying used the
TermQuery.

The only problem with this, though, is that by storing the ID (i.e. AR345) as
a Keyword, if I search for AR345 no results are returned when I use the
MultiFieldQueryParser.

*sigh* *arg*



-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, December 05, 2003 2:13 PM
To: Lucene Users List
Subject: Re: Returning one result


On Friday, December 5, 2003, at 01:25  PM, Pleasant, Tracy wrote:
 Say ID is Ar3453 .. well the user may want to search for Ar3453, so in
 order for it to be searchable then it would have to be indexed and not

 a
 keyword.

*arg* - we're having a serious communication issue here.  My advice to 
you is to actually write some simple tests (test-driven learning using 
JUnit is a wonderful way to experiement with Lucene, especially thanks 
to the RAMDirectory).  Please refer to my articles at java.net as well 
as the other great Lucene articles out there.

Let me try again a Field.Keyword *IS* indexed!  Even Lucene's 
javadocs say this for this method:

   /** Constructs a String-valued Field that is not tokenized, but is 
 indexed
 and stored.  Useful for non-text fields, e.g. date or url.  */

[I added the emphasis there]


 So after using
 TermQuery query = new TermQuery(new Term(id, term));

 How would I return the other fields in the document?

 For instance to display a record it would get the record with the id #
 and then display the title, contents, etc.

Umm you'd use *exactly* the same way as if you had used 
QueryParser.  QueryParser would create a TermQuery for you, in fact, 
except it would analyze your text first, which is what you want to 
avoid, right?

Hits.doc(n) gives you back a Document.  And then 
Document.get(fieldName) gives you back the fields (as long as you  
stored  them in the index too).

Again, please attempt some of these things in code.  It is a trivial 
matter to index and search using RAMDirectory and experiment with 
TermQuery, QueryParser, Analyzers, etc.

Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Returning one result

2003-12-05 Thread Pleasant, Tracy
Thanks, but using it as a Keyword, it will not get returned with my
search results when I use MultiFieldQueryParser.

If I could I would use just parse(query) but that is not a static
method, only parse(query,field,analyzer) is... So when I do that and use
an analyzer, the keyword field isn't searched.



-Original Message-
From: Dror Matalon [mailto:[EMAIL PROTECTED]
Sent: Friday, December 05, 2003 2:14 PM
To: Lucene Users List
Subject: Re: Returning one result


On Fri, Dec 05, 2003 at 01:25:23PM -0500, Pleasant, Tracy wrote:
 What I meant is.
 
 Say ID is Ar3453 .. well the user may want to search for Ar3453, so in
 order for it to be searchable then it would have to be indexed and not
a
 keyword.

No. You should store it as a keyword. 

From the javadocs:
Keyword(String name, String value)
  Constructs a String-valued Field that is not tokenized, but is
indexed and stored.


 
 So after using
 TermQuery query = new TermQuery(new Term(id, term));
 
 How would I return the other fields in the document?
 
 For instance to display a record it would get the record with the id #
 and then display the title, contents, etc.
 
 
 
 
 -Original Message-
 From: Erik Hatcher [mailto:[EMAIL PROTECTED]
 Sent: Friday, December 05, 2003 11:32 AM
 To: Lucene Users List
 Subject: Re: Returning one result
 
 
 On Friday, December 5, 2003, at 10:41  AM, Pleasant, Tracy wrote:
  Maybe I should have been more clear.
 
  static Field Keyword(String name, String value)
Constructs a String-valued Field that is not tokenized,
but 
  is
  indexed and stored.
 
  I need to have it tokenized because people will search for that also

  and
  it needs to be searchable.
 
 Search for *what* also?  Tokenized means that it is broken into pieces

 which will be separate terms.  For example: see spot is tokenized 
 into see and spot, and searching for either of those terms will 
 match.
 
 Just try it and see, please!  :)
 
  Should I have two fields - one as a keyword and one as text?
 
 Depends on what you're doing... but an id field to me indicates 
 Field.Keyword to me, only.
 
  How would I do that when I want to return search results..
 
   Searcher searcher = new IndexSearcher(index);
   String term = request.getParameter(id);
 
   Query query = QueryParser.parse(term, id, new
  StandardAnalyzer());
 
   Hits hits  = searcher.search(query);
 
  Would it have to be something like:
   TermQuery query = ???
 
 Yes.  TermQuery query = new TermQuery(new Term(id, term));
 
 Use searcher.search exactly as you did before.  Just don't use 
 QueryParser to construct a query.
 
   Erik
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Returning one result

2003-12-05 Thread Dror Matalon

On Fri, Dec 05, 2003 at 03:14:08PM -0500, Pleasant, Tracy wrote:
 What do you mean 'add' in MultiFieldQueryParser?  I am using all the
 fields 

Sorry, that was wrong. What I meant to say is: are you adding the field
to the array of fields that need to be searched? 

You need to use a MultiFieldQueryParser and pass it the array of fields
that you want searched.
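
A sketch of what that looks like (the field names are invented; the id field still has to be indexed, e.g. as a Field.Keyword, for anything to match):

    // Sketch: parse one user query against several fields at once.
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.MultiFieldQueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class MultiFieldSearch {
      public static Hits search(String indexPath, String userQuery) throws Exception {
        String[] fields = { "id", "title", "contents" };   // every field to search
        Query query = MultiFieldQueryParser.parse(userQuery, fields,
                                                  new StandardAnalyzer());
        IndexSearcher searcher = new IndexSearcher(indexPath);
        return searcher.search(query);   // searcher left open while Hits are used
      }
    }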

Dror

 
 When I index it does 
 
  add (Field.Keyword(..,..))
 
 
 But I don't want the user to have to type "ID:ID NUMBER". It would be
 nice to just type the ID number. On your site, if you just put 11183 in the
 search box there are no results. 
 
 well, right now I'll just do it as text and query that field for the id
 # to display the document.  It can't hurt, right? :)  Unless the Keyword
 is a better way
 
 
 
 -Original Message-
 From: Dror Matalon [mailto:[EMAIL PROTECTED]
 Sent: Friday, December 05, 2003 3:06 PM
 To: Lucene Users List
 Subject: Re: Returning one result
 
 
 On Fri, Dec 05, 2003 at 02:45:34PM -0500, Pleasant, Tracy wrote:
  Maybe we are having some communication issues. 
  
  At any rate, I did index it as a KEYWORD and when displaying used the
  TermQuery.
  
  The only problem with this though is by storing the ID (i.e. AR345) as
 a
  Keyword, if I search for AR345 no results are returned when I use the
  MultiFieldQueryParser .
  
  *sigh* *arg*
 
 OK. 
 
  Go to http://www.fastbuzz.com/search/index.jsp and type "lucene" (without
  the quotes) and hit search. You get results from different channels/rss
  feeds.
  
  Now type "lucene channel:11183" (without the quotes) and hit search. You
  get results only from Java-Channel. 
  
  We're inserting the field "channel" as a keyword, and it does what I
  understand you want to do with AR345.
 
 I would guess that in MultiFieldQueryParser you are not doing an add()
 of the field for AR345 which is why the search fails. 
 
 Regards,
 
 Dror
 
 
  
  
  
  -Original Message-
  From: Erik Hatcher [mailto:[EMAIL PROTECTED]
  Sent: Friday, December 05, 2003 2:13 PM
  To: Lucene Users List
  Subject: Re: Returning one result
  
  
  On Friday, December 5, 2003, at 01:25  PM, Pleasant, Tracy wrote:
   Say ID is Ar3453 .. well the user may want to search for Ar3453, so
 in
   order for it to be searchable then it would have to be indexed and
 not
  
   a
   keyword.
  
  *arg* - we're having a serious communication issue here.  My advice to
 
  you is to actually write some simple tests (test-driven learning using
 
  JUnit is a wonderful way to experiement with Lucene, especially thanks
 
  to the RAMDirectory).  Please refer to my articles at java.net as well
 
  as the other great Lucene articles out there.
  
  Let me try again a Field.Keyword *IS* indexed!  Even Lucene's 
  javadocs say this for this method:
  
 /** Constructs a String-valued Field that is not tokenized, but is 
   indexed
   and stored.  Useful for non-text fields, e.g. date or url.  */
  
  [I added the emphasis there]
  
  
   So after using
   TermQuery query = new TermQuery(new Term(id, term));
  
   How would I return the other fields in the document?
  
   For instance to display a record it would get the record with the id
 #
   and then display the title, contents, etc.
  
  Umm you'd use *exactly* the same way as if you had used 
  QueryParser.  QueryParser would create a TermQuery for you, in fact, 
  except it would analyze your text first, which is what you want to 
  avoid, right?
  
  Hits.doc(n) gives you back a Document.  And then 
  Document.get(fieldName) gives you back the fields (as long as you
  
  stored  them in the index too).
  
  Again, please attempt some of these things in code.  It is a trivial 
  matter to index and search using RAMDirectory and experiment with 
  TermQuery, QueryParser, Analyzers, etc.
  
  Erik
  
  
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
  
  
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
  
 
 -- 
 Dror Matalon
 Zapatec Inc 
 1700 MLK Way
 Berkeley, CA 94709
 http://www.fastbuzz.com
 http://www.zapatec.com
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Returning one result

2003-12-05 Thread Dror Matalon

Mike,

Boy, I said it so badly and yet you understood :-).

Dror

On Fri, Dec 05, 2003 at 03:31:15PM -0500, Michael Giles wrote:
 Tracy,
 
 I believe what Dror was referring to was the call to 
 MultiFieldQueryParser.parse(). The second argument to that call is a 
 String[] of field names on which to execute the query.  If the field that 
 contains AR345 isn't listed in that array, you will not get any results.
 
 -Mike
 
 At 03:14 PM 12/5/2003, you wrote:
 What do you mean 'add' in MultiFieldQueryParser?  I am using all the
 fields
 
 When I index it does
 
  add (Field.Keyword(..,..))
 
 
 But I don't want the user to have to type ID:ID NUMBER It would be
 nice to just type ID Number. On your site if you just put: 11183 in the
 search box there are no results.
 
 well, right now I'll just do it as text and query that field for the id
 # to display the document.  It can't hurt, right? :)  Unless the Keyword
 is a better way
 
 
 
 -
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Returning one result

2003-12-05 Thread Dror Matalon
Then I'm out of ideas.  The next thing is for you to post your search
code so we can see why it's not searching the field.

On Fri, Dec 05, 2003 at 03:34:38PM -0500, Pleasant, Tracy wrote:
 Yes it is in the list of arrays that I want searched.
 
 -Original Message-
 From: Dror Matalon [mailto:[EMAIL PROTECTED]
 Sent: Friday, December 05, 2003 3:32 PM
 To: Lucene Users List
 Subject: Re: Returning one result
 
 
 
 On Fri, Dec 05, 2003 at 03:14:08PM -0500, Pleasant, Tracy wrote:
  What do you mean 'add' in MultiFieldQueryParser?  I am using all the
  fields 
 
 Sorry, that was wrong. What I meant to say is are you adding the field
 to the array of fields that need to be searched? 
 
 You need to use a MultiFieldQueryParser and pass it the array of fields
 that you want searched.
 
 Dror
 
  
  When I index it does 
  
   add (Field.Keyword(..,..))
  
  
  But I don't want the user to have to type ID:ID NUMBER It would be
  nice to just type ID Number. On your site if you just put: 11183 in
 the
  search box there are no results. 
  
  well, right now I'll just do it as text and query that field for the
 id
  # to display the document.  It can't hurt, right? :)  Unless the
 Keyword
  is a better way
  
  
  
  -Original Message-
  From: Dror Matalon [mailto:[EMAIL PROTECTED]
  Sent: Friday, December 05, 2003 3:06 PM
  To: Lucene Users List
  Subject: Re: Returning one result
  
  
  On Fri, Dec 05, 2003 at 02:45:34PM -0500, Pleasant, Tracy wrote:
   Maybe we are having some communication issues. 
   
   At any rate, I did index it as a KEYWORD and when displaying used
 the
   TermQuery.
   
   The only problem with this though is by storing the ID (i.e. AR345)
 as
  a
   Keyword, if I search for AR345 no results are returned when I use
 the
   MultiFieldQueryParser .
   
   *sigh* *arg*
  
  OK. 
  
  Go to http://www.fastbuzz.com/search/index.jsp and type lucene
 without
  the quotes  and hit search. You get results from different
 channels/rss
  feeds.
  
  Now type lucene channel:11183 without the quotes and hit search. You
  get results only from Java-Channel. 
  
  We're inserting the field channel as a keyword, and it does what I
  understand you want to use AR345.
  
  I would guess that in MultiFieldQueryParser you are not doing an add()
  of the field for AR345 which is why the search fails. 
  
  Regards,
  
  Dror
  
  
   
   
   
   -Original Message-
   From: Erik Hatcher [mailto:[EMAIL PROTECTED]
   Sent: Friday, December 05, 2003 2:13 PM
   To: Lucene Users List
   Subject: Re: Returning one result
   
   
   On Friday, December 5, 2003, at 01:25  PM, Pleasant, Tracy wrote:
Say ID is Ar3453 .. well the user may want to search for Ar3453,
 so
  in
order for it to be searchable then it would have to be indexed and
  not
   
a
keyword.
   
   *arg* - we're having a serious communication issue here.  My advice
 to
  
   you is to actually write some simple tests (test-driven learning
 using
  
   JUnit is a wonderful way to experiement with Lucene, especially
 thanks
  
   to the RAMDirectory).  Please refer to my articles at java.net as
 well
  
   as the other great Lucene articles out there.
   
   Let me try again a Field.Keyword *IS* indexed!  Even Lucene's 
   javadocs say this for this method:
   
  /** Constructs a String-valued Field that is not tokenized, but
 is 
indexed
and stored.  Useful for non-text fields, e.g. date or url.  */
   
   [I added the emphasis there]
   
   
So after using
TermQuery query = new TermQuery(new Term(id, term));
   
How would I return the other fields in the document?
   
For instance to display a record it would get the record with the
 id
  #
and then display the title, contents, etc.
   
   Umm you'd use *exactly* the same way as if you had used 
   QueryParser.  QueryParser would create a TermQuery for you, in fact,
 
   except it would analyze your text first, which is what you want to 
   avoid, right?
   
   Hits.doc(n) gives you back a Document.  And then 
   Document.get(fieldName) gives you back the fields (as long as you
   
   stored  them in the index too).
   
   Again, please attempt some of these things in code.  It is a trivial
 
   matter to index and search using RAMDirectory and experiment with 
   TermQuery, QueryParser, Analyzers, etc.
   
 Erik
   
   
  
 -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
   
   
  
 -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
   
  
  -- 
  Dror Matalon
  Zapatec Inc 
  1700 MLK Way
  Berkeley, CA 94709
  http://www.fastbuzz.com
  http://www.zapatec.com
  
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  

Re: Returning one result

2003-12-05 Thread Erik Hatcher
On Friday, December 5, 2003, at 04:28  PM, Dror Matalon wrote:
Then I'm out of ideas.  The next thing is for you to post your search
code so we can see why it's not searching the field.
Giving up so easily, Dror?!  :))

The problem is, when using any type of QueryParser with a Keyword 
field, you have to then be careful about analysis.  My guess is that at 
query parsing time the analyzer is stripping numbers or in some way 
mangling the id.

Look back in the e-mail archives for my AnalyzerUtils, run a string 
containing just a sample id through it using the analyzer you are using 
in your real code and see what comes out.

Again, Tracy, please read the articles at java.net on Lucene - and 
there is one on QueryParser too.  You are definitely having a learning 
curve situation here and aren't quite in the zone of Lucene 
understanding yet, that is why folks here are getting frustrated with 
your questions.  We are hanging in there with you though and will get 
you through this.  I'll give you some pointers here - in the latest 
Lucene 1.3 versions, there is a PerFieldAnalyzerWrapper that might come 
in handy here - otherwise you might consider using a different analyzer.
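
A sketch of the PerFieldAnalyzerWrapper idea (assuming a recent 1.3 build; the field names are invented): most fields go through StandardAnalyzer, while the id field gets a WhitespaceAnalyzer so its values are not mangled, and the same wrapper should be used at both indexing and query-parsing time.

    // Sketch: a different analyzer for the "id" field than for everything else.
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class PerFieldExample {
      public static Query parse(String userQuery) throws Exception {
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        analyzer.addAnalyzer("id", new WhitespaceAnalyzer());  // leave ids alone

        return QueryParser.parse(userQuery, "contents", analyzer);
      }
    }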

A good first pass is to experiment with the WhitespaceAnalyzer and be 
sure to phrase your test queries with the same case you indexed with.  
I believe you'll find that it will work.  If it works then, you will 
have a very good clue that the analyzer is the problem.  At that point, 
go and read those java.net articles I wrote, especially the first one 
having to do with analyzers.

	Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


write.lock

2003-12-05 Thread Aaron Galea
Hi

I am starting to get an error about a write.lock in Lucene when creating an index in 
an empty directory. It used to work fine before, but now it has started to occur and, as far 
as I know, I didn't touch anything. Printing out the stack trace from the exception 
thrown, I get the following:

java.io.IOException: couldn't delete write.lock
at org.apache.lucene.store.FSDirectory.create(Unknown Source)
at org.apache.lucene.store.FSDirectory.getDirectory(Unknown Source)
at org.apache.lucene.store.FSDirectory.getDirectory(Unknown Source)
at org.apache.lucene.index.IndexWriter.init(Unknown Source)
at qa.answerextraction.AnswerExtractionImpl.processDocument(Unknown Source)
at qa.answerextraction.AnswerExtractionServerPOA._invoke(Unknown Source)   
 at org.jacorb.poa.RequestProcessor.invokeOperation(Unknown Source)
at org.jacorb.poa.RequestProcessor.process(Unknown Source)
at org.jacorb.poa.RequestProcessor.run(Unknown Source)

The code creating this problem is:

IndexWriter writer;

try {
    writer = new IndexWriter(indexLocation, sa, false);
} catch (java.io.IOException e) {
    writer = new IndexWriter(indexLocation, sa, true);
}

This problem only happens when indexing the very first file. After that it works fine. 
All it seems to need in the directory is a segments file. 

Could anyone explain to me the problem or what I am doing wrong in it?
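
One way to avoid keying off the IOException (a sketch, assuming IndexReader.indexExists is available in the Lucene build in use) is to ask up front whether an index already exists and set the create flag from that:

    // Sketch: decide the "create" flag up front instead of catching IOException.
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;

    public class WriterFactory {
      public static IndexWriter open(String indexLocation, Analyzer analyzer)
          throws java.io.IOException {
        boolean create = !IndexReader.indexExists(indexLocation);
        return new IndexWriter(indexLocation, analyzer, create);
      }
    }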

regards 
Aaron 





Sent through the WebMail system at nextgen.net.mt

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index and Field.Text

2003-12-05 Thread Tatu Saloranta
On Friday 05 December 2003 10:45, Doug Cutting wrote:
 Tatu Saloranta wrote:
  Also, shouldn't there be at least 3 methods that take Readers; one for
  Text-like handling, another for UnStored, and last for UnIndexed.

 How do you store the contents of a Reader?  You'd have to double-buffer
 it, first reading it into a String to store, and then tokenizing the
 StringReader.  A key feature of Reader values is that they're streamed:

Not really, you can pass a Reader to the tokenizer, which then reads and tokenizes 
directly (I think that's the way the code works, too). This is because internally a 
String is read using a StringReader, so passing a String looks more like a 
convenience feature?

 the entire value is never in RAM.  Storing a Reader value would remove
 that advantage.  The current API makes this explicit: when you want
 something streamed, you pass in a Reader, when you're willing to have
 the entire value in memory, pass in a String.

I guess for things that are both tokenized and stored, passing a Reader can't 
really help a lot; if one wants to reduce mem usage, text needs to be read 
twice, or analyzer needs to help in writing output; or, text needs to be read 
in-memory much like what happens now. It'd simplify application code a bit, 
but wouldn't do much more.

So I guess I need to downgrade my suggestion to require just 2 
Reader-taking factory methods? :-)
I still think that index-only and store-only versions would both make sense. In 
the latter case, storing could be done in a fully streaming fashion; in the former, 
tokenization can be done.

 Yes, it is a bit confusing that Text(String, String) stores its value,
 while Text(String, Reader) does not, but it is at least well documented.
   And we cannot change it: that would break too many applications.  But
 we can put this on the list for Lucene 2.0 cleanups.

Yes, I understand that. It would not be reasonable to make such a change. But how 
about adding a more intuitive factory method (UnStored(String, Reader))?

 When I first wrote these static methods I meant for them to be
 constructor-like.  I wanted to have multiple Field(String, String)
 constructors, but that's not possible, so I used capitalized static
 methods instead.  I've never seen anyone else do this (capitalize any
 method but a real constructor) so I guess I didn't start a fad!  This

:-)

 should someday too be cleaned up.  Lucene was the first Java program
 that I ever wrote, and thus its style is in places non-standard.  Sorry.

Best standards are created by people doing things others use, follow or 
imitate... so it was worth a try! :-)

-+ Tatu +-


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]