RE: HITCOLLECTOR+SCORE+DELIMA

2004-12-13 Thread Vikas Gupta
 On Dec 10, 2004, at 7:39 AM, Karthik N S wrote:
  I am still in delima on How to use the HitCollector for returning
  Hits hits
  between scores  0.2f to 1.0f ,
 
  There is not a simple example for the same, yet lot's of talk on usage
  for
  the same on the form.

1) I am not 100% sure about this but it might work.

Add the score-range condition (marked "added" below) to the check in
IndexSearcher.java::search()

  // inherit javadoc
  public TopDocs search(Query query, Filter filter, final int nDocs)
       throws IOException {
    Scorer scorer = query.weight(this).scorer(reader);
    if (scorer == null)
      return new TopDocs(0, new ScoreDoc[0]);

    final BitSet bits = filter != null ? filter.bits(reader) : null;
    final HitQueue hq = new HitQueue(nDocs);
    final int[] totalHits = new int[1];
    scorer.score(new HitCollector() {
        public final void collect(int doc, float score) {
          if (score > 0.0f &&                     // ignore zeroed buckets
              score > 0.2f && score < 1.0f &&     // added: keep only scores in range
              (bits == null || bits.get(doc))) {  // skip docs not in bits
            totalHits[0]++;
            hq.insert(new ScoreDoc(doc, score));
          }
        }
      });
    // ... (rest of search() as in the original source)
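
A standalone sketch of the same idea, written so it runs without patching Lucene: a collect(doc, score) callback that keeps only scores inside a range and retains the best N in a bounded queue. ScoreCallback, ScoreRangeCollector, and Hit are hypothetical stand-ins, not Lucene classes; in Lucene 1.4 the equivalent hook would be a HitCollector subclass passed to Searcher.search(Query, HitCollector). Note that the scores a HitCollector receives are raw rather than normalized, so they can exceed 1.0f, and a fixed 0.2f to 1.0f window may not mean the same thing as it does for hits.score(i).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a collect(doc, score) callback; not a Lucene type. */
interface ScoreCallback {
    void collect(int doc, float score);
}

/** Keeps the best maxHits documents whose score lies strictly between min and max. */
final class ScoreRangeCollector implements ScoreCallback {
    /** A (doc, score) pair, analogous in spirit to a ScoreDoc. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final float min;
    private final float max;
    private final int maxHits;
    private int totalInRange = 0;
    // Min-heap on score: the weakest kept hit sits on top and is evicted first.
    private final PriorityQueue<Hit> queue =
        new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));

    ScoreRangeCollector(float min, float max, int maxHits) {
        this.min = min;
        this.max = max;
        this.maxHits = maxHits;
    }

    @Override
    public void collect(int doc, float score) {
        if (score > min && score < max) {   // the range test from the patch above
            totalInRange++;
            queue.add(new Hit(doc, score));
            if (queue.size() > maxHits) {
                queue.poll();               // drop the lowest-scoring hit
            }
        }
    }

    /** Number of documents that passed the range test, kept or not. */
    int totalInRange() { return totalInRange; }

    /** Kept hits, highest score first. */
    List<Hit> topHits() {
        List<Hit> hits = new ArrayList<>(queue);
        hits.sort((a, b) -> Float.compare(b.score, a.score));
        return hits;
    }
}
```

Feeding every (doc, score) pair through collect() and then calling topHits() gives the in-range results without touching any stored document.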



2) Filter examples are in Lucene in Action book, Chapter 5. I wrote an
example as well:



String query = "odyssey";

BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(new Term("content", query)), true, false);

BooleanQuery bqf = new BooleanQuery();
bqf.add(new TermQuery(new Term("H2", query)), true, false);

Filter f = new QueryFilter(bqf);

IndexReader reader = IndexReader.open(new File(dir, "index").getCanonicalPath());
Searcher luceneSearcher = new org.apache.lucene.search.IndexSearcher(reader);
luceneSearcher.setSimilarity(new NutchSimilarity());

// Logically, the following is executed as: find all the docs
// matching bq, then select the ones which also match bqf.
Hits hits = luceneSearcher.search(bq, f);

System.out.print("query: " + query);

System.out.println("Total hits: " + hits.length());

3) delima is spelled as dilemma


-Vikas Gupta

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: HITCOLLECTOR+SCORE+DELIMMA

2004-12-13 Thread Karthik N S
Hi

Vikas Gupta


Since Erik replied to my last mail, a filter can be built to
fetch scores between 0.2f and 1.0f.

Can you please spare me some code for the same?

[ Sorry for the spelling mistake; my mail client does not have a spell checker ]
With regards
Karthik




-Original Message-
From: Vikas Gupta [mailto:[EMAIL PROTECTED]
Sent: Monday, December 13, 2004 3:17 PM
To: Lucene Users List
Subject: RE: HITCOLLECTOR+SCORE+DELIMA




Re: HITCOLLECTOR+SCORE+DELIMMA

2004-12-13 Thread Nader Henein
Dude, and I say this with love, it's open source, you've got the code, 
take the initiative, DIY, be creative and share your findings with the 
rest of us.

Personally I would be interested to see how you do this, keep your 
changes documented and share.

Nader Henein
Karthik N S wrote:
Hi Erik
Apologies..
I got Confused with the last mail.
 

Iterating over Hits returns a large number of hits, and iterating over Hits for
scores consumes time,
so how do I limit my search to scores between [ X.xf and Y.yf ] prior to getting the Hits?
Note: the search is being done on a field of type 'Text' that consists of the
contents of various HTML documents.
Please advise me.
Karthik

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Monday, December 13, 2004 5:05 PM
To: Lucene Users List
Subject: Re: HITCOLLECTOR+SCORE+DELIMA

On Dec 13, 2004, at 1:16 AM, Karthik N S wrote:
 

So u say I have to Build a Filter to Collect all the Scores between
the 2
Ranges [ 0.2f to 1.0f]
   

My message is being misinterpreted.  I said filter as a verb, not a
noun.  :)  In other words, I was not intending to mean write a Filter -
a Filter would not be able to filter on score.
 

so the API for the same would be
Hits hit = search(Query query, Filter filtertoGetScore)
But while writing the Filter  Score again depends on Hits  
Score =
hits.score(x);
   

Again, you cannot write a Filter (capital 'F') to deal with score.
Please re-read what I said below...
 

Hits are in descending score
order, so you may just want to use Hits and filter based on the score
provided by hits.score(i).
   

Iterate over Hits... when you encounter scores below your desired
range, stop iterating.  Why is this simple procedure not good enough
for what you are trying to achieve?
Erik


Re: HITCOLLECTOR+SCORE+DELIMMA

2004-12-13 Thread Erik Hatcher
On Dec 13, 2004, at 6:58 AM, Karthik N S wrote:
Iterate over Hits.  returns large hit values and Iteration on Hits 
for
scores consumes time ,
so How Do I Limit my Search Between [ X.xf to Y.yf ] prior getting the 
Hits.
Why do you need to do this *prior* to getting Hits?
You have yet to justify what you're asking.  I almost guarantee you 
that navigating Hits in the way I said will be as fast as you need it 
to be.

Erik


Re: LIMO problems

2004-12-13 Thread Daniel Cortes
Hi, I want to know which library you use to search PPT files.
Does POI support this?
Thanks


Re: sorting tokenized field

2004-12-13 Thread Praveen Peddi
If it's not already in the release code, is there any reason it has not been
added? Many people seem to agree that this is important sorting
functionality.

It's just that I can't get permission to use customized libraries in our
company. Either we have to use the library as is or implement our own stuff.
We don't want to go through the pain of maintaining third-party library code
whenever we migrate from one version to another. I would assume everyone would
have the same problem.

Is there any possibility that the patch contributed by Aviran can be added to
the actual release branch?

Thanks
Praveen
- Original Message - 
From: Aviran [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Monday, December 13, 2004 11:30 AM
Subject: RE: sorting tokenized field

The patch is very simple.
What it does is check whether the field you want to sort on is tokenized. If
it is, it loads the values from the documents into the sorting table.
The only drawback of this approach is that loading the values this way is much
slower than if the values were Keywords, but other than that it should work
just fine.
Aviran
http://www.aviransplace.com
-Original Message-
From: Praveen Peddi [mailto:[EMAIL PROTECTED]
Sent: Monday, December 13, 2004 10:48 AM
To: lucenelist
Subject: Fw: sorting tokenized field
Hi all,
I am forwarding the same email I sent before. Just wanted to try my luck again :).
Thanks in advance.
Praveen
- Original Message - 
From: Praveen Peddi [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, December 10, 2004 3:33 PM
Subject: Re: sorting tokenized field


Since I am not very familiar with the Lucene code, I couldn't make much out of
your patch. But has this patch already been tested and proven to be efficient?
If so, why can't it be merged into the Lucene code and made part of the
release? I think the bug is valid. It's very likely that people want to
sort on tokenized fields.
If I apply this patch to the Lucene code and use it myself, I will have a
hard time managing it in future (while upgrading the Lucene library). If the
patch is applied to the Lucene release code, it would be very easy for
Lucene users.
If possible, can someone explain what the patch does? I am trying to
understand what exactly changed but could not figure it out.
Praveen
- Original Message -
From: Aviran [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Friday, December 10, 2004 2:30 PM
Subject: RE: sorting tokenized field

I have suggested a solution for this problem (
http://issues.apache.org/bugzilla/show_bug.cgi?id=30382 ) you can use
the  patch suggested there and recompile lucene.
Aviran
http://www.aviransplace.com
-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, December 10, 2004 13:53 PM
To: Lucene Users List
Subject: Re: sorting tokenized field

On Dec 10, 2004, at 1:40 PM, Praveen Peddi wrote:
I read that the tokenised fields cannot be sorted. In order to sort
tokenized field, either the application has to duplicate field with
diff name and not tokenize it or come up with something else. But
shouldn't the search engine takecare of this? Are there any plans of
putting this functionality built into lucene?
It would be wasteful for Lucene to assume any field you add should be
available for sorting.
Adding one more line to your indexing code to accommodate your
sorting needs seems a pretty small price to pay.  Do you have suggestions to
improve how this works?  Or how it is documented?
Erik


Re: sorting tokenized field

2004-12-13 Thread Erik Hatcher
On Dec 13, 2004, at 2:22 PM, Praveen Peddi wrote:
If its not added to the release code already, is there any reason for 
it being not added.
As noted, there is a performance issue with sorting by tokenized 
fields.  It would seem far more advisable for you to simply add another 
field used for sorting which is untokenized.
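
A plain-Java model of that suggestion, assuming nothing about the Lucene API: each document carries a tokenized copy of the title for searching and a second, untokenized copy that serves only as the sort key. SortFieldSketch, Doc, titleTerms, and titleSort are hypothetical names. It shows why the extra field helps: a tokenized field yields several terms per document, while the untokenized copy gives exactly one value to order by.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

final class SortFieldSketch {
    /** Hypothetical document with two copies of the same title. */
    static final class Doc {
        final List<String> titleTerms; // tokenized copy: several terms, used for matching
        final String titleSort;        // untokenized copy: one value, used only for ordering

        Doc(String title) {
            String lower = title.toLowerCase(Locale.ROOT);
            this.titleTerms = Arrays.asList(lower.split("\\s+"));
            this.titleSort = lower;
        }
    }

    /** Orders matching documents by the single untokenized sort value. */
    static List<Doc> sortByTitle(List<Doc> matches) {
        List<Doc> sorted = new ArrayList<>(matches);
        sorted.sort(Comparator.comparing((Doc d) -> d.titleSort));
        return sorted;
    }
}
```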

Why has it not been added?  There have been several committers quite 
active in the codebase (myself excluded).  If you wish for changes to 
be committed, perseverance and patience are key.  Keep lobbying, but do 
so kindly.  When there are viable alternatives (such as adding an 
untokenized field for sorting) then certainly there is less incentive 
to commit changes.  Lucene's codebase is pretty clean and tight - it is 
wise for us to be very selective about changes to it.


 Seems like many people agree that this is an important functionality 
of sorting.
Many do, but not all.  I'm -0 on this change, meaning I'm not veto'ing 
it, but I'm not actually for it given the performance issue.

Its just that I can't get permission to use customized libraries in 
our company.
No custom library is needed for you to add an untokenized field for 
sorting purposes.

Also, sorting is extensible.  Check out the Lucene in Action code, 
specifically the lia.extsearch.sorting.DistanceSortingTest class.

Maybe you could add your own custom sorting code that could do what you 
want without patching Lucene.

Is there any possibility this patch contributed by Aviran can be added 
to the actual release branch.
Keep lobbying - other committers may feel differently than I do about 
it and add it.

Erik


Hit 2 has score less than Hit 3

2004-12-13 Thread Vikas Gupta
I have come across a scenario where the hits returned are not sorted. Or
maybe they are sorted but the explanation is not correct.

Take a look at
http://cofferdam.cs.utexas.edu:8080/search.jsp?query=space+odyssey&hitsPerPage=10&hitsPerSite=0

Look at the top 3 results.

Score of Hit 1 is 1.0188559
Score of Hit 2 is 0.9934416
Score of Hit 3 is 1.0188559

I can't explain how the score of hit 2 can be less than that of hit 3. I thought
the hits that were returned were sorted.

One explanation is that the explanation of hit 2 is not correctly computed.
Has anyone encountered this before?

FYI, the docs corresponding to hits 1,2 and 3 have exactly the same
scoring fields(By scoring fields, I mean the fields used in the query).

Thanks for your time.

-Vikas




Re: HITCOLLECTOR+SCORE+DELIMMA

2004-12-13 Thread Erik Hatcher
On Dec 13, 2004, at 11:16 PM, Karthik N S wrote:
 time [ A simple search of 'handbags' returned 1,60,000 hits and time 
taken
was 440 secs ,in production Env  / May be our
 Coding is poor,But we are constantly improving the process ].
If your searches are taking 440 seconds, you have something more 
fundamentally wrong.  You are either doing some large 
wildcard/range/fuzzy expansions or you're accessing every document from 
all your hits.  Is the searcher.search() method taking that long?  I 
bet not.  Or rather is it the iteration over the Hits that is killing 
the search time, which is what I suspect?

We've emphasized numerous times that calling hits.doc(i) is a resource 
hit.  Don't do it for documents you aren't going to show.  To filter by 
score, use hits.score(i) first.

 { O/s Linux Gentoo , RAM 1GB, Lucene1.4.1,Appserver = Tomcat5, and
BlackDawn Java 1.4.2 with Args  -XX:+UseParallelGC for
 Garbage Collection  }
Please narrow your code down to a clean, succinct example that you can 
post.  It is difficult to help you without details of your code (but 
let me emphasize again - it needs to be clean and succinct so it is 
quick for us to get a handle on).

 To be One step in advance ,We also have an adjecent Fields 'Vendor
','Price' which we have to accordingly Compare
 Best/Poor/Least results . So We have to have to limit the hits
accordingly,since Lucene API does not provide any way to
 inject this limiting facility *prior* to getting the hits .
Ah, so you are accessing every document to get this field information.  
It is incorrect that you cannot filter prior to getting hits.  You have 
a couple of options in filtering by a field value - use a QueryFilter 
or simply AND a RangeQuery to the original query.

Erik


RE: HITCOLLECTOR+SCORE+DELIMMA

2004-12-13 Thread Karthik N S
Hi Erik


What exactly do u mean by this


We've emphasized numerous times that calling hits.doc(i) is a resource
hit.  Don't do it for documents you aren't going to show.  To filter by
score, use hits.score(i) first.

 I am bit Confused u mean to say Replace

   hits.doc(i)

by

  hits.score(i)



Also

 Ah, so you are accessing every document to get this field information.
 It is incorrect that you cannot filter prior to getting hits.  You have
 a couple of options in filtering by a field value - use a QueryFilter
. or simply AND a RangeQuery to the original query.


Since the portal we are building is an eCommerce one, we have to search for the
SearchWord across

  ( 7 ) x 1000 x  15000  documents, get the most relevant hits (wherever the
score is between 0.5 and 1.0)

  and then sort on the adjacent fields 'Vendors' and 'Price' in ASC order.


 In such a case we cannot use a RangeQuery without knowing beforehand what
exactly the consumer wants.


 Is it not possible to have a generalized Filter in future versions of the API,
to inject some minor factors prior to

 getting the Hits returned?


Thx in advance
Karthik



-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 14, 2004 3:44 PM
To: Lucene Users List
Subject: Re: HITCOLLECTOR+SCORE+DELIMMA






Re: HITCOLLECTOR+SCORE+DELIMMA

2004-12-13 Thread Erik Hatcher
On Dec 14, 2004, at 5:42 AM, Karthik N S wrote:
What exactly do u mean by this

We've emphasized numerous times that calling hits.doc(i) is a resource
hit.  Don't do it for documents you aren't going to show.  To filter 
by
score, use hits.score(i) first.
 I am bit Confused u mean to say Replace
   hits.doc(i)
by
  hits.score(i)
Here is some pseudo-code:
    start = 0 or the starting index for the page you want to display
    finish = last hits index you want to display
    for i = start; i < finish; i++
        if hits.score(i) within tolerance
            grab hits.doc(i)
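
The pseudo-code can be sketched in Java as follows. HitsView is a hypothetical stand-in that exposes only length(), score(i), and doc(i), mirroring the Hits methods discussed in this thread; PageOfHits and the String document type are likewise assumptions for illustration. The point is that the expensive doc(i) call happens only for hits on the requested page whose score passes the check.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the three Hits methods the loop needs; not the Lucene class. */
interface HitsView {
    int length();
    float score(int i);   // cheap: no stored fields are read
    String doc(int i);    // expensive: stands for loading the stored document
}

final class PageOfHits {
    /**
     * Returns the documents for one page of results, loading a document
     * only after its score has been checked against minScore.
     */
    static List<String> page(HitsView hits, int start, int pageSize, float minScore) {
        List<String> page = new ArrayList<>();
        int finish = Math.min(hits.length(), start + pageSize);
        for (int i = start; i < finish; i++) {
            if (hits.score(i) >= minScore) {   // check the score first
                page.add(hits.doc(i));         // only then load the document
            }
        }
        return page;
    }
}
```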
I'm working hard to be helpful here.  I'm running out of answers for 
you though.  You are ignoring my requests to actually post code.  If 
you want further assistance, show us *exactly* what you're doing.

  ( 7 ) x 1000 x  15000  documents , Get most of the Relevant His 
(Where
ever Score is between 0.5 to 1.0 )

  and then Sort the adjecent Fields 'Vendors' and 'Price' in ASC Order
 In such a case We cannot use RangeQuery without priorly knowing 
what
exactly the Consumer want's
See above.  I cannot help with this without actual code (succinct clear 
code!).  Lucene can sort and filter if you leverage it appropriately.  
Please grab a copy of Lucene in Action for lots of details on sorting 
and filtering.

 Is it not possible to have a Generalized Filter in further versions 
of API
, to Inject some minor factors prior to
 getting the Hits returned.
This already exists.  Please try it out.  There have been numerous 
posts about this topic.  Lucene in Action covers it.  Our source code 
download has examples.

Erik


Opinions: Using Lucene as a thin database

2004-12-13 Thread Kevin L. Cobb
I use Lucene as a legitimate search engine which is cool. But, I am also
using it as a simple database too. I build an index with a couple of
keyword fields that allows me to retrieve values based on exact matches
in those fields. This is all I need to do so it works just fine for my
needs. I also love the speed. The index is small enough that it is
wicked fast. Was wondering if anyone out there is doing the same, or if
there are any dissenting opinions on using Lucene for this purpose.

 

 

 



Re: sorting tokenized field

2004-12-13 Thread Praveen Peddi
Hi Erik,
Thanks a lot for your kind response. I appreciate the details.
What I meant by custom library is applying Aviran's patch to Lucene and
maintaining it, not adding an extra field. Adding an extra field was my last
option if I can't use the patch.

I did look at the extensible search, and in fact I wrote my own comparators
(IgnoreCaseStringComparator and another custom comparator) and they work
just fine. But I am not sure if these extensible search features help me in
sorting on a tokenized field w/o adding the extra field. For now, I will just
go for the extra field option, and later, if a more optimized solution is
built into Lucene, I can use that.

Praveen
- Original Message - 
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, December 13, 2004 3:01 PM
Subject: Re: sorting tokenized field




RE: Opinions: Using Lucene as a thin database

2004-12-13 Thread Kevin L. Cobb
I don't have the requirement to do range type select, i.e. the only
operator I would need is the equals. Select * from MY_TABLE where
MY_NUMERIC_FIELD = 80.

My fields that are searchable in my model are always type KEYWORD. I
believe this forces the match to be exact. So thinking about it in
anything other than equals terms, I believe, would be a mistake. 

In any case, I believe that the requirement to use Lucene as a thin DB
means that your requirements for your database select are fairly simple
and straightforward. 

KLCobb

 
 

-Original Message-
From: Akmal Sarhan [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, December 14, 2004 10:24 AM
To: Lucene Users List
Subject: Re: Opinions: Using Lucene as a thin database

that sounds very interesting but how do you handle queries like
select * from MY_TABLE where MY_NUMERIC_FIELD > 80

as far as I know you have only the range query so you will have to say

my_numeric_field:[80 TO ??]
but this would not work in the above-mentioned example or am I missing something?
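
One commonly used workaround, sketched under assumptions: Lucene range queries compare terms as strings, so numbers are indexed as fixed-width, zero-padded strings to make string order agree with numeric order. PaddedNumbers, WIDTH, and greaterThan are hypothetical helpers, and the sketch covers non-negative integers only.

```java
import java.util.Locale;

final class PaddedNumbers {
    /** Hypothetical fixed width; 10 digits is enough for any non-negative int. */
    static final int WIDTH = 10;

    /** Encodes a non-negative int so that string order matches numeric order. */
    static String pad(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", value);
    }

    /** True when term sorts strictly after lower, as an exclusive lower bound requires. */
    static boolean greaterThan(String term, String lower) {
        return term.compareTo(lower) > 0;
    }
}
```

Without padding, the string "9" sorts after "80", which is why a range over unpadded numeric terms gives surprising matches; with padding, pad(9) sorts before pad(80) as expected.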

regards

Akmal
On Tue, 14.12.2004, Praveen Peddi wrote at 16:07:
 We also use Lucene for a similar purpose, except that we index and store quite
 a few fields. In fact, I also update partial documents, as people suggested. I
 store all the indexed fields so I don't have to build the whole document
 again while updating a partial document. The reason we do this is
 speed. I found that a Lucene search on a million objects is 4 to 5 times
 faster than our Oracle queries (of course this might be due to our pitiful
 database design :) ). It works great so far. The only caveat that we had
 till now was incremental updates. But now I am implementing real-time
 updates so that the data in the Lucene index is almost always in sync with the
 data in the database. So now, our search does not go to the database at all.
 
 Praveen
 - Original Message - 
 From: Kevin L. Cobb [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Sent: Tuesday, December 14, 2004 9:40 AM
 Subject: Opinions: Using Lucene as a thin database
 
 



Re: Opinions: Using Lucene as a thin database

2004-12-13 Thread Erik Hatcher
On Dec 14, 2004, at 9:40 AM, Kevin L. Cobb wrote:
I use Lucene as a legitimate search engine which is cool. But, I am 
also
using it as a simple database too. I build an index with a couple of
keyword fields that allows me to retrieve values based on exact matches
in those fields. This is all I need to do so it works just fine for my
needs. I also love the speed. The index is small enough that it is
wicked fast. Was wondering if anyone out there was doing the same of it
there are any dissenting opinions on using Lucene for this purpose.
I use Lucene as the complete data storage for my blog at 
http://www.blogscene.org/erik - all HTTP requests map to a Lucene query 
(based on the path and optional query parameter).   I've been lame and 
have never put any caching in there.

I'm about to start a new project that really needs a relational 
database under the covers, but I'm cringing at the headaches involved 
compared to the joys of using Lucene.

Erik


Search HTML Files

2004-12-13 Thread Daniel Cortes
I've been trying the Lucene demo apps for searching HTML files. I
want to know what problems there are, or which options are not implemented,
in this web application.
Thanks



Re: HITCOLLECTOR+SCORE+DELIMA

2004-12-13 Thread Erik Hatcher
On Dec 13, 2004, at 1:16 AM, Karthik N S wrote:
So u say I have to Build a Filter to Collect all the Scores between 
the 2
Ranges [ 0.2f to 1.0f]
My message is being misinterpreted.  I said filter as a verb, not a 
noun.  :)  In other words, I was not intending to mean write a Filter - 
a Filter would not be able to filter on score.

so the API for the same would be
 Hits hit = search(Query query, Filter filtertoGetScore)
 But while writing the Filter  Score again depends on Hits   
Score =
hits.score(x);
Again, you cannot write a Filter (capital 'F') to deal with score.
Please re-read what I said below...
Hits are in descending score
order, so you may just want to use Hits and filter based on the score
provided by hits.score(i).
Iterate over Hits... when you encounter scores below your desired 
range, stop iterating.  Why is this simple procedure not good enough 
for what you are trying to achieve?
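
A minimal sketch of that procedure, assuming the scores arrive in descending order as they do from Hits. The float array stands in for successive hits.score(i) calls, and ScoreRangeScan is a hypothetical helper name. Because of the ordering, the loop can stop at the first score below the threshold instead of visiting every hit.

```java
final class ScoreRangeScan {
    /**
     * Counts the leading hits whose score is at least minScore.
     * The input must be sorted in descending order; the scan stops at the
     * first score below the threshold, so later hits are never examined.
     */
    static int countAtLeast(float[] descendingScores, float minScore) {
        int n = 0;
        while (n < descendingScores.length && descendingScores[n] >= minScore) {
            n++;
        }
        return n;
    }
}
```

The returned count is the number of hits worth displaying; indexes 0 to count - 1 are the ones inside the desired range.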

Erik


Re: LIMO problems

2004-12-13 Thread David Spencer
Daniel Cortes wrote:
Hi, I want to know what library do you use for search in PPT files?
I use this (native code):
http://chicago.sourceforge.net/xlhtml
POI support this?
thanks


RE: HITCOLLECTOR+SCORE+DELIMMA

2004-12-13 Thread Karthik N S
Hi Erik

Apologies...




 In this Mailed
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]
che.orgmsgNo=11254

 I have already told you that doc.get( ) was coming in batches for a mere
4000 hits, and this is happening in real

 time [ A simple search for 'handbags' returned 1,60,000 hits and the time taken
was 440 secs in the production environment. Maybe our

 coding is poor, but we are constantly improving the process ].


 { O/s Linux Gentoo , RAM 1GB, Lucene1.4.1,Appserver = Tomcat5, and
BlackDawn Java 1.4.2 with Args  -XX:+UseParallelGC for

 Garbage Collection  }


 To be one step ahead: we also have adjacent fields 'Vendor' and
'Price', on which we have to compare

 Best/Poor/Least results accordingly. So we have to limit the hits
accordingly, since the Lucene API does not provide any way to

 inject this limiting facility *prior* to getting the hits.


 [ Excuse me Nader Henein ,I am from a Lucene-Users Form  NOT in
Lucene-Developer's Form,

  So we expect a Least possible Help ]


With Warm Regards
Karthik




-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Monday, December 13, 2004 6:39 PM
To: Lucene Users List
Subject: Re: HITCOLLECTOR+SCORE+DELIMMA





RE: Sorting based on calculations at search time

2004-12-13 Thread Gurukeerthi Gurunathan
Ah! It makes sense now... Thanks for the clarification Hoss. I think
it'll work in my case as I need to perform this calculation for every
search...

-Guru. 

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Chris Hostetter
Sent: Friday, December 10, 2004 10:21 PM
To: Lucene Users List
Subject: RE: Sorting based on calculations at search time

: I believe you are talking about the boost factor for fields or
documents
: while searching. That does not apply in my case - maybe I am missing a
: point here.
: The weight field I was talking about is only for the calculation

Otis is suggesting that you set the boost of the document to be your
weight value.  That way Lucene will automatically do your multiplication
calculation when determining the score.

The downside of this is that I don't think there's any way to keep it
from influencing the score on every search, so it's not something you
could use only on some queries.


-Hoss

