search binning support

2006-10-11 Thread Yu-Hui Jin

Say I have N categories, and each item is assigned to one or more categories.
I want the search results to be counted against each of the categories.

I checked the Lucene in Action book, and this feature doesn't seem to exist.
So is there any plan to add binning to Lucene?

It looks like this involves modifying part of Lucene's implementation, so
that we can:

- specify which index field is used as the binning field.
- after we grab the doc-id list, perform N intersections just to get the
counts: each intersection is between the result doc-id list and the doc-id
list of all items assigned to a category.

Is there a better approach, or any optimizations to this?
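
Here's a rough, untested sketch of the kind of intersection counting I mean
(written against the Lucene 2.0-style API; the "category" field name is just
an example of the binning field):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class CategoryCounter {

    // Doc ids matching the user query, collected into a BitSet.
    static BitSet queryBits(IndexSearcher searcher, Query query, int maxDoc)
            throws IOException {
        final BitSet bits = new BitSet(maxDoc);
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                bits.set(doc);
            }
        });
        return bits;
    }

    // Doc ids of all items assigned to one value of the binning field.
    static BitSet categoryBits(IndexReader reader, String field, String value)
            throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        TermDocs td = reader.termDocs(new Term(field, value));
        try {
            while (td.next()) {
                bits.set(td.doc());
            }
        } finally {
            td.close();
        }
        return bits;
    }

    // One intersection per category; only the cardinality is kept.
    static int[] counts(IndexReader reader, IndexSearcher searcher, Query query,
                        String field, String[] categories) throws IOException {
        BitSet hits = queryBits(searcher, query, reader.maxDoc());
        int[] result = new int[categories.length];
        for (int i = 0; i < categories.length; i++) {
            BitSet intersection = (BitSet) hits.clone();
            intersection.and(categoryBits(reader, field, categories[i]));
            result[i] = intersection.cardinality();
        }
        return result;
    }
}

The per-category BitSets could obviously be built once and cached between
searches, since they only change when the index changes.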


thanks,

-Hui


Re: search binning support

2006-10-11 Thread Doron Cohen
Try searching for "facet" or "faceted browsing" in the Lucene mailing lists
and also in Solr, e.g.
http://www.nabble.com/forum/Search.jtp?forum=44&local=y&query=facet




Re: Ferret's changes

2006-10-11 Thread Chuck Williams

David Balmain wrote on 10/10/2006 08:53 PM:
 On 10/11/06, Chuck Williams [EMAIL PROTECTED] wrote:

 I personally would always store term vectors since I use a
 StandardTokenizer and Stemming. In this case highlighting matches in
 small documents is not trivial. Ferret's highlighter matches even
 sloppy phrase queries and phrases with gaps between the terms
 correctly. I couldn't do this without the use of term vectors.

I use stemming as well, but am not yet matching phrases like that. 
Perhaps term vectors will be useful to achieve this, although they come
at a high cost and it doesn't seem difficult or expensive to do the
matching directly on the text of small items.
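
For reference, this is roughly what a highlighter gets back from a stored
term vector (an untested sketch against the 2.0 API; it assumes the field
was indexed with Field.TermVector.WITH_POSITIONS_OFFSETS, otherwise there
are no offsets to read):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;

public class TermVectorDump {
    // Print every term in the doc's "smallBody" vector together with its
    // positions and character offsets -- the raw material a highlighter
    // needs to map matched terms back onto the original text.
    public static void dump(IndexReader reader, int docId) throws IOException {
        TermFreqVector tfv = reader.getTermFreqVector(docId, "smallBody");
        if (!(tfv instanceof TermPositionVector)) {
            return; // no positions/offsets stored for this field
        }
        TermPositionVector tpv = (TermPositionVector) tfv;
        String[] terms = tpv.getTerms();
        for (int i = 0; i < terms.length; i++) {
            int[] positions = tpv.getTermPositions(i);
            TermVectorOffsetInfo[] offsets = tpv.getOffsets(i);
            for (int j = 0; j < offsets.length; j++) {
                System.out.println(terms[i] + " pos=" + positions[j]
                        + " chars=[" + offsets[j].getStartOffset()
                        + "," + offsets[j].getEndOffset() + "]");
            }
        }
    }
}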

 I suppose it would be possible for the single conceptual field 'body' to
 be represented with two physical fields 'smallBody' and 'largeBody'
 where the former stores term vectors and the latter does not.

 If I really wanted to solve this problem I would use this solution. It
 is pretty easy to search multiple fields when I need to. Ferret's
 Query language even supports it:

smallBody|largeBody:"phrase to search for"

Couldn't agree more.  I have a number of extensions to Lucene's query
parser, including this for multiple fields:

{smallBody largeBody}:"phrase to search for"
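
With stock Lucene (without the parser extension), the same multi-field
phrase can be built programmatically; a rough sketch, assuming the words
passed in have already been analyzed/stemmed into the terms that were
indexed:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;

public class MultiFieldPhrase {
    // One PhraseQuery per physical field, OR'ed together, so a match in
    // either smallBody or largeBody satisfies the query.
    public static BooleanQuery build(String[] fields, String[] words) {
        BooleanQuery query = new BooleanQuery();
        for (int i = 0; i < fields.length; i++) {
            PhraseQuery phrase = new PhraseQuery();
            for (int j = 0; j < words.length; j++) {
                phrase.add(new Term(fields[i], words[j]));
            }
            query.add(phrase, BooleanClause.Occur.SHOULD);
        }
        return query;
    }
}

e.g. build(new String[] {"smallBody", "largeBody"},
           new String[] {"phrase", "search"})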


 In the end, I think the benefits of my model far outweigh the costs.
 For me, at least.

Based on the performance figures so far, it seems they do!  I think
dynamic term vectors have a substantial benefit, but they can easily be
implemented in a model where all field indexing properties are fixed.

Chuck





TestGermanStemFilter

2006-10-11 Thread Matthias Roßmann

Hello guys,

is the TestGermanStemFilter class referenced by any Ant script in Lucene
2.0.0? I ask because the TestGermanStemFilter class contains a hardcoded
path:

File dataDir = new File(System.getProperty("dataDir", "./bin"));
File testFile = new File(dataDir, "org/apache/lucene/analysis/de/data.txt");


Does anyone know?

Greets,
 Matthias.





Re: TestGermanStemFilter

2006-10-11 Thread Erik Hatcher

Yes, the Ant build file sets the dataDir system property:

$ grep dataDir *.xml
common-build.xml:  <sysproperty key="dataDir" file="src/test"/>






Re: search binning support

2006-10-11 Thread Joe R

We faced what might be a similar problem not too long ago.  Our app is supposed
to allow foldering -- i.e., a document may be in one or more folders that
the user creates and populates by hand or via query.  We used a simple B-tree
database from Berkeley JE and a hit collector that filters against that
database when selecting results.  We didn't go with an all-Lucene approach
because foldering is supposed to be responsive (the user should see the
document in the folder within ~5 seconds) and we have large catalogs; in
other words, we didn't want to modify and re-optimize the index very often.
This also let us do our own per-field stored-field implementation: another
Berkeley DB holds all our stored fields, and the Lucene index stores only a
single, small, non-Lucene document ID.  The hit collector pulls only that
small document ID, and only the fields needed for the results are fetched
from Berkeley.
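
A stripped-down, untested sketch of the shape of that collector (the
Berkeley lookup and the Lucene-doc-to-external-id mapping are stubbed out
as abstract pieces, since those are specific to our setup):

import java.util.BitSet;
import org.apache.lucene.search.HitCollector;

public abstract class FolderFilteringCollector extends HitCollector {

    // Stand-in for the Berkeley JE folder-membership lookup.
    public interface FolderMembership {
        boolean contains(String folder, String externalId);
    }

    private final String folder;
    private final FolderMembership membership;
    private final BitSet accepted = new BitSet();

    public FolderFilteringCollector(String folder, FolderMembership membership) {
        this.folder = folder;
        this.membership = membership;
    }

    public void collect(int doc, float score) {
        // Map the Lucene doc number to our small external id, then keep the
        // hit only if that id is in the requested folder.
        String externalId = resolveExternalId(doc);
        if (membership.contains(folder, externalId)) {
            accepted.set(doc);
        }
    }

    public BitSet getAccepted() {
        return accepted;
    }

    // Stand-in for reading the single small id we store per Lucene document.
    protected abstract String resolveExternalId(int doc);
}

The collector is passed to Searcher.search(query, collector), and the
accepted BitSet is what drives the result page.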


-j






Re: Ferret's changes

2006-10-11 Thread Ning Li

Actually not using single doc segments was only possible due to the fact
that I have constant field numbers, so both optimizations stem from this
one change...


Not using single doc segments can be done without constant field numbers... :-)

Ning




Lucene 2.0.1 release date

2006-10-11 Thread George Aroush
Hi folks,

What's the plan for Lucene 2.0.1 release date?

Thanks!

-- George Aroush

