search binning support
Say I have N categories, and each item is assigned to one or more categories. I want the search results to be counted against each of the categories. I checked the Lucene in Action book, and there doesn't seem to be such a feature. So is there any plan to add binning to Lucene? It looks like this involves modifying part of Lucene's implementation so that we can:

- specify which index field is used as the binning field.
- after we grab the doc-id list, perform N intersections just to get the counts: each intersection is performed on the result doc-id list and the doc-id list of all items assigned to a category.

Is there a better approach, or any optimizations to this?

thanks,

-Hui
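The N-intersections idea described above can be sketched with one cached bitset per category (one bit per internal doc id) intersected with the query's hit bitset. This is a minimal sketch in plain Java using java.util.BitSet; the class and method names are illustrative, not an existing Lucene API:

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

public class FacetCounts {

    // Count hits per category by intersecting the query's hit bitset
    // with a precomputed bitset of doc ids for each category.
    static Map<String, Integer> count(BitSet hits, Map<String, BitSet> categoryDocs) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, BitSet> e : categoryDocs.entrySet()) {
            BitSet overlap = (BitSet) e.getValue().clone(); // don't mutate the cached bitset
            overlap.and(hits);
            counts.put(e.getKey(), overlap.cardinality());
        }
        return counts;
    }

    public static void main(String[] args) {
        BitSet hits = new BitSet();
        hits.set(0); hits.set(2); hits.set(5);           // docs matched by the query

        Map<String, BitSet> cats = new LinkedHashMap<String, BitSet>();
        BitSet news = new BitSet(); news.set(0); news.set(1); news.set(2);
        BitSet blogs = new BitSet(); blogs.set(3); blogs.set(5);
        cats.put("news", news);
        cats.put("blogs", blogs);

        System.out.println(count(hits, cats));            // prints {news=2, blogs=1}
    }
}
```

Since the per-category bitsets depend only on the index, they can be built once per index version and reused across queries; the per-query cost is then N AND-plus-popcount passes, which is cheap compared with the search itself.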
Re: search binning support
Try searching for facet or faceted browsing in the Lucene mailing lists and also in Solr, e.g. http://www.nabble.com/forum/Search.jtp?forum=44local=yquery=facet

Yu-Hui Jin [EMAIL PROTECTED] wrote on 10/10/2006 17:27:55: Say I have N categories, each item is assigned to one or more categories. [...]

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Ferret's changes
David Balmain wrote on 10/10/2006 08:53 PM:

On 10/11/06, Chuck Williams [EMAIL PROTECTED] wrote: I personally would always store term vectors since I use a StandardTokenizer and stemming. In this case, highlighting matches in small documents is not trivial.

Ferret's highlighter correctly matches even sloppy phrase queries and phrases with gaps between the terms. I couldn't do this without the use of term vectors.

I use stemming as well, but am not yet matching phrases like that. Perhaps term vectors will be useful to achieve this, although they come at a high cost, and it doesn't seem difficult or expensive to do the matching directly on the text of small items. I suppose it would be possible for the single conceptual field 'body' to be represented by two physical fields, 'smallBody' and 'largeBody', where the former stores term vectors and the latter does not.

If I really wanted to solve this problem, I would use that solution. It is pretty easy to search multiple fields when I need to. Ferret's query language even supports it: smallBody|largeBody:phrase to search for

Couldn't agree more. I have a number of extensions to Lucene's query parser, including this for multiple fields: {smallBody largeBody}:phrase to search for

In the end, I think the benefits of my model far outweigh the costs. For me at least, anyway.

Based on the performance figures so far, it seems they do! I think dynamic term vectors have a substantial benefit, but they can easily be implemented in a model where all field indexing properties are fixed.

Chuck
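The split of one conceptual field into 'smallBody' and 'largeBody' means a search for the conceptual field must be expanded into an OR over the physical fields, which is what both quoted syntaxes do. A small sketch of that expansion, emitting a standard Lucene query string (class and method names are mine, for illustration):

```java
public class QueryExpand {

    // Expand one conceptual field into its physical fields, producing a
    // plain Lucene query string that ORs the same phrase across all of them.
    static String expand(String phrase, String... fields) {
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            if (sb.length() > 0) sb.append(" OR ");
            sb.append(field).append(":\"").append(phrase).append('"');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints smallBody:"phrase to search for" OR largeBody:"phrase to search for"
        System.out.println(expand("phrase to search for", "smallBody", "largeBody"));
    }
}
```

The same expansion could of course be done programmatically with a BooleanQuery of per-field PhraseQuery clauses instead of string rewriting; the string form just mirrors the query-language extensions quoted above.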
TestGermanStemFilter
Hello guys,

is the TestGermanStemFilter class referenced in any Ant script in Lucene 2.0.0? I ask because the TestGermanStemFilter class contains a hardcoded path:

File dataDir = new File(System.getProperty("dataDir", "./bin"));
File testFile = new File(dataDir, "org/apache/lucene/analysis/de/data.txt");

Does anyone know?

Greets, Matthias.
Re: TestGermanStemFilter
Yes, the Ant build file sets the dataDir system property:

$ grep dataDir *.xml
common-build.xml: <sysproperty key="dataDir" file="src/test"/>

On Oct 11, 2006, at 6:26 AM, Matthias Roßmann wrote: is the TestGermanStemFilter class referenced in any Ant script in Lucene 2.0.0? [...]
Re: search binning support
We faced what might be a similar problem not too long ago. Our app is supposed to allow foldering -- i.e., a document may be in one or more folders that the user creates and populates by hand or via query. We used a simple btree database from Berkeley JE and used a hit collector to filter against that database when selecting results.

We didn't go with an all-Lucene approach because the foldering is supposed to be responsive (the user should see the document in the folder within ~5 seconds) and we have large catalog sizes; in other words, we didn't want to modify and re-optimize the index very often.

This also allowed us to do our own per-field stored-field implementation: another Berkeley DB holds all our stored fields, and the Lucene index stores only a single, small, non-Lucene document ID. We pull only the small document ID for the hit collector, and only those fields needed for the results from Berkeley.

-j

--- Yu-Hui Jin [EMAIL PROTECTED] wrote: Say I have N categories, each item is assigned to one or more categories. [...]
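The collector-side filtering described above can be sketched as follows. The membership interface stands in for the Berkeley JE lookup, and the ids are the small external document IDs pulled from the index; all names are illustrative, not the poster's actual code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FolderFilteringCollector {

    // Stand-in for the Berkeley DB folder-membership lookup.
    interface FolderMembership {
        boolean contains(String externalId);
    }

    private final FolderMembership folder;
    private final List<String> kept = new ArrayList<String>();

    FolderFilteringCollector(FolderMembership folder) {
        this.folder = folder;
    }

    // Called once per hit with the small external document id pulled
    // from the index; only docs present in the folder are kept.
    void collect(String externalId) {
        if (folder.contains(externalId)) {
            kept.add(externalId);
        }
    }

    List<String> results() {
        return kept;
    }

    public static void main(String[] args) {
        Set<String> members = new HashSet<String>(Arrays.asList("d1", "d3"));
        FolderFilteringCollector c = new FolderFilteringCollector(members::contains);
        c.collect("d1"); c.collect("d2"); c.collect("d3");
        System.out.println(c.results()); // prints [d1, d3]
    }
}
```

Because the folder membership lives outside the index, adding a document to a folder is a single btree write rather than an index modify-and-reoptimize, which is what makes the ~5-second responsiveness target reachable at large catalog sizes.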
Re: Ferret's changes
Actually, not using single-doc segments was only possible due to the fact that I have constant field numbers, so both optimizations stem from this one change...

Not using single-doc segments can be done without constant field numbers... :-)

Ning
Lucene 2.0.1 release date
Hi folks,

What's the plan for the Lucene 2.0.1 release date?

Thanks!

-- George Aroush