Hey Everyone, I have been reading several threads about facet counts and category counts and was wondering if/how they might apply to searching for colors. Let's say that there is a Lucene index where each document corresponds to an image. In addition, each document contains the top 10 most frequently occurring colors (hex values) in its associated image, along with some other auxiliary data. So a few example documents might look like this:
    Field    Value
    IMAGE    1
    COLORS   F00000 FF0000 FFF000 FFFF00 FFFFF0 FFFFFF E00000 EE0000 EEE000 EEEE00
    PIXELS   10000
    FORMAT   JPG

    Field    Value
    IMAGE    2
    COLORS   F00000 F0F000 F0FF00 F0FFF0 F0FFFF FFFFFF E00000 E0E000 E0EE00 E0EEE0
    PIXELS   50000
    FORMAT   JPG

    Field    Value
    IMAGE    3
    COLORS   F00000 F00F00 F00FF0 F00FFF FFFFFF E00000 E0E000 E0EE00 E0EEE0 E0EEEE
    PIXELS   1000
    FORMAT   PNG

    Field    Value
    IMAGE    4
    COLORS   F00000 F00F00 F00FF0 F00FFF CCCCCC E00000 E0E000 E0EE00 E0EEE0 E0EEEE
    PIXELS   20000
    FORMAT   GIF

If I set up the COLORS field such that it is of type Field.Keyword, then I can construct a query COLORS:FFFFFF with a StandardAnalyzer that returns the first 3 documents. The additional bit of information I'd like to calculate/return is the list of colors those images share, in order of most frequent to least frequent, i.e.

    F00000, 3 occurrences
    E00000, 3 occurrences
    E0E000, 2 occurrences
    E0EE00, 2 occurrences
    E0EEE0, 2 occurrences
    FF0000, 1 occurrence
    ...

I could use a custom HitCollector to collect and order this information, but it would cause a sizable performance hit.

This thread from last year:

http://www.nabble.com/forum/ViewTopic.jtp?topic=266441&tview=threaded

seems to suggest creating a String[] of colors and then iterating over it with a QueryFilter, AND'ing the result set's bits with the filtered bits and sorting based on cardinality. That works in my tests with a limited number of colors, but my String[] of colors could contain 16.7M values, and will almost certainly contain at least several million.

I also read up on BitSets and Solr's use of them in this thread:

http://www.gossamer-threads.com/lists/lucene/java-user/35427?do=post_view_threaded

But I think the solutions suggested there assume several hundred facets (price, manufacturer, size, etc.), each one having a few hundred possible values. Those would make sense if I wanted to provide filtering based on FORMAT or number of PIXELS, but again, they don't seem to scale in the colors example.
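To make the rollup I'm after concrete, here is a minimal sketch in plain Java of the per-hit tally that a custom HitCollector would effectively have to do. It does not use Lucene at all: the class ColorFacetSketch, its methods, and the in-memory map of image to colors are hypothetical stand-ins for the index, so it says nothing about how this would perform against a real index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the index: image id -> its top 10 colors.
final class ColorFacetSketch {

    // The four example documents from this post.
    static Map<String, List<String>> exampleDocs() {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        docs.put("1", Arrays.asList("F00000", "FF0000", "FFF000", "FFFF00", "FFFFF0",
                                    "FFFFFF", "E00000", "EE0000", "EEE000", "EEEE00"));
        docs.put("2", Arrays.asList("F00000", "F0F000", "F0FF00", "F0FFF0", "F0FFFF",
                                    "FFFFFF", "E00000", "E0E000", "E0EE00", "E0EEE0"));
        docs.put("3", Arrays.asList("F00000", "F00F00", "F00FF0", "F00FFF", "FFFFFF",
                                    "E00000", "E0E000", "E0EE00", "E0EEE0", "E0EEEE"));
        docs.put("4", Arrays.asList("F00000", "F00F00", "F00FF0", "F00FFF", "CCCCCC",
                                    "E00000", "E0E000", "E0EE00", "E0EEE0", "E0EEEE"));
        return docs;
    }

    // For every image containing queryColor, count its other colors and
    // return them ordered from most to least frequent.
    static List<Map.Entry<String, Integer>> sharedColorCounts(
            Map<String, List<String>> docs, String queryColor) {
        // LinkedHashMap keeps first-seen order, which the stable sort below
        // preserves among colors with equal counts.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> colors : docs.values()) {
            if (!colors.contains(queryColor)) {
                continue; // not a hit for this query
            }
            for (String color : colors) {
                if (!color.equals(queryColor)) {
                    counts.merge(color, 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return sorted;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : sharedColorCounts(exampleDocs(), "FFFFFF")) {
            System.out.println(e.getKey() + ", " + e.getValue() + " occurrence(s)");
        }
    }
}
```

For the query color FFFFFF this visits images 1 through 3 and puts F00000 and E00000 first with 3 occurrences each, followed by E0E000, E0EE00 and E0EEE0 with 2 each. The work is proportional to the number of hits times 10 colors, which is exactly the cost I'm worried about when the hit list is large.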
Am I mistaken in thinking that my problem is similar to faceting/categorization/rollup, just on a different scale? I suppose I could also structure my document such that it looked like this:

    Field    Value
    IMAGE    1
    COLOR    F00000
    COLOR    FF0000
    COLOR    FFF000
    COLOR    FFFF00
    COLOR    FFFFF0
    COLOR    FFFFFF
    COLOR    E00000
    COLOR    EE0000
    COLOR    EEE000
    COLOR    EEEE00
    PIXELS   10000
    FORMAT   JPG

That would allow me to do something with TermEnums, but I don't see how it would be any better than using a HitCollector. Does this document structure make things better, worse, or the same compared to the initial document structure?

If the answer to my question is trivial, would anything change if I wanted to weight the colors? For example, if the initial document looked like:

    Field    Value
    IMAGE    1
    COLORS   10-F00000 6-FF0000 3-FFF000 9-FFFF00 7-FFFFF0 1-FFFFFF 2-E00000 4-EE0000 8-EEE000 5-EEEE00
    PIXELS   10000
    FORMAT   JPG

Here the numeric prefix is a rank, so FFFFFF (rank 1) is the most dominant of the top 10 colors and F00000 (rank 10) is the least. Would that be a terrible document structure?

Thanks in advance, especially for wading through this entire post :o)

JAMES