Hey Everyone,

I have been reading several threads about facet counts
and category counts and was wondering if/how they
might apply to searching for colors. Let's say that
there is a Lucene index where each document
corresponds to an image. In addition, each document
contains the top 10 most frequently occurring colors 
(hex values) in its associated image along with some
other axillary data. So a few example documents might
look like this:

Field   Value
IMAGE   1
COLORS  F00000 FF0000 FFF000 FFFF00 FFFFF0 FFFFFF
E00000 EE0000 EEE000 EEEE00
PIXELS  10000
FORMAT  JPG

Field   Value
IMAGE   2
COLORS  F00000 F0F000 F0FF00 F0FFF0 F0FFFF FFFFFF
E00000 E0E000 E0EE00 E0EEE0
PIXELS  50000
FORMAT  JPG

Field   Value
IMAGE   3
COLORS  F00000 F00F00 F00FF0 F00FFF FFFFFF E00000
E0E000 E0EE00 E0EEE0 E0EEEE
PIXELS  1000
FORMAT  PNG

Field   Value
IMAGE   4
COLORS  F00000 F00F00 F00FF0 F00FFF CCCCCC E00000
E0E000 E0EE00 E0EEE0 E0EEEE
PIXELS  20000
FORMAT  GIF

If I setup the COLORS field such that it is of type
Field.Keyword, then I can construct a query
COLORS:FFFFFF with a StandardAnalyzer that returns the
first 3 documents. The additional bit of information
I'd like to calculate/return is the list of colors the
images share, in order of most frequent to least
frequent i.e.

F00000, 3 occurrences
E00000, 3 occurrences
E0E000, 2 occurrences
E0EE00, 2 occurrences
FF0000, 1 occurrence
...

I could use a custom HitCollector to collect and order
this information, but it would cause a sizable
performance hit.This thread from last year:
http://www.nabble.com/forum/ViewTopic.jtp?topic=266441&tview=threaded

Seems to suggest the creation of a String[] of colors
and then iterate over it using a QueryFilter, AND'ing
the result sets bits with the filtered bits, sorting
based on cardinality. That works in my tests with a
limited number of colors, but my String[] of colors
could contain 16.7M values, and most certainly will
contain at least several million.

I read up on BitSets and Solr's use of them as well in
this thread:
http://www.gossamer-threads.com/lists/lucene/java-user/35427?do=post_view_threaded

But I think that the solutions suggested assume
several hundred facets i.e. price, manufacturer, size,
etc. each one having a few hundred possible values.
Those would make sense so that I could provide
filtering based on FORMAT or number of PIXELS, but
again, don't seem to scale in the colors example.

Am I mistaken in thinking that my problem is similar
to faceting/categorization/rollup, just on a different
scale? I suppose I could also structure my document
such that it looked like this:

Field   Value
IMAGE   1
COLOR   F00000
COLOR   FF0000
COLOR   FFF000
COLOR   FFFF00
COLOR   FFFFF0
COLOR   FFFFFF
COLOR   E00000
COLOR   EE0000
COLOR   EEE000
COLOR   EEEE00
PIXELS  10000
FORMAT  JPG

That would allow me to do something with TermEnums,
but I don't see how it would be any better than using
a HitCollector. Does the above document structure make
things any better/worse/the same compared to the
initial document structure? If the answer to my
question is trivial would anything change if I wanted
to weight the colors. For example, if the initial
document looked like:

Field   Value
IMAGE   1
COLORS  10-F00000 6-FF0000 3-FFF000 9-FFFF00 7-FFFFF0
1-FFFFFF 2-E00000 4-EE0000 8-EEE000 5-EEEE00
PIXELS  10000
FORMAT  JPG

That would be interpreted as FFFFFF is the most
dominant of the top 10 most dominant colors and F00000
is the least. Would that be a terrible document
structure?

Thanx in advance, especially for wading through this
entire post :o)

JAMES

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to