> First off, let me clear up somethign regarding your > index field structure, > you mentioned that you currently have documents that > look like this... > > : IMAGE 1 > : COLORS F00000 FF0000 FFF000 FFFF00 FFFFF0 FFFFFF > : E00000 EE0000 EEE000 EEEE00 > > If you are indexing it as Field.Keyword and you can > query for > COLORS:FFFFFF and get results, then either you are > only only getting > documents that are 100% white, or when you indexed > the Documents you added > eeach collor as a seperate field instance
I thought that having: F00000 FF0000 FFFFFF FFFF00... in one field and then searching for FFFFFF in it would match all documents that contain that "word" so I would be finding documents that could be < 100% white. In either case, I don't actually have a document structure yet ;o) It seems though that you are recommending this document structure: Field Value IMAGE 1 COLOR FF0000 COLOR 00FF00 COLOR 0000FF COLOR 808080 And then if this image were 50% Red, 25% Green, 12.5% Blue and 12.5% Grey the document would look like this: Field Value IMAGE 1 COLOR FF0000 COLOR FF0000 COLOR FF0000 COLOR FF0000 COLOR 00FF00 COLOR 00FF00 COLOR 0000FF COLOR 808080 Is that correct? I wrote a unit test to compare term document frequencies while iterating through the TermEnum object returned by IndexReader.terms() and the counts were equal. I guess I am still not clear on what the differences/advantages/disadvantages are between the above an document and one that like this: Field Value IMAGE 1 COLORS FF0000 FF0000 FF0000 FF0000 00FF00 00FF00 0000FF 808080 In fact with all the color values stored in a single field, if I did some sort of pixel region mapping/resolution reduction then wouldn't it lend itself to Span/Fuzzy queries so I could not only search for images which have both red and green, but rank them based on how close the colors are to each other? Would I lose that capability if I stored each color in its own field? So perhaps a more general question is when is it better to collapse a bunch of words into a single field vs. spread them out over many fields, all of which have the same field name? > If you index both the full color codes as well as > just the most > significant hex characters from the RGB code in > seperate fields, you can > do "coarse" faceting on one field, and "refined" > faceting on the other... > > IMAGE: 1 > COLOR: FFFFFE FFA3B4 2287C3 773442 666BED > COARSE_COLOR: FFF FAB 28C 734 66E That's a great idea and seems way more straightforward than the RGB/HSV etc. distance calculation algorithms I've been reading about :o) I'll have to run some tests to see how accurate that reduction appears to people. > Usually, you need to do one pass over your values > and one pass over > your documents -- with a HitCollector you do a pass > over your values first > Alternately, you can do one pass over > your documents building > a BitSet of the docs that match, and *then* do a > pass over the values Huh, perhaps I don't understand the HitCollector fully. Are you saying that if I have an index with 100 documents, each of which have a color field (let's say 25 of the documents have FFFFFF in the color field) and I do a search for FFFFFF...using a HitCollector I'm iterating over all 100 documents while extracting values whereas with the regular Hits based search I would only be iterating over 25? Thanx again. JAMES __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]