Thanks, everyone, for your answers. I will probably write a simple parametric test that fills a Solr index with those integers of very limited range and then sorts by vector distance, to see what the performance characteristics look like.
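
A minimal sketch of how such a parametric test could be set up. It assumes hypothetical integer fields named m0_i, m1_i, ... (the _i suffix is assumed to map to an integer field type in the schema) and uses Solr's dist function query for the sort. The helper names make_docs and distance_sort are illustrative only.

```python
import json
import random


def make_docs(num_docs, num_metrics, max_value=255, seed=0):
    """Generate test documents with small-range integer metrics.

    Field names (m0_i, m1_i, ...) are hypothetical and must match
    whatever integer fields the schema actually defines.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    docs = []
    for doc_id in range(num_docs):
        doc = {"id": str(doc_id)}
        for m in range(num_metrics):
            doc[f"m{m}_i"] = rng.randint(0, max_value)
        docs.append(doc)
    return docs


def distance_sort(query_vector, power=2):
    """Build a sort clause using Solr's dist(power, a..., b...) function.

    power=2 gives the Euclidean distance between the document's metric
    fields and the constant query vector.
    """
    fields = [f"m{m}_i" for m in range(len(query_vector))]
    args = [str(power)] + fields + [str(v) for v in query_vector]
    return "dist(" + ",".join(args) + ") asc"


if __name__ == "__main__":
    docs = make_docs(num_docs=3, num_metrics=4)
    print(json.dumps(docs[0]))           # one document, ready for indexing
    print(distance_sort([10, 20, 30, 40]))  # value for the sort parameter
```

The number of documents and the number of metrics per document are the two parameters to vary. The generated documents would be posted to the index, and the sort clause passed as the sort parameter of a query such as q=*:*.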

On Sun, Oct 18, 2015 at 9:08 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:

> Robert,
> From what I know, the inverted index as well as docValues compress
> content a lot; even stored fields are compressed too. So I think you
> have a good chance of experimenting successfully. You might need to
> tweak the schema to disable storing unnecessary info in the index.
>
> On Sat, Oct 17, 2015 at 1:15 AM, Robert Krüger <krue...@lesspain.de> wrote:
>
> > Thanks for the feedback.
> >
> > What I am trying to do is to "abuse" integers to store 8-bit (or even
> > lower) values of metrics I use for content-based image/video search
> > (such as statistical values regarding color distribution) and then
> > implement similarity calculations based on formulas using vector
> > distances. The index can become large (tens of millions of documents,
> > each with, say, 50-100 integers describing the image metrics). I am
> > looking at using a part of those metrics for selecting a subset of
> > images using range queries and then more for sorting the result set by
> > relevance.
> >
> > I was first looking at implementing those metrics as binary fields (see
> > other posting) and then using a custom function for the distance
> > calculation, but so far I got the impression that this way is not
> > supported really well by Solr. Base64 encoding/decoding would kill
> > performance, and implementing a custom field type with all that is
> > probably required for that to work properly is currently beyond my Solr
> > knowledge. Besides, using built-in Solr features makes it easier to
> > fine-tune/experiment with different approaches, because I can just play
> > around with different queries and see what works best, without
> > adjusting a custom function each time.
> >
> > I hope that provides a better picture of what I am trying to achieve.
> >
> > Best,
> >
> > Robert
> >
> > On Fri, Oct 16, 2015 at 4:50 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> > > Under the covers, Lucene stores ints in a packed format, so I'd just
> > > count on that for a first pass.
> > >
> > > What is "a lot of integer values"? Hundreds of millions? Billions?
> > > Trillions?
> > >
> > > Unless you give us some indication of scale, it's hard to say anything
> > > helpful. But unless you have some evidence that you're going to blow
> > > out memory I'd just ignore the "wasted" bits. Especially if you can
> > > use docValues, that option holds much of the underlying data in
> > > MMapDirectory that uses swappable OS memory....
> > >
> > > Best,
> > > Erick
> > >
> > > On Fri, Oct 16, 2015 at 1:53 AM, Robert Krüger <krue...@lesspain.de> wrote:
> > > > Hi,
> > > >
> > > > I have a data model where I would store and index a lot of integer
> > > > values with a very restricted range (e.g. 0-255), so theoretically
> > > > the 32 bits of Solr's integer fields are complete overkill. I want
> > > > to be able to do things like vector distance calculations on those
> > > > fields. Should I worry about the "wasted" bits or will Solr
> > > > compress/organize the index in a way that compensates for this if
> > > > there are only 256 (or even fewer) distinct values?
> > > >
> > > > Any recommendations on how my fields should be defined to make
> > > > things like numeric functions work as fast as technically possible?
> > > >
> > > > Thanks in advance,
> > > >
> > > > Robert
> >
> > --
> > Robert Krüger
> > Managing Partner
> > Lesspain GmbH & Co. KG
> >
> > www.lesspain-software.com
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> <mkhlud...@griddynamics.com>

--
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com