Thanks, everyone, for your answers. I will probably write a simple parametric
test that fills a Solr index with those integers of very limited range
and then sorts by vector distance to see what the performance
characteristics are.
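
A minimal sketch of what such a test could generate, assuming hypothetical
field names m0_i, m1_i, ... (relying on a "*_i" dynamic integer field being
defined in the schema) and Solr's dist() function query for the sort. It only
builds the documents and the sort parameter; sending them to Solr is left out.

```python
import random


def make_docs(num_docs, num_metrics, max_value=255, seed=42):
    """Build documents whose metric fields hold integers in [0, max_value].

    The field names (m0_i, m1_i, ...) are hypothetical; they assume a
    "*_i" dynamic integer field exists in the schema.
    """
    rng = random.Random(seed)  # fixed seed so test runs are repeatable
    docs = []
    for i in range(num_docs):
        doc = {"id": str(i)}
        for m in range(num_metrics):
            doc[f"m{m}_i"] = rng.randint(0, max_value)
        docs.append(doc)
    return docs


def distance_sort(query_vector, power=2):
    """Build a sort parameter using Solr's dist(power, a..., b...) function.

    power=2 gives the Euclidean distance between the document's metric
    fields and the constant query vector.
    """
    fields = [f"m{m}_i" for m in range(len(query_vector))]
    args = [str(power)] + fields + [str(v) for v in query_vector]
    return "dist(" + ",".join(args) + ") asc"


if __name__ == "__main__":
    docs = make_docs(num_docs=3, num_metrics=4)
    print(docs[0])
    print(distance_sort([10, 200, 30, 40]))
```

The number of documents and the number of metrics per document are the
parameters to vary when comparing timings.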

On Sun, Oct 18, 2015 at 9:08 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Robert,
> From what I know, both the inverted index and docValues compress content
> heavily, and stored fields are compressed too. So I think you have a good
> chance of experimenting successfully. You might need to tweak the schema to
> disable storing unnecessary information in the index.
>
> On Sat, Oct 17, 2015 at 1:15 AM, Robert Krüger <krue...@lesspain.de>
> wrote:
>
> > Thanks for the feedback.
> >
> > What I am trying to do is to "abuse" integers to store 8bit (or even
> lower)
> > values of metrics I use for content-based image/video search (such as
> > statistical values regarding color distribution) and then implement
> > similarity calculations based on formulas using vector distances. The
> Index
> > can become large (tens of millions of documents each with say 50-100
> > integers  describing the image metrics). I am looking at using a part of
> > those metrics for selecting a subset of images using range queries and
> then
> > more for sorting the result set by relevance.
> >
> > I was first looking at implementing those metrics as binary fields (see
> > other posting) and then use a custom function for the distance
> calculation
> > but so far I got the impression that way is not supported really well by
> > Solr. Base64-En/Decoding would kill performance and implementing a custom
> > field type with all that is probably required for that to work properly
> is
> > currently beyond my Solr knowledge. Besides, using built-in Solr features
> > makes it easier to fine-tune/experiment with different approaches,
> because I
> > can just play around with different queries and see what works best,
> > without each time adjusting a custom function.
> >
> > I hope that provides a better picture of what I am trying to achieve.
> >
> > Best,
> >
> > Robert
> >
> > On Fri, Oct 16, 2015 at 4:50 PM, Erick Erickson <erickerick...@gmail.com
> >
> > wrote:
> >
> > > Under the covers, Lucene stores ints in a packed format, so I'd just
> > count
> > > on that for a first pass.
> > >
> > > What is "a lot of integer values"? Hundreds of millions? Billions?
> > > Trillions?
> > >
> > > Unless you give us some indication of scale, it's hard to say anything
> > > helpful. But unless you have some evidence that you're going to blow out
> > > memory I'd just ignore the "wasted" bits. Especially if you can use
> > > docValues,
> > > that option holds much of the underlying data in MMapDirectory
> > > that uses swappable OS memory....
> > >
> > > Best,
> > > Erick
> > >
> > > On Fri, Oct 16, 2015 at 1:53 AM, Robert Krüger <krue...@lesspain.de>
> > > wrote:
> > > > Hi,
> > > >
> > > > I have a data model where I would store and index a lot of integer
> > values
> > > > with a very restricted range (e.g. 0-255), so theoretically the 32
> bits
> > > of
> > > > Solr's integer fields are complete overkill. I want to be able to do
> > > things
> > > > like vector distance calculations on those fields. Should I worry
> about
> > > the
> > > > "wasted" bits or will Solr compress/organize the index in a way that
> > > > compensates for this if there are only 256 (or even fewer) distinct
> > > values?
> > > >
> > > > Any recommendations on how my fields should be defined to make things
> > > like
> > > > numeric functions work as fast as technically possible?
> > > >
> > > > Thanks in advance,
> > > >
> > > > Robert
> > >
> >
> >
> >
> > --
> > Robert Krüger
> > Managing Partner
> > Lesspain GmbH & Co. KG
> >
> > www.lesspain-software.com
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> <mkhlud...@griddynamics.com>
>



-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com
