Re: Efficiency of integer storage/use

Upayavira Wed, 21 Oct 2015 07:23:56 -0700

What I'd say is that there are *substantial* optimisations done already
when indexing terms, especially numerical ones, e.g. looking for common
divisors. Look out for a talk by Adrien Grand at Berlin Buzzwords
earlier this year for a taste of it.
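
As a rough illustration of the common-divisor idea (a minimal sketch in Python, not Lucene's actual code; the names gcd_encode and gcd_decode are made up for this example): if every value in a block is an offset from the minimum by a multiple of some divisor, only the small quotients need to be stored.

```python
from functools import reduce
from math import gcd


def gcd_encode(values):
    """Return (minimum, divisor, quotients) with value = minimum + divisor * quotient."""
    if not values:
        return 0, 1, []
    minimum = min(values)
    deltas = [v - minimum for v in values]
    # If all values are equal, every delta is 0 and the gcd is 0; fall back to 1.
    divisor = reduce(gcd, deltas, 0) or 1
    return minimum, divisor, [d // divisor for d in deltas]


def gcd_decode(minimum, divisor, quotients):
    """Rebuild the original values from the encoded form."""
    return [minimum + divisor * q for q in quotients]
```

For the values 1000, 3000, 7000, 5000 this stores the quotients 0, 1, 3, 2 plus one minimum and one divisor, so each value needs 2 bits instead of 13.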


I don't know how much of this kind of optimisation has been done on doc
values (e.g. when committing a set of integer values into a new segment,
looking at how big these values are and setting the number of bits in the
array accordingly). I suspect not much yet.
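
To make that parenthetical concrete, here is a minimal sketch in Python of choosing the bit width from the largest value and packing accordingly. It only illustrates the idea and assumes non-negative values; it is not how Lucene lays out doc values on disk, and bits_required, pack and unpack are hypothetical names.

```python
def bits_required(max_value):
    """Number of bits needed to represent any value in 0..max_value (at least 1)."""
    return max(1, max_value.bit_length())


def pack(values):
    """Pack non-negative ints into one big int, using the narrowest fixed width."""
    width = bits_required(max(values, default=0))
    packed = 0
    for i, v in enumerate(values):
        if v < 0:
            raise ValueError("only non-negative values are supported in this sketch")
        packed |= v << (i * width)  # value i occupies bits [i*width, (i+1)*width)
    return width, len(values), packed


def unpack(width, count, packed):
    """Recover the original list from the packed representation."""
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]
```

With values limited to 0-255, bits_required gives 8, so each value would take a quarter of the space of a full 32-bit int.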

If you want to make these sorts of optimisations, I'd suggest looking at
how doc values are encoded on disk, and seeing if you can make changes that
would benefit all users across the board.

Upayavira

On Wed, Oct 21, 2015, at 08:52 AM, Robert Krüger wrote:
> Thanks, everyone, for your answers. I will probably make a simple
> parametric test, pumping a Solr index full of those integers with very
> limited range and then sorting by vector distances, to see what the
> performance characteristics are.
> 
> On Sun, Oct 18, 2015 at 9:08 PM, Mikhail Khludnev
> <mkhlud...@griddynamics.com> wrote:
> 
> > Robert,
> > From what I know, the inverted index and docvalues both compress their
> > content heavily, and stored fields are compressed too. So I think you
> > have a good chance of experimenting successfully. You might need to tweak
> > the schema to disable storing unnecessary info in the index.
> >
> > On Sat, Oct 17, 2015 at 1:15 AM, Robert Krüger <krue...@lesspain.de>
> > wrote:
> >
> > > Thanks for the feedback.
> > >
> > > What I am trying to do is to "abuse" integers to store 8-bit (or even
> > > lower) values of metrics I use for content-based image/video search
> > > (such as statistical values regarding color distribution) and then
> > > implement similarity calculations based on formulas using vector
> > > distances. The index can become large (tens of millions of documents,
> > > each with, say, 50-100 integers describing the image metrics). I am
> > > looking at using a part of those metrics for selecting a subset of
> > > images using range queries and then more of them for sorting the result
> > > set by relevance.
> > >
> > > I was first looking at implementing those metrics as binary fields (see
> > > other posting) and then using a custom function for the distance
> > > calculation, but so far I got the impression that this approach is not
> > > supported really well by Solr. Base64 encoding/decoding would kill
> > > performance, and implementing a custom field type with all that is
> > > probably required for that to work properly is currently beyond my Solr
> > > knowledge. Besides, using built-in Solr features makes it easier to
> > > fine-tune/experiment with different approaches, because I can just play
> > > around with different queries and see what works best, without adjusting
> > > a custom function each time.
> > >
> > > I hope that provides a better picture of what I am trying to achieve.
> > >
> > > Best,
> > >
> > > Robert
> > >
> > > On Fri, Oct 16, 2015 at 4:50 PM, Erick Erickson
> > > <erickerick...@gmail.com> wrote:
> > >
> > > > Under the covers, Lucene stores ints in a packed format, so I'd just
> > > > count on that for a first pass.
> > > >
> > > > What is "a lot of integer values"? Hundreds of millions? Billions?
> > > > Trillions?
> > > >
> > > > Unless you give us some indication of scale, it's hard to say anything
> > > > helpful. But unless you have some evidence that you're going to blow out
> > > > memory, I'd just ignore the "wasted" bits, especially if you can use
> > > > docValues: that option holds much of the underlying data in
> > > > MMapDirectory, which uses swappable OS memory...
> > > >
> > > > Best,
> > > > Erick
> > > >
> > > > On Fri, Oct 16, 2015 at 1:53 AM, Robert Krüger <krue...@lesspain.de>
> > > > wrote:
> > > > > Hi,
> > > > >
> > > > > I have a data model where I would store and index a lot of integer
> > > > > values with a very restricted range (e.g. 0-255), so theoretically the
> > > > > 32 bits of Solr's integer fields are complete overkill. I want to be
> > > > > able to do things like vector distance calculations on those fields.
> > > > > Should I worry about the "wasted" bits, or will Solr compress/organize
> > > > > the index in a way that compensates for this if there are only 256 (or
> > > > > even fewer) distinct values?
> > > > >
> > > > > Any recommendations on how my fields should be defined to make things
> > > > > like numeric functions work as fast as technically possible?
> > > > >
> > > > > Thanks in advance,
> > > > >
> > > > > Robert
> > > >
> > >
> > >
> > >
> > > --
> > > Robert Krüger
> > > Managing Partner
> > > Lesspain GmbH & Co. KG
> > >
> > > www.lesspain-software.com
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > <http://www.griddynamics.com>
> > <mkhlud...@griddynamics.com>
> >
> 
> 
> 
> -- 
> Robert Krüger
> Managing Partner
> Lesspain GmbH & Co. KG
> 
> www.lesspain-software.com
