Re: Efficiency of integer storage/use

2015-10-21 Thread Upayavira
What I'd say is that there are *substantial* optimisations done already when indexing terms, especially numerical ones, e.g. looking for common divisors. Look out for a talk by Adrien Grand at Berlin Buzzwords earlier this year for a taste of it. I don't know how much of this kind of optimisation
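To illustrate the "common divisors" idea mentioned above, here is a minimal sketch of a GCD-style encoding: subtract the minimum, divide every offset by their greatest common divisor, and store only the small quotients. The function names (`gcd_encode`, `gcd_decode`, `bits_required`) are made up for this example, and this is not Lucene's actual encoder, only the general technique.

```python
from functools import reduce
from math import gcd


def gcd_encode(values):
    """Return (base, divisor, quotients) with value == base + divisor * quotient.

    Expects a non-empty list of integers.
    """
    base = min(values)
    offsets = [v - base for v in values]
    # gcd of all offsets; fall back to 1 when every offset is 0
    divisor = reduce(gcd, offsets, 0) or 1
    return base, divisor, [o // divisor for o in offsets]


def gcd_decode(base, divisor, quotients):
    """Invert gcd_encode."""
    return [base + divisor * q for q in quotients]


def bits_required(max_value):
    """Bits needed to represent any value in 0..max_value (at least 1)."""
    return max(1, max_value.bit_length())


# Hypothetical sample data: all values are multiples of 1000.
values = [1000, 3000, 2000, 5000]
base, divisor, quotients = gcd_encode(values)
print(base, divisor, quotients)
print(bits_required(max(values)), "bits raw vs", bits_required(max(quotients)), "bits encoded")
```

For this sample the quotients fit in 3 bits each instead of the 13 bits needed for the raw values; the base and divisor are stored once for the whole block.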

Re: Efficiency of integer storage/use

2015-10-21 Thread Robert Krüger
Thanks, everyone, for your answers. I will probably write a simple parametric test that fills a Solr index with those integers of very limited range and then sorts by vector distances, to see what the performance characteristics are. On Sun, Oct 18, 2015 at 9:08 PM, Mikhail Khludnev <

Re: Efficiency of integer storage/use

2015-10-18 Thread Mikhail Khludnev
Robert, from what I know, both the inverted index and docvalues compress content heavily, and stored fields are compressed too. So I think you have a good chance of experimenting successfully. You might need to tweak the schema to disable storing unnecessary information in the index. On Sat, Oct 17, 2015 at 1:15 AM, Robert

Re: Efficiency of integer storage/use

2015-10-18 Thread Erick Erickson
On the surface this seems like something of a distraction. 10M docs x 100 values/doc = 1B integers, assuming all need to be held in memory at once. My straw-man proposal: it would be much cheaper to just provision each JVM with an additional couple of gigabytes of memory and forget about it. Feel free to
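The arithmetic above can be checked with a few lines. This is a back-of-the-envelope sketch that only counts raw value bits; it says nothing about what Lucene or the JVM would actually allocate, and the helper name `raw_size_bytes` is invented for the example.

```python
DOCS = 10_000_000        # 10M docs, as in the message above
VALUES_PER_DOC = 100     # 100 values per doc


def raw_size_bytes(num_values, bits_per_value):
    """Raw payload size in bytes, rounded up, ignoring any overhead."""
    return (num_values * bits_per_value + 7) // 8


total_values = DOCS * VALUES_PER_DOC  # 1,000,000,000 integers

for bits in (8, 32, 64):
    size_gib = raw_size_bytes(total_values, bits) / 2**30
    print(f"{bits:2d} bits/value: {size_gib:.2f} GiB")
```

At 8 bits per value the raw payload is about 1 GB, and at a full 32 bits it is about 4 GB, which is the order of magnitude behind the "couple of gigabytes" estimate.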

Re: Efficiency of integer storage/use

2015-10-18 Thread Jack Krupansky
I'd still like to see a very clear statement of how data is stored in Lucene. For example, is there any increase in index size if you placed your 32-bit integers in a long field? Could somebody make a clear statement about what the index packing/compression would actually do - not the actual

Re: Efficiency of integer storage/use

2015-10-17 Thread Robert Krüger
Thanks for the feedback. What I am trying to do is to "abuse" integers to store 8-bit (or even smaller) values of metrics I use for content-based image/video search (such as statistical values regarding color distribution) and then implement similarity calculations based on formulas using vector

Efficiency of integer storage/use

2015-10-16 Thread Robert Krüger
Hi, I have a data model where I would store and index a lot of integer values with a very restricted range (e.g. 0-255), so theoretically the 32 bits of Solr's integer fields are complete overkill. I want to be able to do things like vector distance calculations on those fields. Should I worry

Re: Efficiency of integer storage/use

2015-10-16 Thread Alessandro Benedetti
Hi Robert, current Solr compression will work really well, both for stored and DocValues content. Regarding the index term dictionaries, I would ask other experts for help, as I have never checked how the actual compression works there, but I assume it is quite efficient. Usually the field type

Re: Efficiency of integer storage/use

2015-10-16 Thread Erick Erickson
Under the covers, Lucene stores ints in a packed format, so I'd just count on that for a first pass. What is "a lot of integer values"? Hundreds of millions? Billions? Trillions? Unless you give us some indication of scale, it's hard to say anything helpful. But unless you have some evidence
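As a rough picture of what a "packed format" means for values in the 0-255 range, the sketch below stores each value in exactly as many bits as its range needs, so four such values fit in the 4 bytes a single 32-bit int would occupy. It uses a Python integer as the bit buffer; the names `pack` and `unpack` are made up here, and this is not Lucene's on-disk layout, only the general idea.

```python
def pack(values, bits_per_value):
    """Pack non-negative ints into bytes, bits_per_value bits each (little-endian)."""
    limit = 1 << bits_per_value
    buffer = 0
    for i, v in enumerate(values):
        if not 0 <= v < limit:
            raise ValueError(f"{v} does not fit in {bits_per_value} bits")
        buffer |= v << (i * bits_per_value)
    num_bytes = (len(values) * bits_per_value + 7) // 8
    return buffer.to_bytes(num_bytes, "little")


def unpack(data, bits_per_value, count):
    """Read back `count` values written by pack()."""
    buffer = int.from_bytes(data, "little")
    mask = (1 << bits_per_value) - 1
    return [(buffer >> (i * bits_per_value)) & mask for i in range(count)]


# Hypothetical sample: four metric values in the 0-255 range.
metrics = [0, 255, 17, 128]
packed = pack(metrics, 8)
print(len(packed), "bytes packed vs", 4 * len(metrics), "bytes as 32-bit ints")
print(unpack(packed, 8, len(metrics)))
```

If the values only span 0-15, the same code with `bits_per_value=4` halves the size again.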