Re: Query that sorts a large result set.

Ard Schrijvers Fri, 19 Jun 2009 00:36:03 -0700

On Fri, Jun 19, 2009 at 9:25 AM, Marcel
Reutegger<[email protected]> wrote:
> Hi Ard,
>
> I think this discussion rather belongs to the dev list.


Yes, you are right. :-)

Ard

>
> I'll reply there...
>
> regards
>  marcel
>
> On Thu, Jun 18, 2009 at 23:20, Ard Schrijvers<[email protected]> wrote:
>> Hello Marcel,
>>
>> I like this solution, but it seems to me to be suitable only for
>> dates, right? How do we know that we are sorting on a date: by
>> checking whether the value has length 9, or that it starts with msq?
>> Furthermore, I am quite curious how you implemented the approach
>> below. If you just used substrings, we could gain quite a bit more,
>> but I am not sure whether you already do this:
>>
>> Suppose
>>
>> String s = "msqyw2shb";
>>
>> If you are storing the substring in a String[] parts like this
>>
>> parts[0] = s.substring(0, 3);
>>
>> we reduce memory usage quite a bit more with
>>
>> parts[0] = new String(s.substring(0, 3));
>>
>> Also see [1]. But perhaps you are already doing this.
>>
>> A small improvement we could make right away is replacing:
>>
>> retArray[termDocs.doc()] = term.text().substring(prefix.length());
>>
>> with
>>
>> retArray[termDocs.doc()] = new String(term.text().substring(prefix.length()));
>>
>> It looks a bit strange, but for dates I think the prefix is
>> something like "lastModified" plus a delimiter, say 13 chars. The
>> copy would bring the char array retained in memory down from 22
>> chars to 9 (for dates).
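
A minimal sketch of the copy suggested above, assuming a JVM where
String.substring() shares the backing char[] of the original string (as
described in [1]). The class name, the helper name and the ":" delimiter
are hypothetical and only serve the illustration; the real delimiter in
the index may differ. On newer JVMs (as far as I know, since Java 7u6)
substring() already copies the characters, so the extra new String(...)
would be redundant there.

```java
final class SortValueCopy {

    // Hypothetical helper: strips the property-name prefix from a term
    // and copies the remainder into a String of its own, so that the
    // full "prefix + value" char[] is not kept alive by the sort cache.
    static String stripPrefix(String termText, String prefix) {
        // substring() may share the original char[] on older JVMs;
        // new String(...) forces a right-sized copy of the value part.
        return new String(termText.substring(prefix.length()));
    }

    public static void main(String[] args) {
        String term = "lastModified:msqyw2shb"; // 22 chars, assumed layout
        String value = stripPrefix(term, "lastModified:");
        System.out.println(value + " (" + value.length() + " chars)");
    }
}
```

With the assumed 13-char prefix, the retained value is the 9-char date
string only.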
>>
>> Furthermore, it follows that using short property names saves you
>> memory. This could be avoided in the end if we indexed each property
>> in its own Lucene field, instead of indexing all of them in
>> :_PROPERTIES and prefixing the value with the property name. This,
>> though, requires quite some rewriting of the indexing, I think.
>>
>> [1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4513622
>>
>>
>>
>> On Thu, Jun 18, 2009 at 1:25 PM, Marcel
>> Reutegger<[email protected]> wrote:
>>> On Thu, Jun 18, 2009 at 09:37, Ard Schrijvers <[email protected]> wrote:
>>>> If you happen to find the holy grail solution, I suppose you'll let us know
>>>> :-) Also, if you have some memory usage numbers with and without my
>>>> suggestion regarding reducing the precision of your Date field, this
>>>> would be very valuable.
>>>
>>> hmm, I've been thinking about a solution that I would call
>>> flyweight-substring-collation-key. it assumes that there is usually a
>>> major overlap between substrings of the values to sort on, e.g. for a
>>> lastModified value. so instead of always keeping the entire value we'd
>>> have a collation key that references multiple reusable substrings.
>>>
>>> assume we have the following values:
>>>
>>> - msqyw2shb
>>> - msqyw2t93
>>> - msqyw2u0v
>>> - msqyw2usn
>>> - msqyw2vkf
>>> - msqyw2wc7
>>> - msqyw2x3z
>>> - msqyw2xvr
>>> - msqyw2ynj
>>> - msqyw2zfb
>>>
>>> (those are date property values each 1 second after the previous one)
>>>
>>> we could create collation keys for use as comparable in the field
>>> cache like this:
>>>
>>> substring cache:
>>> [0] msq
>>> [1] shb
>>> [2] t93
>>> [3] u0v
>>> [4] usn
>>> [5] vkf
>>> [6] wc7
>>> [7] x3z
>>> [8] xvr
>>> [9] ynj
>>> [10] yw2
>>> [11] zfb
>>>
>>> and then the actual comparables that reference the substrings in the cache:
>>>
>>> - {0, 10, 1}
>>> - {0, 10, 2}
>>> - {0, 10, 3}
>>> - {0, 10, 4}
>>> - {0, 10, 5}
>>> - {0, 10, 6}
>>> - {0, 10, 7}
>>> - {0, 10, 8}
>>> - {0, 10, 9}
>>> - {0, 10, 11}
>>>
>>> this will result in a lower memory consumption and using the reference
>>> indexes could even speed up the comparison.
>>>
>>> a quick test with 1 million date values showed that the memory
>>> consumption drops to 50% with this approach.
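
The idea above could be sketched roughly as follows. This is not the
actual implementation that was tested; it is a small illustration that
assumes values are split into fixed 3-char chunks and that the substring
cache is kept sorted, so that comparing the index arrays gives the same
order as comparing the original strings. The class and method names
(FlyweightKeys, buildCache, toKey, compare) are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class FlyweightKeys {
    static final int CHUNK = 3; // assumed chunk size, as in the example

    // Splits a value into CHUNK-sized pieces (the last may be shorter).
    static List<String> chunks(String value) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < value.length(); i += CHUNK) {
            out.add(value.substring(i, Math.min(value.length(), i + CHUNK)));
        }
        return out;
    }

    // Collects the distinct chunks of all values in sorted order.
    static String[] buildCache(List<String> values) {
        TreeSet<String> distinct = new TreeSet<>();
        for (String v : values) {
            distinct.addAll(chunks(v));
        }
        return distinct.toArray(new String[0]);
    }

    // Encodes a value as indexes into the sorted cache. Every chunk of
    // the value must already be present in the cache.
    static int[] toKey(String value, String[] cache) {
        List<String> parts = chunks(value);
        int[] key = new int[parts.size()];
        for (int i = 0; i < key.length; i++) {
            key[i] = Arrays.binarySearch(cache, parts.get(i));
        }
        return key;
    }

    // Compares two keys index by index; no string comparison needed.
    static int compare(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return Integer.compare(a.length, b.length);
    }
}
```

For the ten values listed above, buildCache yields the same 12-entry
cache, and toKey("msqyw2shb", cache) gives {0, 10, 1}. Any memory saving
depends on how much the chunks overlap in the real data.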
>>>
>>> regards
>>>  marcel
>>>
>>
>
