On Fri, Jun 19, 2009 at 9:25 AM, Marcel Reutegger<[email protected]> wrote:
> Hi Ard,
>
> I think this discussion rather belongs to the dev list.

Yes, you are right. :-)

Ard

>
> I'll reply there...
>
> regards
>  marcel
>
> On Thu, Jun 18, 2009 at 23:20, Ard Schrijvers<[email protected]> wrote:
>> Hello Marcel,
>>
>> While I like this solution, it seems to me to be suitable only for
>> dates, right? How do we know that we are sorting on a date: by checking
>> whether it has length 9, or that it starts with msq? Furthermore, I am
>> quite curious how you implemented this below. If you just used
>> substrings, we could gain quite a bit more, but I am not sure
>> whether you already do this:
>>
>> Suppose
>>
>> String s = "msqyw2shb";
>>
>> If you have
>>
>> strings[0] = s.substring(0, 3);
>>
>> we reduce memory usage quite a bit more with
>>
>> strings[0] = new String(s.substring(0, 3));
>>
>> Also see [1]. But perhaps you are already doing this.
>>
>> A small improvement we could make directly is replacing:
>>
>> retArray[termDocs.doc()] = term.text().substring(prefix.length());
>>
>> with
>>
>> retArray[termDocs.doc()] = new String(term.text().substring(prefix.length()));
>>
>> It is a bit strange, but for dates I think the prefix is
>> something like "lastModified" plus a delimiter, say 13 chars, so this
>> would bring the char array retained in memory down from 22 to
>> 9 chars (for dates).
>>
>> Furthermore, it follows that using short property names saves you
>> memory. This could be avoided in the end if we indexed each property in
>> its own lucene field, instead of all in :_PROPERTIES with the value
>> prefixed by the property name. This would require quite some rewrite
>> of the indexing, though, I think.
>>
>> [1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4513622
>>
>>
>>
>> On Thu, Jun 18, 2009 at 1:25 PM, Marcel Reutegger<[email protected]> wrote:
>>> On Thu, Jun 18, 2009 at 09:37, Ard Schrijvers <[email protected]> wrote:
>>>> If you happen to find the holy grail solution, I suppose you'll let us know
>>>> :-) Also, if you have some memory usage numbers with and without my
>>>> suggestion of reducing the precision of your Date field, this
>>>> would be very valuable.
>>>
>>> hmm, I've been thinking about a solution that I would call
>>> flyweight-substring-collation-key. it assumes that there is usually a
>>> major overlap of substrings of the values to sort on, e.g. a
>>> lastModified value. so instead of always keeping the entire value we'd
>>> have a collation key that references multiple reusable substrings.
>>>
>>> assume we have the following values:
>>>
>>> - msqyw2shb
>>> - msqyw2t93
>>> - msqyw2u0v
>>> - msqyw2usn
>>> - msqyw2vkf
>>> - msqyw2wc7
>>> - msqyw2x3z
>>> - msqyw2xvr
>>> - msqyw2ynj
>>> - msqyw2zfb
>>>
>>> (those are date property values, each 1 second after the previous one)
>>>
>>> we could create collation keys for use as comparables in the field
>>> cache like this:
>>>
>>> substring cache:
>>> [0] msq
>>> [1] shb
>>> [2] t93
>>> [3] u0v
>>> [4] usn
>>> [5] vkf
>>> [6] wc7
>>> [7] x3z
>>> [8] xvr
>>> [9] ynj
>>> [10] yw2
>>> [11] zfb
>>>
>>> and then the actual comparables that reference the substrings in the cache:
>>>
>>> - {0, 10, 1}
>>> - {0, 10, 2}
>>> - {0, 10, 3}
>>> - {0, 10, 4}
>>> - {0, 10, 5}
>>> - {0, 10, 6}
>>> - {0, 10, 7}
>>> - {0, 10, 8}
>>> - {0, 10, 9}
>>> - {0, 10, 11}
>>>
>>> this will result in lower memory consumption, and using the reference
>>> indexes could even speed up the comparison.
>>>
>>> a quick test with 1 million date values showed that the memory
>>> consumption drops to 50% with this approach.
>>>
>>> regards
>>>  marcel
>>>
>>
>
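
For readers of the archive, here is a minimal sketch of how the substring cache and index keys described above could fit together. It is not the code Marcel tested. The class name SubstringKeyCache, its methods, and the fixed chunk width of 3 are all assumptions made for illustration. The sketch collects every distinct chunk first and sorts them, so that comparing two index arrays gives the same order as comparing the original strings. Each chunk is copied with new String(...) as Ard suggests, which only matters on JVMs where substring() shares the parent's char array.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Illustrative sketch only; class and method names are hypothetical. */
final class SubstringKeyCache {
    /** Assumed chunk width, chosen to match the 3-character pieces in the example. */
    private static final int CHUNK = 3;

    /** Distinct chunks in String.compareTo order, so index order equals chunk order. */
    private final String[] chunks;

    SubstringKeyCache(List<String> values) {
        TreeSet<String> distinct = new TreeSet<>();
        for (String value : values) {
            for (int start = 0; start < value.length(); start += CHUNK) {
                // Copy the piece so it does not pin the longer source string on
                // JVMs where substring() shares the parent's char array.
                distinct.add(new String(chunkAt(value, start)));
            }
        }
        chunks = distinct.toArray(new String[0]);
    }

    private static String chunkAt(String value, int start) {
        return value.substring(start, Math.min(start + CHUNK, value.length()));
    }

    /** Returns the cache indexes of the value's chunks; the value must have been passed to the constructor. */
    int[] keyFor(String value) {
        int[] key = new int[(value.length() + CHUNK - 1) / CHUNK];
        for (int i = 0; i < key.length; i++) {
            int idx = Arrays.binarySearch(chunks, chunkAt(value, i * CHUNK));
            if (idx < 0) {
                throw new IllegalArgumentException("unknown chunk in " + value);
            }
            key[i] = idx;
        }
        return key;
    }

    /** Compares two keys index by index; a key that is a prefix of the other sorts first. */
    static int compare(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    /** Number of distinct chunks held in the cache. */
    int size() {
        return chunks.length;
    }
}
```

Built from the ten values listed in the thread, this cache holds the same 12 substrings in the same order, and keyFor("msqyw2shb") yields {0, 10, 1}. Whether it actually reaches the memory saving reported above depends on the data and the JVM, which this sketch does not measure.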
