Chris Hostetter <hossman_lucene <at> fucit.org> writes:

> I haven't worked through the math to prove to myself that your algorithm
> is a viable way of expressing any Integer as a 4 byte String;  such that
> any two Integers sort lexigraphically correct as strings ... but let's
> assume that i have, and that it works perfectly.

The concept is this:  Signed ints range from MIN_VALUE to MAX_VALUE which are
represented in binary as:

10000000 00000000 00000000 00000000 (MIN_VALUE)
11111111 11111111 11111111 11111111 (-1)
00000000 00000000 00000000 00000000 (0)
00000000 00000000 00000000 00000001 (1)
01111111 11111111 11111111 11111111 (MAX_VALUE)

Any number begining with a 1 is negative.  Adding 1 to -1 causes the left most
bit to overflow and is ignored.

The problem with converting these numbers into chars (and Strings) is that
numbers that negative numbers begin with 1 and so are actually greater than
positive numbers that all begin with 0.  Apart from the left-most bit every
thing is as it should be.

To remedy this we need to change the numbers to unsigned ints such that

00000000 00000000 00000000 00000000 (MIN_VALUE)
01111111 11111111 11111111 11111111 (-1)
10000000 00000000 00000000 00000000 (0)
10000000 00000000 00000000 00000001 (1)
11111111 11111111 11111111 11111111 (MAX_VALUE)

To convert from one to the other is just a case of flipping the left most bit
which is done by adding MIN_VALUE

00000000 00000000 00000000 00000001 (1)
10000000 00000000 00000000 00000000 (MIN_VALUE)
-----------------------------------
10000000 00000000 00000000 00000001 (unsigned 1)

My test case shows that this calculation works for a number of boundry cases.

>so it's going to have to Sort them as Strings, which is
> slower then sorting them as Integers -- even if they are only 4 bytes long
> (unless I'm wrong, which is entirely possible ... i haven't tested it).

An int will be stored as a 2 char string which will be sorted "char by char" so
they will be almost as fast as sorting as integers.  Certainly much faster than
converting the Strings back to integers first and then sorting them!
 
> Now, if we could override the method used to parse the Term values from
> Strings to Integers (using a user specified NumberFormat as i proposed)
> then we could name your "convertTotText" method as "format" and write a
> corrisponding "parse" method, and everything would work smashing.

It would be great if this could be incorporated into Lucene as it will make
numeric searches much more efficient.  I will soon need to store simple
geographical data in my index to do a "find the nearest x" type of search.

John




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to