> An int will be stored as a 2 char string which will be sorted "char by char" > so > they will be almost as fast as sorting as integers.
John, two problems: 1) Memory consumption - string sorting uses String[] instead of int[] 2) Lucene uses UTF-8 to store strings, and you can't round-trip arbitrary characters (see test code below that will fail.) char[] carr = new char[65536]; for (int ch=0; ch<65536; ch++) carr[ch]=(char)ch; String s = new String(carr); byte[] b1 = s.getBytes("UTF-8"); String s2 = new String(b1,"UTF-8"); if (!s.equals(s2)) System.out.println("ERROR! Strings don't match"); -Yonik On Tue, 22 Mar 2005 08:56:13 +0000 (UTC), John Patterson <[EMAIL PROTECTED]> wrote: > Chris Hostetter <hossman_lucene <at> fucit.org> writes: > > > I haven't worked through the math to prove to myself that your algorithm > > is a viable way of expressing any Integer as a 4 byte String; such that > > any two Integers sort lexigraphically correct as strings ... but let's > > assume that i have, and that it works perfectly. > > The concept is this: Signed ints range from MIN_VALUE to MAX_VALUE which are > represented in binary as: > > 10000000 00000000 00000000 00000000 (MIN_VALUE) > 11111111 11111111 11111111 11111111 (-1) > 00000000 00000000 00000000 00000000 (0) > 00000000 00000000 00000000 00000001 (1) > 01111111 11111111 11111111 11111111 (MAX_VALUE) > > Any number begining with a 1 is negative. Adding 1 to -1 causes the left most > bit to overflow and is ignored. > > The problem with converting these numbers into chars (and Strings) is that > numbers that negative numbers begin with 1 and so are actually greater than > positive numbers that all begin with 0. Apart from the left-most bit every > thing is as it should be. > > To remedy this we need to change the numbers to unsigned ints such that > > 00000000 00000000 00000000 00000000 (MIN_VALUE) > 01111111 11111111 11111111 11111111 (-1) > 10000000 00000000 00000000 00000000 (0) > 10000000 00000000 00000000 00000001 (1) > 11111111 11111111 11111111 11111111 (MAX_VALUE) > > To convert from one to the other is just a case of flipping the left most bit > which is done by adding MIN_VALUE > > 00000000 00000000 00000000 00000001 (1) > 10000000 00000000 00000000 00000000 (MIN_VALUE) > ----------------------------------- > 10000000 00000000 00000000 00000001 (unsigned 1) > > My test case shows that this calculation works for a number of boundry cases. > > >so it's going to have to Sort them as Strings, which is > > slower then sorting them as Integers -- even if they are only 4 bytes long > > (unless I'm wrong, which is entirely possible ... i haven't tested it). > > An int will be stored as a 2 char string which will be sorted "char by char" > so > they will be almost as fast as sorting as integers. Certainly much faster > than > converting the Strings back to integers first and then sorting them! > > > Now, if we could override the method used to parse the Term values from > > Strings to Integers (using a user specified NumberFormat as i proposed) > > then we could name your "convertTotText" method as "format" and write a > > corrisponding "parse" method, and everything would work smashing. > > It would be great if this could be incorporated into Lucene as it will make > numeric searches much more efficient. I will soon need to store simple > geographical data in my index to do a "find the nearest x" type of search. > > John > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]