Re: NumberTools

Yonik Seeley Thu, 24 Mar 2005 08:56:08 -0800

> An int will be stored as a 2 char string which will be sorted "char by char" 
> so
> they will be almost as fast as sorting as integers.


John, two problems:
1) Memory consumption - string sorting uses String[] instead of int[]
2) Lucene uses UTF-8 to store strings, and you can't round-trip
arbitrary characters (see test code below that will fail.)

    char[] carr = new char[65536];
    for (int ch=0; ch<65536; ch++) carr[ch]=(char)ch;
    String s = new String(carr);
    byte[] b1 = s.getBytes("UTF-8");
    String s2 = new String(b1,"UTF-8");
    if (!s.equals(s2)) System.out.println("ERROR! Strings don't match");

-Yonik


On Tue, 22 Mar 2005 08:56:13 +0000 (UTC), John Patterson
<[EMAIL PROTECTED]> wrote:
> Chris Hostetter <hossman_lucene <at> fucit.org> writes:
> 
> > I haven't worked through the math to prove to myself that your algorithm
> > is a viable way of expressing any Integer as a 4 byte String;  such that
> > any two Integers sort lexigraphically correct as strings ... but let's
> > assume that i have, and that it works perfectly.
> 
> The concept is this:  Signed ints range from MIN_VALUE to MAX_VALUE which are
> represented in binary as:
> 
> 10000000 00000000 00000000 00000000 (MIN_VALUE)
> 11111111 11111111 11111111 11111111 (-1)
> 00000000 00000000 00000000 00000000 (0)
> 00000000 00000000 00000000 00000001 (1)
> 01111111 11111111 11111111 11111111 (MAX_VALUE)
> 
> Any number begining with a 1 is negative.  Adding 1 to -1 causes the left most
> bit to overflow and is ignored.
> 
> The problem with converting these numbers into chars (and Strings) is that
> numbers that negative numbers begin with 1 and so are actually greater than
> positive numbers that all begin with 0.  Apart from the left-most bit every
> thing is as it should be.
> 
> To remedy this we need to change the numbers to unsigned ints such that
> 
> 00000000 00000000 00000000 00000000 (MIN_VALUE)
> 01111111 11111111 11111111 11111111 (-1)
> 10000000 00000000 00000000 00000000 (0)
> 10000000 00000000 00000000 00000001 (1)
> 11111111 11111111 11111111 11111111 (MAX_VALUE)
> 
> To convert from one to the other is just a case of flipping the left most bit
> which is done by adding MIN_VALUE
> 
> 00000000 00000000 00000000 00000001 (1)
> 10000000 00000000 00000000 00000000 (MIN_VALUE)
> -----------------------------------
> 10000000 00000000 00000000 00000001 (unsigned 1)
> 
> My test case shows that this calculation works for a number of boundry cases.
> 
> >so it's going to have to Sort them as Strings, which is
> > slower then sorting them as Integers -- even if they are only 4 bytes long
> > (unless I'm wrong, which is entirely possible ... i haven't tested it).
> 
> An int will be stored as a 2 char string which will be sorted "char by char" 
> so
> they will be almost as fast as sorting as integers.  Certainly much faster 
> than
> converting the Strings back to integers first and then sorting them!
> 
> > Now, if we could override the method used to parse the Term values from
> > Strings to Integers (using a user specified NumberFormat as i proposed)
> > then we could name your "convertTotText" method as "format" and write a
> > corrisponding "parse" method, and everything would work smashing.
> 
> It would be great if this could be incorporated into Lucene as it will make
> numeric searches much more efficient.  I will soon need to store simple
> geographical data in my index to do a "find the nearest x" type of search.
> 
> John
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: NumberTools

Reply via email to