On 8/3/16, 11:41 PM, "bilbosax" <waspenc...@comcast.net> wrote:

>You just stepped outside of my academic sphere :)  Although, I did a quick
>wiki scan on hashes and it sounds fascinating, but I don't know how well
>it
>would work in this situation.  I think I would have to do a hash 38k times
>if I understand it correctly.  I don't have a certain set of values that I
>am looking for to create a hash.  The record that I am looking at defines
>all of the values that need compared, and those values change from record
>to
>record.  So the search characteristics are constantly changing.  But, it
>could be that I simply do not understand hashes, because I just glanced at
>it.  I will take a deeper look into it soon, as it seems like something I
>should know to a certain degree of competency.
>

Well, generalized hashing is a big topic.  What you are doing here is much
simpler.  Maybe another way to think about it is as the generation of
identifiers or keys for each object where if two identifiers match, then
you know the two items should be compared to each other.

All of the properties you use in your chain of if statements during the
compare become the variables in the hash/identifier function.  The key
things to keep in mind are that this gives you the opportunity to do the
comparison inside the AIR runtime (in C code) instead of in ActionScript
because the property lookup on the Object runs inside AIR, and that this
dramatically changes the number of comparisons, which was the bottleneck
in your app.

Your current code was/is (roughly):

for (i = 0; i < n; i++)
  for (j = 0; j < n, j++)
    if (i != j &&
        item[i].prop1 == item[j].prop1 &&
        item[i].prop2 == item[j].prop2 ...

The "if"s are run n^2 times.  If you have 4 items, 16 times, for 100
items, 10,000 times and for 38,000 items, the ifs are being run
1,444,000,000 times.

Since the comparison of itemI to itemJ is the same as comparing itemJ to
itemI, your loop only really needs to look like:

for (i = 0; i < n - 1; i++)
  for (j = i; j < n, j++)
    if (item[i].prop1 == item[j].prop1 &&
        item[i].prop2 == item[j].prop2 ...

Then, the "ifs" are run much fewer times.  If you have 4 items, only 6
times, for 100 items, 4950 times, and for 38,000 items, only 721,981,000
times, which should cut your total time in half.


But then, if you use hashing, the "if" may take longer because it won't
have early exits, but if you have 4 items, the test is only run 4 times,
for 100 items only 100 times and for 38,000 items, only 38,000 times.  You
can see that that is way fewer times than 721,981,000 times so you can
usually afford the more expensive hash calculation.

The simplest hash is the concatenation of the string representation of all
the properties you are currently comparing, assuming that no property will
have the value of empty string "".  I believe you can always test for ""
and swap for something else, even a simple space " ".  If there are
numeric properties you might want to add a delimiter so that "11" + "1" is
different from "1" + "11".

Let's say I want to rummage through comments left on our wiki and find
folks who submitted more than one.  Each comment record might have:

class CommentRecord
{
  Var date:Date
  var firstName:String
  var lastName:String
  var country:String
  var state:String
  var city:String
  var street:String
  var houseNumber:String
  var phoneNumber:String
  var comment:String
}








If one can assume that
country+state+city+street+houseNumber+lastName+firstName would uniquely
identify someone who left a comment on the wiki, then we can find folks
who've left more than one comment by hash-sorting them into buckets as I
described in my previous post.  The computeHash() could be as simple as:

function computeHash(item:Object):String
{
  return item.country + item.state + item.street +
         item.houseNumber + item.lastName +
         item.firstName;
}

I would expect this algorithm to scale much better as the number of
comments grows since even at 100,000 records it is only 100,000 tests.
Also note that you could store the hash in the DB so it doesn't need to be
computed each time you start up the app.

HTH,
-Alex

Reply via email to