[sqlite] Re: speeding up a query

P Kishor Sun, 27 Aug 2006 20:57:31 -0700

Posting an update to my problem. I have a solution that seems to work.
I am posting my solution here just in case someone can suggest methods
for further gains in efficiency.


To recap the problem --
1. 210k polys, 5.25 million points.
2. For each point, update its attribute (name)) with that of the poly
it falls within.
3. All points fall in one and only one poly.
4. All polys may not contain a point.

I built the following tables and indexes in SQLite 3.3.5 (I am using
DBD::SQLite 1.12 on WinXP with ActiveState Perl 5.8.8)

   CREATE TABLE polys (
     id   INTEGER PRIMARY KEY,
     xmin REAL,
     ymin REAL,
     xmax REAL,
     ymax REAL,
     name TEXT,
     n    INTEGER,
     ar_x TEXT,
     ar_y TEXT
   )

   CREATE TABLE points (
     id   INTEGER PRIMARY KEY,
     x    REAL,
     y    REAL,
     name TEXT
   )

   CREATE INDEX ix_polys ON polys (xmin, ymin, xmax, ymax)

   CREATE INDEX ix_points ON points (x, y)

Then I calculate the bounding box of each poly and update xmin, ymin,
xmax, ymax in the polys table. I also unpack each of the polys and
update polys.n with the number of points in the poly, and polys.ar_x
and ar_y with a list of all the x coords and y coords that make up the
poly. I store the list of coords as a comma-separate list that I can
convert back into an array using split.

Then I loop over each poly in the table and select all the points that
might potentially fall within it using the following query --

   SELECT py.name, py.n, py.ar_x, py.ar_y, pt.id, pt.x, pt.y
   FROM polys py JOIN points pt ON
     (py.xmin < pt.x AND py.ymin < pt.y AND py.xmax > pt.x AND py.ymax > pt.y)
   WHERE py.id = ?

Since polys can be irregular, a given point may fall within its
bounding box, but still not be within the poly. So I make a definitive
check of that for every point selected in the above query against that
poly, and if the is positive, I

   UPDATE points SET name = ? WHERE id IN (?)

Have been testing extensively. The latest test --

perl test_point_in_poly.pl

5000.10000.15000.20000.25000.30000.35000.40000.45000.50000.55000.60000.65000.70000.75000.80000.85000.90000.95000.100000.
processed 100000 polys, updated 2948276 points
Total time: 8451 wallclock secs (6340.54 usr + 200.52 sys = 6541.06 CPU)

The above screen dump shows 100,000 polys processed (about half of
total), and almost 3 million points within them updated with the
corresponding name (almost 60% of the points). Time taken is slightly
less than 2.5 hours. Extrapolating, the entire set should take less
than 5 hours. That is on my piddly Dell laptop which has now only 200
Mb of space left. This process is entirely CPU bound, so it will
likely be way more quick on the production machine with its faster
processor. There is prep work required before and after, but even with
that, it should be in the range of about 5 hours or so.

Db questions -- are my indexes optimum? Is storing coords as textual
lists in the database detrimental in anyway speedwise? Some of these
coords lists can be pretty long. Is my SELECT a speed demon or a speed
hog? And, finally, is my UPDATE being affected negatively because of
the index on the table?


On 8/26/06, P Kishor <[EMAIL PROTECTED]> wrote:

Greets,

Using SQLite for Windows 3.3.7. I have the following two tables
-- bounding box of each poly
CREATE TABLE polys (
  poly_id INTEGER PRIMARY KEY,
  xmin    REAL,
  ymin    REAL,
  xmax    REAL,
  ymax    REAL,
  name    TEXT
)

data look like so

1|1723885.18957644|282631.95646140|1727224.46537863|287816.54753434|100030117004
2|1716073.04710809|281166.39606662|1725746.97483727|287098.00276086|100030118009

-- coord pair of each point
CREATE TABLE points (
  point_id  INTEGER PRIMARY KEY,
  x         REAL,
  y         REAL,
  name      TEXT
)

data look like so

1|-2268900.28781180|191367.60670709|
2|-2269660.73941476|193426.66511514|

I have built the following indexes

CREATE INDEX ix_polys_xmin ON polys (xmin)
CREATE INDEX ix_polys_ymin ON polys (ymin)
CREATE INDEX ix_polys_xmax ON polys (xmax)
CREATE INDEX ix_polys_ymax ON polys (ymax)
CREATE INDEX ix_points_x ON points (x)
CREATE INDEX ix_points_y ON points (y)

Here is what I want to do: I want to SET points.name = polys.name WHERE

  polys.xmin < (SELECT points.x FROM points WHERE point_id = ?) AND
  polys.ymin < (SELECT points.y FROM points WHERE point_id = ?) AND
  polys.xmax > (SELECT points.x FROM points WHERE point_id = ?) AND
  polys.ymax > (SELECT points.y FROM points WHERE point_id = ?)

yields one and only one result. For all other records, I want to run a
further point-in-poly function which enables me to find an exact
match. In other words, I minimize the number of times I have to run my
point-in-poly function by getting SQLite's help in eliminating the
points I am know to definitely fall inside a specific poly.

So, to test my approach, I set up a loop over the points (I am using
Perl DBI), and tried the following --

  foreach point_id

    SELECT name
    FROM polys
    WHERE
      xmin < (SELECT x FROM points WHERE point_id = ?) AND
      ymin < (SELECT y FROM points WHERE point_id = ?) AND
      xmax > (SELECT x FROM points WHERE point_id = ?) AND
      ymax > (SELECT y FROM points WHERE point_id = ?)

    If only one polys.name was found, that's it

    elsif more than one polys.name were found
      run another function to determine exact match

Well, I learned that perhaps this may not be the way to do this. You
see, I have more than 200k rows in my polys table, and more than 5
million rows in the points table. I ran the above query last night,
went to sleep, and this morning it was still churning away (or had
frozen the 'puter). I had to kill it. Looping over each point is just
way too slow.

Another approach would be to loop over each poly and narrow my set to
all the points in it. Then run my function on each point against that
poly. However, this will also likely take very long -- one, I have to
still loop over each row, and then, when I find the points, I have
check each point (as I might have overlapping polys).

Suggestions?



--
Puneet Kishor http://punkish.eidesis.org/
Nelson Inst. for Env. Studies, UW-Madison http://www.ies.wisc.edu/
Open Source Geospatial Foundation https://edu.osgeo.org/

-----------------------------------------------------------------------------
To unsubscribe, send email to [EMAIL PROTECTED]
-----------------------------------------------------------------------------

[sqlite] Re: speeding up a query

Reply via email to