On Sun, 17 Jun 2007, Jeff Frost wrote:

On Mon, 18 Jun 2007, Tom Lane wrote:

Jeff Frost <[EMAIL PROTECTED]> writes:
[ umpteen million iterations of: ]
              ->  Limit  (cost=0.00..367.09 rows=1 width=8)
-> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..158215.86 rows=431 width=8)
                          Index Cond: (stamp < $1)
Filter: ((stamp IS NOT NULL) AND (visit_id = $0))

Perhaps an index on (visit_id, stamp) would help.  This one is doing the
best it can, but if the visit_id's you want are thinly scattered, it'll
still suck...

Good idea Tom!  In fact the planner seems to like that much better:

Seq Scan on page_view pv1  (cost=0.00..11529031.34 rows=3580205 width=239)
vs
Seq Scan on page_view pv1  (cost=0.00..2622541458.55 rows=3596473 width=237)

Now that this is working in a reasonable amount of time to process the entire data set, I need to work out how to process the new information that comes in every so often and still assign it a correct visit_id until the fix for this can get through QA. The cleanup query looks like this:

CREATE TEMP TABLE tmpmaxpvid AS SELECT MAX(id) AS id FROM reporting.page_view_clean;

INSERT INTO reporting.page_view_clean
SELECT id
     ,
       CASE
       WHEN stamp - last_stamp > INTERVAL '1 hour'
    OR last_stamp IS NULL THEN nextval('reporting.tracking_cleanup_seq')
       ELSE currval('reporting.tracking_cleanup_seq')
       END AS visit_id
     , uri
     , params
     , stamp
     , visit_id AS old_visit_id
  FROM (
    SELECT id
        , visit_id
        , uri
        , params
        , stamp
        , (
        SELECT MAX(pv2.stamp)
          FROM page_view pv2
        WHERE pv2.visit_id            = pv1.visit_id
          AND pv2.stamp               < pv1.stamp
          ) AS last_stamp
      FROM page_view pv1
    WHERE pv1.stamp               < now() - INTERVAL '1 hour'
      AND pv1.id                  > (
        SELECT CASE
                      WHEN id IS NULL THEN 0
                      ELSE id
                      END AS id
          FROM tmpmaxpvid
          )
    ORDER BY pv1.visit_id
        , pv1.stamp
       ) x;


The problem is the page_views that straddle the time before and after the query get an artificially different visit_id.

This case statement is what kills me:

       CASE
       WHEN stamp - last_stamp > INTERVAL '1 hour'
    OR last_stamp IS NULL THEN nextval('reporting.tracking_cleanup_seq')
       ELSE currval('reporting.tracking_cleanup_seq')
       END AS visit_id

If I change it to this:

       CASE
       WHEN last_stamp IS NULL
         THEN visit_id
       WHEN stamp - last_stamp > INTERVAL '1 hour'
         THEN nextval('reporting.tracking_cleanup_seq')
       ELSE currval('reporting.tracking_cleanup_seq')
       END AS visit_id

Then it gives the first page_view in the sequence the correct visit_id, but then continues on with a different visit_id for the rest.

I've also tried changing how I select which data to act on like so:

    WHERE pv1.id                  > (
        SELECT CASE
              WHEN id IS NULL THEN 0
              ELSE id
              END AS id
          FROM tmpmaxpvid
          )
      AND pv1.visit_id IN (
        SELECT visit_id
          FROM page_view pv3
        GROUP BY visit_id
        HAVING max(stamp)              < now() - INTERVAL '1 hour'
          )
    ORDER BY pv1.visit_id
        , pv1.stamp
       ) x;

But that leaves me skipping some page views because they haven't reached their conclusion and because of the id > clause, I'll never go back to see them.

Anyone have any suggestions?

--
Jeff Frost, Owner       <[EMAIL PROTECTED]>
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908     FAX: 650-649-1954

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

               http://www.postgresql.org/about/donate

Reply via email to