Re: Derby SQL ideas needed

2010-09-19 Thread Alexey Serba
 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount AS 
 activitycount, t3.windowstart AS starttime, t3.windowend AS endtime FROM 
 (...) t3
Do you have a primary key in your t3 table?

 In Postgresql, what this does is to return the FIRST entire row matching each 
 distinct idbucket result.
FIRST based on which sort?

Let's say you want to return the FIRST row based on the t3.windowstart column
and you have a primary key on the t3 table. Then I believe your query can be
rewritten in the following ways:

1. Using subqueries
SELECT
bucket, primary_key, windowstart, etc
FROM
table AS t1
WHERE
windowstart=( SELECT max(windowstart) FROM table AS t2 WHERE
bucket = t1.bucket )

2. Using joins instead of subqueries ( in case Derby doesn't support
subqueries - not sure about that )
SELECT
t1.bucket, t1.primary_key, windowstart, etc
FROM
table AS t1
LEFT OUTER JOIN table AS t2 ON ( t1.bucket = t2.bucket AND
t2.windowstart > t1.windowstart )
WHERE
t2.primary_key IS NULL
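
For concreteness, a tiny self-contained illustration of variant 2 (the
table and column names here are made up):

-- hypothetical table: one row per (bucket, window); id is the primary key
-- CREATE TABLE activity (id INT PRIMARY KEY, bucket VARCHAR(32), windowstart BIGINT);
SELECT t1.bucket, t1.id, t1.windowstart
FROM activity t1
LEFT OUTER JOIN activity t2
  ON t1.bucket = t2.bucket AND t2.windowstart > t1.windowstart
WHERE t2.id IS NULL

The outer join keeps only those t1 rows for which no row with a greater
windowstart exists in the same bucket; ties on windowstart will return
more than one row per bucket unless you also break them on the primary
key.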

HTH,
Alex

On Sat, Sep 18, 2010 at 2:28 PM, Karl Wright daddy...@gmail.com wrote:
 Hi Folks,

 For two of the report queries, ACF uses the following Postgresql
 construct, which sadly seems to have no Derby equivalent:

 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount
 AS activitycount, t3.windowstart AS starttime, t3.windowend AS endtime
 FROM (...) t3

 In Postgresql, what this does is to return the FIRST entire row
 matching each distinct idbucket result.  If Derby had a FIRST()
 aggregate function, it would be the equivalent of:

 SELECT t3.bucket AS idbucket, FIRST(t3.activitycount) AS
 activitycount, FIRST(t3.windowstart) AS starttime, FIRST(t3.windowend)
 AS endtime FROM (...) t3 GROUP BY t3.bucket

 Unfortunately, Derby has no such aggregate function.  Furthermore, it
 would not be ideal if I were to do the work myself in ACF, because
 this is a resultset that needs to be paged through with offset and
 length, for presentation to the user and sorting, so it gets wrapped
 in another SELECT ... FROM (...) ORDER BY ... OFFSET ... LIMIT ...
 that does that part.

 Does anyone have any ideas and/or Derby contacts?  I'd really like the
 quick-start example to have a functional set of reports.

 Karl



Re: Derby SQL ideas needed

2010-09-19 Thread Karl Wright
FIRST based on which sort??

First based on the existing sort, which is crucial, because the sort
is by bucket ASC, activitycount DESC.  I'm looking for the row with
the highest activitycount, per bucket.

The other thing is that we cannot afford to use the same table
twice, as it is actually an extremely expensive query in its own
right, with multiple joins, select distinct's, etc. under the covers.
I'd be happy to post it but it may shock you. ;-)

Karl





On Sun, Sep 19, 2010 at 11:32 AM, Alexey Serba ase...@gmail.com wrote:
 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount AS 
 activitycount, t3.windowstart AS starttime, t3.windowend AS endtime FROM 
 (...) t3
 Do you have primary key in your t3 table?

 In Postgresql, what this does is to return the FIRST entire row matching 
 each distinct idbucket result.
 FIRST based on which sort?

 Lets say you want to return FIRST row based on t3.windowstart column
 and you have primary key in t3 table. Then I believe your query can be
 rewritten in the following ways:

 1. Using subqueries
 SELECT
    bucket, primary_key, windowstart, etc
 FROM
    table AS t1
 WHERE
    windowstart=( SELECT max(windowstart) FROM table AS t2 WHERE
 bucket = t1.bucket )

 2. Using joins instead of subqueries ( in case Derby doesn't support
 subqueries - not sure about that )
 SELECT
    t1.bucket, t1.primary_key, windowstart, etc
 FROM
    table AS t1
    LEFT OUTER JOIN table AS t2 ON ( t1.bucket = t2.bucket AND
 t2.windowstart > t1.windowstart )
 WHERE
    t2.primary_key IS NULL

 HTH,
 Alex

 On Sat, Sep 18, 2010 at 2:28 PM, Karl Wright daddy...@gmail.com wrote:
 Hi Folks,

 For two of the report queries, ACF uses the following Postgresql
 construct, which sadly seems to have no Derby equivalent:

 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount
 AS activitycount, t3.windowstart AS starttime, t3.windowend AS endtime
 FROM (...) t3

 In Postgresql, what this does is to return the FIRST entire row
 matching each distinct idbucket result.  If Derby had a FIRST()
 aggregate function, it would be the equivalent of:

 SELECT t3.bucket AS idbucket, FIRST(t3.activitycount) AS
 activitycount, FIRST(t3.windowstart) AS starttime, FIRST(t3.windowend)
 AS endtime FROM (...) t3 GROUP BY t3.bucket

 Unfortunately, Derby has no such aggregate function.  Furthermore, it
 would not be ideal if I were to do the work myself in ACF, because
 this is a resultset that needs to be paged through with offset and
 length, for presentation to the user and sorting, so it gets wrapped
 in another SELECT ... FROM (...) ORDER BY ... OFFSET ... LIMIT ...
 that does that part.

 Does anyone have any ideas and/or Derby contacts?  I'd really like the
 quick-start example to have a functional set of reports.

 Karl




Re: Derby SQL ideas needed

2010-09-19 Thread Alexey Serba
 The other thing is that we cannot afford to use the same table
 twice, as it is actually an extremely expensive query in its own
 right, with multiple joins, select distinct's, etc. under the covers.
Even if you create indexes on the bucket and activitycount columns? It
might be that the query plans for the two variants (the "distinct on"
hack versus a subquery-max / subquery-order-limit / join rewrite) would
be the same.
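
For example, something along these lines (table name hypothetical):

CREATE INDEX ix_bucket_activity ON activity (bucket, activitycount)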

 I'd be happy to post it but it may shock you. ;-)
The way I indent SQL queries should say that I'm not afraid of
multipage queries :)


 Karl





 On Sun, Sep 19, 2010 at 11:32 AM, Alexey Serba ase...@gmail.com wrote:
 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount AS 
 activitycount, t3.windowstart AS starttime, t3.windowend AS endtime FROM 
 (...) t3
 Do you have primary key in your t3 table?

 In Postgresql, what this does is to return the FIRST entire row matching 
 each distinct idbucket result.
 FIRST based on which sort?

 Lets say you want to return FIRST row based on t3.windowstart column
 and you have primary key in t3 table. Then I believe your query can be
 rewritten in the following ways:

 1. Using subqueries
 SELECT
    bucket, primary_key, windowstart, etc
 FROM
    table AS t1
 WHERE
    windowstart=( SELECT max(windowstart) FROM table AS t2 WHERE
 bucket = t1.bucket )

 2. Using joins instead of subqueries ( in case Derby doesn't support
 subqueries - not sure about that )
 SELECT
    t1.bucket, t1.primary_key, windowstart, etc
 FROM
    table AS t1
    LEFT OUTER JOIN table AS t2 ON ( t1.bucket = t2.bucket AND
 t2.windowstart > t1.windowstart )
 WHERE
    t2.primary_key IS NULL

 HTH,
 Alex

 On Sat, Sep 18, 2010 at 2:28 PM, Karl Wright daddy...@gmail.com wrote:
 Hi Folks,

 For two of the report queries, ACF uses the following Postgresql
 construct, which sadly seems to have no Derby equivalent:

 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount
 AS activitycount, t3.windowstart AS starttime, t3.windowend AS endtime
 FROM (...) t3

 In Postgresql, what this does is to return the FIRST entire row
 matching each distinct idbucket result.  If Derby had a FIRST()
 aggregate function, it would be the equivalent of:

 SELECT t3.bucket AS idbucket, FIRST(t3.activitycount) AS
 activitycount, FIRST(t3.windowstart) AS starttime, FIRST(t3.windowend)
 AS endtime FROM (...) t3 GROUP BY t3.bucket

 Unfortunately, Derby has no such aggregate function.  Furthermore, it
 would not be ideal if I were to do the work myself in ACF, because
 this is a resultset that needs to be paged through with offset and
 length, for presentation to the user and sorting, so it gets wrapped
 in another SELECT ... FROM (...) ORDER BY ... OFFSET ... LIMIT ...
 that does that part.

 Does anyone have any ideas and/or Derby contacts?  I'd really like the
 quick-start example to have a functional set of reports.

 Karl





Re: Derby SQL ideas needed

2010-09-19 Thread Karl Wright
Here you go:

    // The query we will generate here looks like this:
    // SELECT *
    //   FROM
    //     (SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket,
    //             t3.bytecount AS bytecount,
    //             t3.windowstart AS starttime, t3.windowend AS endtime
    //        FROM (SELECT * FROM (SELECT t0.bucket AS bucket,
    //                   t0.starttime AS windowstart, t0.starttime + interval AS windowend,
    //                   SUM(t1.datasize * ((case when t0.starttime + interval < t1.endtime
    //                       then t0.starttime + interval else t1.endtime end) -
    //                     (case when t0.starttime > t1.starttime
    //                       then t0.starttime else t1.starttime end))
    //                      / (t1.endtime - t1.starttime)) AS bytecount
    //                   FROM (SELECT DISTINCT substring(entityid from
    //                       'bucketregexp') AS bucket, starttime FROM repohistory WHERE
    //                       criteria) t0, repohistory t1
    //                   WHERE t0.bucket=substring(t1.entityid from 'bucket_regexp')
    //                      AND t1.starttime < t0.starttime + interval
    //                      AND t1.endtime > t0.starttime
    //                      AND criteria on t1
    //                   GROUP BY bucket,windowstart,windowend
    //              UNION SELECT t0a.bucket AS bucket, t0a.endtime - interval AS windowstart,
    //                   t0a.endtime AS windowend,
    //                   SUM(t1a.datasize * ((case when t0a.endtime < t1a.endtime
    //                       then t0a.endtime else t1a.endtime end) -
    //                     (case when t0a.endtime - interval > t1a.starttime
    //                       then t0a.endtime - interval else t1a.starttime end))
    //                      / (t1a.endtime - t1a.starttime)) AS bytecount
    //                   FROM (SELECT DISTINCT substring(entityid from
    //                       'bucketregexp') AS bucket, endtime FROM repohistory WHERE
    //                       criteria) t0a, repohistory t1a
    //                   WHERE t0a.bucket=substring(t1a.entityid from 'bucket_regexp')
    //                      AND (t1a.starttime < t0a.endtime
    //                      AND t1a.endtime > t0a.endtime - interval)
    //                      AND criteria on t1a
    //                   GROUP BY bucket,windowstart,windowend) t2
    //              ORDER BY bucket ASC, bytecount DESC) t3) t4
    // ORDER BY xxx LIMIT yyy OFFSET zzz;

I have low confidence that ANY planner would be able to locate the
common part of a 2x larger query and not do it twice.

Karl



On Sun, Sep 19, 2010 at 12:05 PM, Alexey Serba ase...@gmail.com wrote:
 The other thing is that we cannot afford to use the same table
 twice, as it is actually an extremely expensive query in its own
 right, with multiple joins, select distinct's, etc. under the covers.
 Even if you create indexes on bucket and activitycount columns? It
 might be that the query plans for these two queries (with distinct
 on hack and subquery max/subquery order limit/join) would be the
 same.

 I'd be happy to post it but it may shock you. ;-)
 The way I indent SQL queries should say that I'm not afraid of
 multipage queries :)


 Karl





 On Sun, Sep 19, 2010 at 11:32 AM, Alexey Serba ase...@gmail.com wrote:
 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount AS 
 activitycount, t3.windowstart AS starttime, t3.windowend AS endtime FROM 
 (...) t3
 Do you have primary key in your t3 table?

 In Postgresql, what this does is to return the FIRST entire row matching 
 each distinct idbucket result.
 FIRST based on which sort?

 Lets say you want to return FIRST row based on t3.windowstart column
 and you have primary key in t3 table. Then I believe your query can be
 rewritten in the following ways:

 1. Using subqueries
 SELECT
    bucket, primary_key, windowstart, etc
 FROM
    table AS t1
 WHERE
    windowstart=( SELECT max(windowstart) FROM table AS t2 WHERE
 bucket = t1.bucket )

 2. Using joins instead of subqueries ( in case Derby doesn't support
 subqueries - not sure about that )
 SELECT
    t1.bucket, t1.primary_key, windowstart, etc
 FROM
    table AS t1
    LEFT OUTER JOIN table AS t2 ON ( t1.bucket = t2.bucket AND
 t2.windowstart > t1.windowstart )
 WHERE
    t2.primary_key IS NULL

 HTH,
 Alex

 On Sat, Sep 18, 2010 at 2:28 PM, Karl Wright daddy...@gmail.com wrote:
 Hi Folks,

 For two of the report queries, ACF uses the following Postgresql
 construct, which sadly seems to have no Derby equivalent:

 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount
 AS activitycount, t3.windowstart AS starttime, t3.windowend AS endtime
 FROM (...) t3

 In Postgresql, what this does is to return the FIRST entire row
 matching each distinct idbucket result.  If Derby had a FIRST()
 aggregate function, it would be the equivalent of:

 SELECT t3.bucket AS idbucket, FIRST(t3.activitycount) AS
 activitycount, FIRST(t3.windowstart) AS starttime, FIRST(t3.windowend)
 AS endtime FROM (...) t3 GROUP BY t3.bucket

 Unfortunately, Derby has no such aggregate function.  Furthermore, it
 would not be ideal if I were to do the work myself in ACF, because
 this is a resultset that needs to be 

Re: Derby SQL ideas needed

2010-09-19 Thread Karl Wright
Looking at your proposal:

SELECT
   bucket, primary_key, windowstart, etc
FROM
   table AS t1
WHERE
   windowstart=( SELECT max(windowstart) FROM table AS t2 WHERE
bucket = t1.bucket )

... we'd be looking actually for something more like this:


SELECT
   t1.bucket, t1.bytecount, t1.windowstart, t1.windowend
FROM
   (xxx) t1
WHERE
   t1.bytecount=( SELECT max(t2.bytecount) FROM (xxx) t2 WHERE
t2.bucket = t1.bucket )

... although I've never seen the =(SELECT...) structure before.

Karl


On Sun, Sep 19, 2010 at 12:48 PM, Karl Wright daddy...@gmail.com wrote:
 Here you go:

    // The query we will generate here looks like this:
    // SELECT *
    //   FROM
    //     (SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket,
    //             t3.bytecount AS bytecount,
    //             t3.windowstart AS starttime, t3.windowend AS endtime
    //        FROM (SELECT * FROM (SELECT t0.bucket AS bucket,
    //                   t0.starttime AS windowstart, t0.starttime + interval AS windowend,
    //                   SUM(t1.datasize * ((case when t0.starttime + interval < t1.endtime
    //                       then t0.starttime + interval else t1.endtime end) -
    //                     (case when t0.starttime > t1.starttime
    //                       then t0.starttime else t1.starttime end))
    //                      / (t1.endtime - t1.starttime)) AS bytecount
    //                   FROM (SELECT DISTINCT substring(entityid from
    //                       'bucketregexp') AS bucket, starttime FROM repohistory WHERE
    //                       criteria) t0, repohistory t1
    //                   WHERE t0.bucket=substring(t1.entityid from 'bucket_regexp')
    //                      AND t1.starttime < t0.starttime + interval
    //                      AND t1.endtime > t0.starttime
    //                      AND criteria on t1
    //                   GROUP BY bucket,windowstart,windowend
    //              UNION SELECT t0a.bucket AS bucket, t0a.endtime - interval AS windowstart,
    //                   t0a.endtime AS windowend,
    //                   SUM(t1a.datasize * ((case when t0a.endtime < t1a.endtime
    //                       then t0a.endtime else t1a.endtime end) -
    //                     (case when t0a.endtime - interval > t1a.starttime
    //                       then t0a.endtime - interval else t1a.starttime end))
    //                      / (t1a.endtime - t1a.starttime)) AS bytecount
    //                   FROM (SELECT DISTINCT substring(entityid from
    //                       'bucketregexp') AS bucket, endtime FROM repohistory WHERE
    //                       criteria) t0a, repohistory t1a
    //                   WHERE t0a.bucket=substring(t1a.entityid from 'bucket_regexp')
    //                      AND (t1a.starttime < t0a.endtime
    //                      AND t1a.endtime > t0a.endtime - interval)
    //                      AND criteria on t1a
    //                   GROUP BY bucket,windowstart,windowend) t2
    //              ORDER BY bucket ASC, bytecount DESC) t3) t4
    // ORDER BY xxx LIMIT yyy OFFSET zzz;

 I have low confidence that ANY planner would be able to locate the
 common part of a 2x larger query and not do it twice.

 Karl



 On Sun, Sep 19, 2010 at 12:05 PM, Alexey Serba ase...@gmail.com wrote:
 The other thing is that we cannot afford to use the same table
 twice, as it is actually an extremely expensive query in its own
 right, with multiple joins, select distinct's, etc. under the covers.
 Even if you create indexes on bucket and activitycount columns? It
 might be that the query plans for these two queries (with distinct
 on hack and subquery max/subquery order limit/join) would be the
 same.

 I'd be happy to post it but it may shock you. ;-)
 The way I indent SQL queries should say that I'm not afraid of
 multipage queries :)


 Karl





 On Sun, Sep 19, 2010 at 11:32 AM, Alexey Serba ase...@gmail.com wrote:
 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount AS 
 activitycount, t3.windowstart AS starttime, t3.windowend AS endtime FROM 
 (...) t3
 Do you have primary key in your t3 table?

 In Postgresql, what this does is to return the FIRST entire row matching 
 each distinct idbucket result.
 FIRST based on which sort?

 Lets say you want to return FIRST row based on t3.windowstart column
 and you have primary key in t3 table. Then I believe your query can be
 rewritten in the following ways:

 1. Using subqueries
 SELECT
    bucket, primary_key, windowstart, etc
 FROM
    table AS t1
 WHERE
    windowstart=( SELECT max(windowstart) FROM table AS t2 WHERE
 bucket = t1.bucket )

 2. Using joins instead of subqueries ( in case Derby doesn't support
 subqueries - not sure about that )
 SELECT
    t1.bucket, t1.primary_key, windowstart, etc
 FROM
    table AS t1
    LEFT OUTER JOIN table AS t2 ON ( t1.bucket = t2.bucket AND
 t2.windowstart > t1.windowstart )
 WHERE
    t2.primary_key IS NULL

 HTH,
 Alex

 On Sat, Sep 18, 2010 at 2:28 PM, Karl Wright daddy...@gmail.com wrote:
 Hi Folks,

 For two of the report queries, ACF uses the following Postgresql
 construct, which sadly seems to have no Derby equivalent:

 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount
 AS activitycount, t3.windowstart AS 

Re: Derby SQL ideas needed

2010-09-19 Thread Alexey Serba
You can also try ORDER BY bytecount DESC LIMIT 1 instead of the max
aggregate function, i.e.:

SELECT
t1.bucket, t1.bytecount, t1.windowstart, t1.windowend
FROM
(xxx) t1
WHERE
t1.bytecount=( SELECT t2.bytecount FROM (xxx) t2 WHERE t2.bucket =
t1.bucket ORDER BY t2.bytecount DESC LIMIT 1 )
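
Note that Derby spells LIMIT differently: assuming a reasonably recent
Derby (10.5 or later, if I remember correctly), the subquery would use
the SQL-standard fetch clause instead, e.g.

SELECT t2.bytecount FROM (xxx) t2 WHERE t2.bucket = t1.bucket
ORDER BY t2.bytecount DESC FETCH FIRST 1 ROWS ONLY

(whether Derby accepts ORDER BY inside a subquery is worth verifying).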

On Sun, Sep 19, 2010 at 9:07 PM, Karl Wright daddy...@gmail.com wrote:
 Looking at your proposal:

 SELECT
   bucket, primary_key, windowstart, etc
 FROM
   table AS t1
 WHERE
   windowstart=( SELECT max(windowstart) FROM table AS t2 WHERE
 bucket = t1.bucket )

 ... we'd be looking actually for something more like this:


 SELECT
   t1.bucket, t1.bytecount, t1.windowstart, t1.windowend
 FROM
   (xxx) t1
 WHERE
   t1.bytecount=( SELECT max(t2.bytecount) FROM (xxx) t2 WHERE
 t2.bucket = t1.bucket )

 ... although I've never seen the =(SELECT...) structure before.

 Karl


 On Sun, Sep 19, 2010 at 12:48 PM, Karl Wright daddy...@gmail.com wrote:
 Here you go:

    // The query we will generate here looks like this:
    // SELECT *
    //   FROM
    //     (SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket,
    //             t3.bytecount AS bytecount,
    //             t3.windowstart AS starttime, t3.windowend AS endtime
    //        FROM (SELECT * FROM (SELECT t0.bucket AS bucket,
    //                   t0.starttime AS windowstart, t0.starttime + interval AS windowend,
    //                   SUM(t1.datasize * ((case when t0.starttime + interval < t1.endtime
    //                       then t0.starttime + interval else t1.endtime end) -
    //                     (case when t0.starttime > t1.starttime
    //                       then t0.starttime else t1.starttime end))
    //                      / (t1.endtime - t1.starttime)) AS bytecount
    //                   FROM (SELECT DISTINCT substring(entityid from
    //                       'bucketregexp') AS bucket, starttime FROM repohistory WHERE
    //                       criteria) t0, repohistory t1
    //                   WHERE t0.bucket=substring(t1.entityid from 'bucket_regexp')
    //                      AND t1.starttime < t0.starttime + interval
    //                      AND t1.endtime > t0.starttime
    //                      AND criteria on t1
    //                   GROUP BY bucket,windowstart,windowend
    //              UNION SELECT t0a.bucket AS bucket, t0a.endtime - interval AS windowstart,
    //                   t0a.endtime AS windowend,
    //                   SUM(t1a.datasize * ((case when t0a.endtime < t1a.endtime
    //                       then t0a.endtime else t1a.endtime end) -
    //                     (case when t0a.endtime - interval > t1a.starttime
    //                       then t0a.endtime - interval else t1a.starttime end))
    //                      / (t1a.endtime - t1a.starttime)) AS bytecount
    //                   FROM (SELECT DISTINCT substring(entityid from
    //                       'bucketregexp') AS bucket, endtime FROM repohistory WHERE
    //                       criteria) t0a, repohistory t1a
    //                   WHERE t0a.bucket=substring(t1a.entityid from 'bucket_regexp')
    //                      AND (t1a.starttime < t0a.endtime
    //                      AND t1a.endtime > t0a.endtime - interval)
    //                      AND criteria on t1a
    //                   GROUP BY bucket,windowstart,windowend) t2
    //              ORDER BY bucket ASC, bytecount DESC) t3) t4
    // ORDER BY xxx LIMIT yyy OFFSET zzz;

 I have low confidence that ANY planner would be able to locate the
 common part of a 2x larger query and not do it twice.

 Karl



 On Sun, Sep 19, 2010 at 12:05 PM, Alexey Serba ase...@gmail.com wrote:
 The other thing is that we cannot afford to use the same table
 twice, as it is actually an extremely expensive query in its own
 right, with multiple joins, select distinct's, etc. under the covers.
 Even if you create indexes on bucket and activitycount columns? It
 might be that the query plans for these two queries (with distinct
 on hack and subquery max/subquery order limit/join) would be the
 same.

 I'd be happy to post it but it may shock you. ;-)
 The way I indent SQL queries should say that I'm not afraid of
 multipage queries :)


 Karl





 On Sun, Sep 19, 2010 at 11:32 AM, Alexey Serba ase...@gmail.com wrote:
 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount AS 
 activitycount, t3.windowstart AS starttime, t3.windowend AS endtime FROM 
 (...) t3
 Do you have primary key in your t3 table?

 In Postgresql, what this does is to return the FIRST entire row matching 
 each distinct idbucket result.
 FIRST based on which sort?

 Lets say you want to return FIRST row based on t3.windowstart column
 and you have primary key in t3 table. Then I believe your query can be
 rewritten in the following ways:

 1. Using subqueries
 SELECT
    bucket, primary_key, windowstart, etc
 FROM
    table AS t1
 WHERE
    windowstart=( SELECT max(windowstart) FROM table AS t2 WHERE
 bucket = t1.bucket )

 2. Using joins instead of subqueries ( in case Derby doesn't support
 subqueries - not sure about that )
 SELECT
    t1.bucket, t1.primary_key, windowstart, etc
 FROM
    table AS t1
    LEFT OUTER JOIN table AS t2 ON ( t1.bucket = t2.bucket AND
 t2.windowstart  

Re: Derby SQL ideas needed

2010-09-19 Thread Alexey Serba
And all of this runs against just the single repohistory table, right?
Is this some kind of complex analytics/stats?

On Sun, Sep 19, 2010 at 8:48 PM, Karl Wright daddy...@gmail.com wrote:
 Here you go:

    // The query we will generate here looks like this:
    // SELECT *
    //   FROM
    //     (SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket,
    //             t3.bytecount AS bytecount,
    //             t3.windowstart AS starttime, t3.windowend AS endtime
    //        FROM (SELECT * FROM (SELECT t0.bucket AS bucket,
    //                   t0.starttime AS windowstart, t0.starttime + interval AS windowend,
    //                   SUM(t1.datasize * ((case when t0.starttime + interval < t1.endtime
    //                       then t0.starttime + interval else t1.endtime end) -
    //                     (case when t0.starttime > t1.starttime
    //                       then t0.starttime else t1.starttime end))
    //                      / (t1.endtime - t1.starttime)) AS bytecount
    //                   FROM (SELECT DISTINCT substring(entityid from
    //                       'bucketregexp') AS bucket, starttime FROM repohistory WHERE
    //                       criteria) t0, repohistory t1
    //                   WHERE t0.bucket=substring(t1.entityid from 'bucket_regexp')
    //                      AND t1.starttime < t0.starttime + interval
    //                      AND t1.endtime > t0.starttime
    //                      AND criteria on t1
    //                   GROUP BY bucket,windowstart,windowend
    //              UNION SELECT t0a.bucket AS bucket, t0a.endtime - interval AS windowstart,
    //                   t0a.endtime AS windowend,
    //                   SUM(t1a.datasize * ((case when t0a.endtime < t1a.endtime
    //                       then t0a.endtime else t1a.endtime end) -
    //                     (case when t0a.endtime - interval > t1a.starttime
    //                       then t0a.endtime - interval else t1a.starttime end))
    //                      / (t1a.endtime - t1a.starttime)) AS bytecount
    //                   FROM (SELECT DISTINCT substring(entityid from
    //                       'bucketregexp') AS bucket, endtime FROM repohistory WHERE
    //                       criteria) t0a, repohistory t1a
    //                   WHERE t0a.bucket=substring(t1a.entityid from 'bucket_regexp')
    //                      AND (t1a.starttime < t0a.endtime
    //                      AND t1a.endtime > t0a.endtime - interval)
    //                      AND criteria on t1a
    //                   GROUP BY bucket,windowstart,windowend) t2
    //              ORDER BY bucket ASC, bytecount DESC) t3) t4
    // ORDER BY xxx LIMIT yyy OFFSET zzz;

 I have low confidence that ANY planner would be able to locate the
 common part of a 2x larger query and not do it twice.

 Karl



 On Sun, Sep 19, 2010 at 12:05 PM, Alexey Serba ase...@gmail.com wrote:
 The other thing is that we cannot afford to use the same table
 twice, as it is actually an extremely expensive query in its own
 right, with multiple joins, select distinct's, etc. under the covers.
 Even if you create indexes on bucket and activitycount columns? It
 might be that the query plans for these two queries (with distinct
 on hack and subquery max/subquery order limit/join) would be the
 same.

 I'd be happy to post it but it may shock you. ;-)
 The way I indent SQL queries should say that I'm not afraid of
 multipage queries :)


 Karl





 On Sun, Sep 19, 2010 at 11:32 AM, Alexey Serba ase...@gmail.com wrote:
 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount AS 
 activitycount, t3.windowstart AS starttime, t3.windowend AS endtime FROM 
 (...) t3
 Do you have primary key in your t3 table?

 In Postgresql, what this does is to return the FIRST entire row matching 
 each distinct idbucket result.
 FIRST based on which sort?

 Lets say you want to return FIRST row based on t3.windowstart column
 and you have primary key in t3 table. Then I believe your query can be
 rewritten in the following ways:

 1. Using subqueries
 SELECT
    bucket, primary_key, windowstart, etc
 FROM
    table AS t1
 WHERE
    windowstart=( SELECT max(windowstart) FROM table AS t2 WHERE
 bucket = t1.bucket )

 2. Using joins instead of subqueries ( in case Derby doesn't support
 subqueries - not sure about that )
 SELECT
    t1.bucket, t1.primary_key, windowstart, etc
 FROM
    table AS t1
    LEFT OUTER JOIN table AS t2 ON ( t1.bucket = t2.bucket AND
 t2.windowstart > t1.windowstart )
 WHERE
    t2.primary_key IS NULL

 HTH,
 Alex

 On Sat, Sep 18, 2010 at 2:28 PM, Karl Wright daddy...@gmail.com wrote:
 Hi Folks,

 For two of the report queries, ACF uses the following Postgresql
 construct, which sadly seems to have no Derby equivalent:

 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount
 AS activitycount, t3.windowstart AS starttime, t3.windowend AS endtime
 FROM (...) t3

 In Postgresql, what this does is to return the FIRST entire row
 matching each distinct idbucket result.  If Derby had a FIRST()
 aggregate function, it would be the equivalent of:

 SELECT t3.bucket AS idbucket, FIRST(t3.activitycount) AS
 activitycount, FIRST(t3.windowstart) AS starttime, FIRST(t3.windowend)
 AS endtime FROM (...) 

Re: Derby SQL ideas needed

2010-09-19 Thread Karl Wright
Yes.  This is for the Max Activity and Max Bandwidth reports.
Karl

On Sun, Sep 19, 2010 at 2:13 PM, Alexey Serba ase...@gmail.com wrote:
 And all of this is only with single table repohistory, right? Is this
 some kind of complex analytics/stats?

 On Sun, Sep 19, 2010 at 8:48 PM, Karl Wright daddy...@gmail.com wrote:
 Here you go:

    // The query we will generate here looks like this:
    // SELECT *
    //   FROM
    //     (SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket,
    //             t3.bytecount AS bytecount,
    //             t3.windowstart AS starttime, t3.windowend AS endtime
    //        FROM (SELECT * FROM (SELECT t0.bucket AS bucket,
    //                   t0.starttime AS windowstart, t0.starttime + interval AS windowend,
    //                   SUM(t1.datasize * ((case when t0.starttime + interval < t1.endtime
    //                       then t0.starttime + interval else t1.endtime end) -
    //                     (case when t0.starttime > t1.starttime
    //                       then t0.starttime else t1.starttime end))
    //                      / (t1.endtime - t1.starttime)) AS bytecount
    //                   FROM (SELECT DISTINCT substring(entityid from
    //                       'bucketregexp') AS bucket, starttime FROM repohistory WHERE
    //                       criteria) t0, repohistory t1
    //                   WHERE t0.bucket=substring(t1.entityid from 'bucket_regexp')
    //                      AND t1.starttime < t0.starttime + interval
    //                      AND t1.endtime > t0.starttime
    //                      AND criteria on t1
    //                   GROUP BY bucket,windowstart,windowend
    //              UNION SELECT t0a.bucket AS bucket, t0a.endtime - interval AS windowstart,
    //                   t0a.endtime AS windowend,
    //                   SUM(t1a.datasize * ((case when t0a.endtime < t1a.endtime
    //                       then t0a.endtime else t1a.endtime end) -
    //                     (case when t0a.endtime - interval > t1a.starttime
    //                       then t0a.endtime - interval else t1a.starttime end))
    //                      / (t1a.endtime - t1a.starttime)) AS bytecount
    //                   FROM (SELECT DISTINCT substring(entityid from
    //                       'bucketregexp') AS bucket, endtime FROM repohistory WHERE
    //                       criteria) t0a, repohistory t1a
    //                   WHERE t0a.bucket=substring(t1a.entityid from 'bucket_regexp')
    //                      AND (t1a.starttime < t0a.endtime
    //                      AND t1a.endtime > t0a.endtime - interval)
    //                      AND criteria on t1a
    //                   GROUP BY bucket,windowstart,windowend) t2
    //              ORDER BY bucket ASC, bytecount DESC) t3) t4
    // ORDER BY xxx LIMIT yyy OFFSET zzz;

 I have low confidence that ANY planner would be able to locate the
 common part of a 2x larger query and not do it twice.

 Karl



 On Sun, Sep 19, 2010 at 12:05 PM, Alexey Serba ase...@gmail.com wrote:
 The other thing is that we cannot afford to use the same table
 twice, as it is actually an extremely expensive query in its own
 right, with multiple joins, select distinct's, etc. under the covers.
 Even if you create indexes on bucket and activitycount columns? It
 might be that the query plans for these two queries (with distinct
 on hack and subquery max/subquery order limit/join) would be the
 same.

 I'd be happy to post it but it may shock you. ;-)
 The way I indent SQL queries should say that I'm not afraid of
 multipage queries :)


 Karl





 On Sun, Sep 19, 2010 at 11:32 AM, Alexey Serba ase...@gmail.com wrote:
 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount AS 
 activitycount, t3.windowstart AS starttime, t3.windowend AS endtime FROM 
 (...) t3
 Do you have primary key in your t3 table?

 In Postgresql, what this does is to return the FIRST entire row matching 
 each distinct idbucket result.
 FIRST based on which sort?

 Lets say you want to return FIRST row based on t3.windowstart column
 and you have primary key in t3 table. Then I believe your query can be
 rewritten in the following ways:

 1. Using subqueries
 SELECT
    bucket, primary_key, windowstart, etc
 FROM
    table AS t1
 WHERE
    windowstart=( SELECT max(windowstart) FROM table AS t2 WHERE
 bucket = t1.bucket )

 2. Using joins instead of subqueries ( in case Derby doesn't support
 subqueries - not sure about that )
 SELECT
    t1.bucket, t1.primary_key, windowstart, etc
 FROM
    table AS t1
    LEFT OUTER JOIN table AS t2 ON ( t1.bucket = t2.bucket AND
 t2.windowstart > t1.windowstart )
 WHERE
    t2.primary_key IS NULL

 HTH,
 Alex

 On Sat, Sep 18, 2010 at 2:28 PM, Karl Wright daddy...@gmail.com wrote:
 Hi Folks,

 For two of the report queries, ACF uses the following Postgresql
 construct, which sadly seems to have no Derby equivalent:

 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount
 AS activitycount, t3.windowstart AS starttime, t3.windowend AS endtime
 FROM (...) t3

 In Postgresql, what this does is to return the FIRST entire row
 matching each distinct idbucket result.  If Derby had a FIRST()
 aggregate function, it would be the equivalent of:

 SELECT 

Re: Derby SQL ideas needed

2010-09-18 Thread Karl Wright
For what it's worth, defining a Derby function seems like the only way
to do it.  These seem to call arbitrary Java that can accept a query
as an argument and return a resultset as the result.  But in order to
write such a thing I will need the ability to call Derby at a Java
level, I think, rather than through JDBC.  Still looking for a good
example from somebody who has done something similar.

Karl

On Sat, Sep 18, 2010 at 6:28 AM, Karl Wright daddy...@gmail.com wrote:
 Hi Folks,

 For two of the report queries, ACF uses the following Postgresql
 construct, which sadly seems to have no Derby equivalent:

 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount
 AS activitycount, t3.windowstart AS starttime, t3.windowend AS endtime
 FROM (...) t3

 In Postgresql, what this does is to return the FIRST entire row
 matching each distinct idbucket result.  If Derby had a FIRST()
 aggregate function, it would be the equivalent of:

 SELECT t3.bucket AS idbucket, FIRST(t3.activitycount) AS
 activitycount, FIRST(t3.windowstart) AS starttime, FIRST(t3.windowend)
 AS endtime FROM (...) t3 GROUP BY t3.bucket

 Unfortunately, Derby has no such aggregate function.  Furthermore, it
 would not be ideal if I were to do the work myself in ACF, because
 this is a resultset that needs to be paged through with offset and
 length, for presentation to the user and sorting, so it gets wrapped
 in another SELECT ... FROM (...) ORDER BY ... OFFSET ... LIMIT ...
 that does that part.

 Does anyone have any ideas and/or Derby contacts?  I'd really like the
 quick-start example to have a functional set of reports.

 Karl



Re: Derby SQL ideas needed

2010-09-18 Thread Karl Wright
The Derby table-result function syntax requires all output columns to
be declared as part of the function definition, and more importantly
it does not seem to allow calls into Derby itself to get results.  So
this would not seem to be a viable option for that reason.
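
For the record, the declaration shape in question is roughly the
following sketch (the function, column, and class names here are
hypothetical); the RETURNS TABLE column list has to be fixed up front,
and the backing Java method has to produce the java.sql.ResultSet
itself:

CREATE FUNCTION firstPerBucket ()
RETURNS TABLE (idbucket VARCHAR(255), activitycount BIGINT,
               starttime BIGINT, endtime BIGINT)
LANGUAGE JAVA
PARAMETER STYLE DERBY_JDBC_RESULT_SET
READS SQL DATA
EXTERNAL NAME 'org.example.Reports.firstPerBucket'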

Back to square 1, I guess.  Derby doesn't seem to allow any way to
declare aggregate functions either, so I couldn't declare a FIRST()
aggregate method as proposed below.  Simple arithmetic functions seem
like they would work, but that's not helpful here.

Karl



On Sat, Sep 18, 2010 at 6:45 AM, Karl Wright daddy...@gmail.com wrote:
 For what it's worth, defining a Derby function seems like the only way
 to do it.  These seem to call arbitrary java that can accept a query
 as an argument and return a resultset as the result.  But in order to
 write such a thing I will need the ability to call Derby at a java
 level, I think, rather than through JDBC.  Still looking for a good
 example from somebody who has done something similar.

 Karl

 On Sat, Sep 18, 2010 at 6:28 AM, Karl Wright daddy...@gmail.com wrote:
 Hi Folks,

 For two of the report queries, ACF uses the following Postgresql
 construct, which sadly seems to have no Derby equivalent:

 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount
 AS activitycount, t3.windowstart AS starttime, t3.windowend AS endtime
 FROM (...) t3

 In Postgresql, what this does is to return the FIRST entire row
 matching each distinct idbucket result.  If Derby had a FIRST()
 aggregate function, it would be the equivalent of:

 SELECT t3.bucket AS idbucket, FIRST(t3.activitycount) AS
 activitycount, FIRST(t3.windowstart) AS starttime, FIRST(t3.windowend)
 AS endtime FROM (...) t3 GROUP BY t3.bucket

 Unfortunately, Derby has no such aggregate function.  Furthermore, it
 would not be ideal if I were to do the work myself in ACF, because
 this is a resultset that needs to be paged through with offset and
 length, for presentation to the user and sorting, so it gets wrapped
 in another SELECT ... FROM (...) ORDER BY ... OFFSET ... LIMIT ...
 that does that part.

 Does anyone have any ideas and/or Derby contacts?  I'd really like the
 quick-start example to have a functional set of reports.

 Karl




RE: Derby/JUnit bad interaction - any ideas?

2010-06-09 Thread karl.wright
OpenJDK does not seem to work properly with most Java applications at this 
time, although it has continued to improve.  Its switch incompatibilities stop 
it from working with Ant, so one cannot even build LCF with it.

Karl


From: ext Olivier Bourgeat [olivier.bourg...@polyspot.com]
Sent: Wednesday, June 09, 2010 4:03 AM
To: connectors-dev@incubator.apache.org
Subject: RE: Derby/JUnit bad interaction - any ideas?

Debian Lenny has openjdk-6:
http://packages.debian.org/fr/source/lenny/openjdk-6

Olivier

On Tuesday, June 8, 2010, at 22:37 +0200, karl.wri...@nokia.com wrote:
 MetaCarta is running Debian Lenny, which does not have a 1.6 version of Java 
 available at this time.

 Karl


 -Original Message-
 From: ext Jack Krupansky [mailto:jack.krupan...@lucidimagination.com]
 Sent: Tuesday, June 08, 2010 4:36 PM
 To: connectors-dev@incubator.apache.org
 Subject: Re: Derby/JUnit bad interaction - any ideas?

 If we need to require Java 1.6, that is probably okay. I am fine with that.
 Does anybody have a serious objection to requiring Java 1.6 for LCF?

 -- Jack Krupansky

 --
 From: karl.wri...@nokia.com
 Sent: Tuesday, June 08, 2010 6:35 AM
 To: connectors-dev@incubator.apache.org
 Subject: Derby/JUnit bad interaction - any ideas?

  I've been trying to get some basic tests working under Junit.
  Unfortunately, I've run into a Derby problem which prevents these tests
  from working.
 
  What happens is this.  Derby, when it creates a database, forces a number
  of directories within the database to read-only.  Unfortunately, unless
  we stipulate Java 1.6 or up, there is no native Java way to make these
  directories become non-read-only.  So database cleanup always fails to
  actually remove the old database, and then new database creation
  subsequently fails.
 
  So there are two possibilities.  First, we can change things so we never
  actually try to clean up the Derby DB.  Second, we can mandate the java
  1.6 is used for LCF.  That's all there really is.
 
  The first possibility is tricky but doable - I think.  The second would
  probably be unacceptable in many ways.
 
  Thoughts?
 
  Karl
 
 
 
 






Re: Derby/JUnit bad interaction - any ideas?

2010-06-09 Thread Mark Miller

On 6/8/10 6:35 AM, karl.wri...@nokia.com wrote:

I've been trying to get some basic tests working under Junit.  Unfortunately, 
I've run into a Derby problem which prevents these tests from working.

What happens is this.  Derby, when it creates a database, forces a number of directories 
within the database to read-only.  Unfortunately, unless we stipulate Java 
1.6 or up, there is no native Java way to make these directories become non-read-only.  
So database cleanup always fails to actually remove the old database, and then new 
database creation subsequently fails.

So there are two possibilities.  First, we can change things so we never 
actually try to clean up the Derby DB.  Second, we can mandate the java 1.6 is 
used for LCF.  That's all there really is.

The first possibility is tricky but doable - I think.  The second would 
probably be unacceptable in many ways.

Thoughts?

Karl






So I've been thinking about this - I still have trouble believing this 
is a real problem. I had a large suite of tests that used embedded derby 
in a system I worked on a few years back - and I never had any trouble 
removing the db dir after shutting down derby.


Looking at the code, have you actually tried shutting down derby?

Currently you have:

// Cause database to shut down
new Database(context,_url+databaseName+";shutdown=true",_driver,databaseName,"","");
// DO NOT delete user or shutdown database, since this is in fact
// impossible under java 1.5 (since Derby makes its directories read-only, and
// there's no way to undo that...)
// rm -rf databasename
//File f = new File(databaseName);
//recursiveDelete(f);

But that is not going to do the shutdown?
On a quick look, doing new Database(context, url ...
does not actually contact the db - so it's not going to cause it to shut down?

Is this just cruft code and you have actually tried shutting down as well?

Something makes me think the delete is going to work if you actually 
attempt to connect with the '...;shutdown=true' JDBC URL.


--
- Mark

http://www.lucidimagination.com


RE: Derby/JUnit bad interaction - any ideas?

2010-06-09 Thread karl.wright
This actually did work, oddly enough.  I wonder how Derby is undoing the 
read-only attribute on those directories?  But in any case, I'm revamping the 
core setup/shutdown code again so that there's a decent hook in place to do the 
derby shutdown.
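
A minimal sketch of what such a hook boils down to (class and method
names hypothetical, not the actual LCF code): embedded Derby treats a
connection attempt with ;shutdown=true as a shutdown request and
reports success by *throwing* an SQLException, so that exception has to
be swallowed deliberately.

import java.sql.DriverManager;
import java.sql.SQLException;

public class DerbyShutdownHook
{
  public static void shutdownDatabase(String databaseName)
    throws SQLException
  {
    try
    {
      // This call never returns a usable connection; it triggers shutdown.
      DriverManager.getConnection("jdbc:derby:" + databaseName + ";shutdown=true");
    }
    catch (SQLException e)
    {
      // SQLState 08006 signals a clean single-database shutdown.
      if (!"08006".equals(e.getSQLState()))
        throw e;
    }
  }
}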

Karl


-Original Message-
From: ext Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Wednesday, June 09, 2010 4:26 PM
To: connectors-dev@incubator.apache.org
Subject: Re: Derby/JUnit bad interaction - any ideas?

On 6/8/10 6:35 AM, karl.wri...@nokia.com wrote:
 I've been trying to get some basic tests working under Junit.  Unfortunately, 
 I've run into a Derby problem which prevents these tests from working.

 What happens is this.  Derby, when it creates a database, forces a number of 
 directories within the database to read-only.  Unfortunately, unless we 
 stipulate Java 1.6 or up, there is no native Java way to make these 
 directories become non-read-only.  So database cleanup always fails to 
 actually remove the old database, and then new database creation subsequently 
 fails.

 So there are two possibilities.  First, we can change things so we never 
 actually try to clean up the Derby DB.  Second, we can mandate the java 1.6 
 is used for LCF.  That's all there really is.

 The first possibility is tricky but doable - I think.  The second would 
 probably be unacceptable in many ways.

 Thoughts?

 Karl





So I've been thinking about this - I still have trouble believing this 
is a real problem. I had a large suite of tests that used embedded derby 
in a system I worked on a few years back - and I never had any trouble 
removing the db dir after shutting down derby.

Looking at the code, have you actually tried shutting down derby?

Currently you have:

 // Cause database to shut down
 new Database(context,_url+databaseName+";shutdown=true",_driver,databaseName,"","");
 // DO NOT delete user or shutdown database, since this is in fact
 // impossible under java 1.5 (since Derby makes its directories read-only, and
 // there's no way to undo that...)
 // rm -rf databasename
 //File f = new File(databaseName);
 //recursiveDelete(f);

But that is not going to do the shutdown?
On a quick look, doing new Database(context, url ...
does not actually contact the db - so it's not going to cause it to shut down?

Is this just cruft code and you have actually tried shutting down as well?

Something makes me think the delete is going to work if you actually 
attempt to connect with the '...;shutdown=true' JDBC URL.

-- 
- Mark

http://www.lucidimagination.com


Re: Derby/JUnit bad interaction - any ideas?

2010-06-08 Thread Mark Miller

On 6/8/10 6:35 AM, karl.wri...@nokia.com wrote:

I've been trying to get some basic tests working under Junit.  Unfortunately, 
I've run into a Derby problem which prevents these tests from working.

What happens is this.  Derby, when it creates a database, forces a number of directories 
within the database to read-only.  Unfortunately, unless we stipulate Java 
1.6 or up, there is no native Java way to make these directories become non-read-only.  
So database cleanup always fails to actually remove the old database, and then new 
database creation subsequently fails.

So there are two possibilities.  First, we can change things so we never 
actually try to clean up the Derby DB.  Second, we can mandate the java 1.6 is 
used for LCF.  That's all there really is.

The first possibility is tricky but doable - I think.  The second would 
probably be unacceptable in many ways.

Thoughts?

Karl






Interesting - when I worked with derby in the past, I never had any 
trouble deleting a database after shutting it down on windows using Java 
5. It worked great with my unit tests.


You could always run each test in a new system tmp dir every time...

I find it hard to believe you cannot delete the database somehow though 
- like I said, I never had any problems with it using embedded derby in 
the past after shutting down the db.


--
- Mark

http://www.lucidimagination.com


Re: Derby/JUnit bad interaction - any ideas?

2010-06-08 Thread Koji Sekiguchi

(10/06/08 22:35), karl.wri...@nokia.com wrote:

I've been trying to get some basic tests working under Junit.  Unfortunately, 
I've run into a Derby problem which prevents these tests from working.

What happens is this.  Derby, when it creates a database, forces a number of directories 
within the database to read-only.  Unfortunately, unless we stipulate Java 
1.6 or up, there is no native Java way to make these directories become non-read-only.  
So database cleanup always fails to actually remove the old database, and then new 
database creation subsequently fails.

So there are two possibilities.  First, we can change things so we never 
actually try to clean up the Derby DB.  Second, we can mandate the java 1.6 is 
used for LCF.  That's all there really is.

The first possibility is tricky but doable - I think.  The second would 
probably be unacceptable in many ways.

Thoughts?

Karl
   

Hi Karl,

If it is possible, the Ant chmod task can be used, or
you can consult its implementation. But the Ant manual
says of the task:

 Right now it has effect only under Unix or NonStop Kernel (Tandem).
http://ant.apache.org/manual/Tasks/chmod.html

Koji

--
http://www.rondhuit.com/en/



RE: Derby/JUnit bad interaction - any ideas?

2010-06-08 Thread karl.wright
Huh.  I wonder how Ant is doing it?

Using the Ant task directly makes it impossible to do this from within JUnit, 
of course, but maybe the same hack can be done inside the test stuff.

Karl

-Original Message-
From: ext Koji Sekiguchi [mailto:k...@r.email.ne.jp] 
Sent: Tuesday, June 08, 2010 10:08 AM
To: connectors-dev@incubator.apache.org
Subject: Re: Derby/JUnit bad interaction - any ideas?

(10/06/08 22:35), karl.wri...@nokia.com wrote:
 I've been trying to get some basic tests working under Junit.  Unfortunately, 
 I've run into a Derby problem which prevents these tests from working.

 What happens is this.  Derby, when it creates a database, forces a number of 
 directories within the database to read-only.  Unfortunately, unless we 
 stipulate Java 1.6 or up, there is no native Java way to make these 
 directories become non-read-only.  So database cleanup always fails to 
 actually remove the old database, and then new database creation subsequently 
 fails.

 So there are two possibilities.  First, we can change things so we never 
 actually try to clean up the Derby DB.  Second, we can mandate the java 1.6 
 is used for LCF.  That's all there really is.

 The first possibility is tricky but doable - I think.  The second would 
 probably be unacceptable in many ways.

 Thoughts?

 Karl

Hi Karl,

If it is possible, Ant chmod task can be used, or
you can consult the implementation. But Ant manual
says for the task:

 Right now it has effect only under Unix or NonStop Kernel (Tandem).
http://ant.apache.org/manual/Tasks/chmod.html

Koji

-- 
http://www.rondhuit.com/en/



RE: Derby/JUnit bad interaction - any ideas?

2010-06-08 Thread karl.wright
Yeah, I was pretty surprised too.  But on Windows it is likely that 
File.setReadOnly() (which is what Derby must be using) doesn't actually do 
anything to directories, which would explain the discrepancy.
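
If we ever do require Java 1.6, the cleanup could clear that flag
explicitly before deleting - a sketch only (hypothetical helper, not
current LCF code), since File.setWritable() only exists from 1.6 on:

import java.io.File;

public class DbCleanup
{
  public static void recursiveDelete(File f)
  {
    f.setWritable(true);          // requires Java 1.6
    File[] children = f.listFiles();
    if (children != null)
    {
      for (File child : children)
        recursiveDelete(child);
    }
    f.delete();
  }
}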

Karl


-Original Message-
From: ext Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Tuesday, June 08, 2010 9:45 AM
To: connectors-dev@incubator.apache.org
Subject: Re: Derby/JUnit bad interaction - any ideas?

On 6/8/10 6:35 AM, karl.wri...@nokia.com wrote:
 I've been trying to get some basic tests working under Junit.  Unfortunately, 
 I've run into a Derby problem which prevents these tests from working.

 What happens is this.  Derby, when it creates a database, forces a number of 
 directories within the database to read-only.  Unfortunately, unless we 
 stipulate Java 1.6 or up, there is no native Java way to make these 
 directories become non-read-only.  So database cleanup always fails to 
 actually remove the old database, and then new database creation subsequently 
 fails.

 So there are two possibilities.  First, we can change things so we never 
 actually try to clean up the Derby DB.  Second, we can mandate the java 1.6 
 is used for LCF.  That's all there really is.

 The first possibility is tricky but doable - I think.  The second would 
 probably be unacceptable in many ways.

 Thoughts?

 Karl





Interesting - when I worked with derby in the past, I never had any 
trouble deleting a database after shutting it down on windows using Java 
5. It worked great with my unit tests.

You could always run each test in a new system tmp dir every time...

I find it hard to believe you cannot delete the database somehow though 
- like I said, I never had any problems with it using embedded derby in 
the past after shutting down the db.

-- 
- Mark

http://www.lucidimagination.com


Re: Derby/JUnit bad interaction - any ideas?

2010-06-08 Thread Koji Sekiguchi

(10/06/08 23:14), karl.wri...@nokia.com wrote:

Yeah, I was pretty surprised too.  But on Windows it is likely that 
File.setReadOnly() (which is what Derby must be using) doesn't actually do 
anything to directories, which would explain the discrepancy.

Karl

   

If so, luckily Ant hack can solve the problem on Linux.

Koji

--
http://www.rondhuit.com/en/



RE: Derby/JUnit bad interaction - any ideas?

2010-06-08 Thread karl.wright
I just had a look at the sources.  Ant's chmod task queries what kind of OS it 
is, and if it is the right kind, it actually attempts to fire off the chmod 
utility. ;-)

That's pretty hacky.  Nice to avoid that if possible.

Now, I was able to get my current set of brain-dead tests to work OK (and the 
Ant cleanup too!) by making sure that the database was properly cleaned after 
every use, and leaving it around for later.  It turns out that Ant can delete 
the testing directory even though the directory underneath it has read-only 
stuff in it, even without the chmod.  This seems to be because when it fails 
any deletion, it simply calls f.deleteOnExit() and lets the JVM do it later - 
and apparently the JVM *can* do this, because it's implemented to just do an 
unlink at that time, which bypasses the need to actually delete any read-only 
subdirectories.

Oh my.  What a strange mess.

Still, things are currently working, so I guess I'll leave them as they are, 
for now.

Karl


-Original Message-
From: ext Koji Sekiguchi [mailto:k...@r.email.ne.jp] 
Sent: Tuesday, June 08, 2010 10:30 AM
To: connectors-dev@incubator.apache.org
Subject: Re: Derby/JUnit bad interaction - any ideas?

(10/06/08 23:14), karl.wri...@nokia.com wrote:
 Yeah, I was pretty surprised too.  But on Windows it is likely that 
 File.setReadOnly() (which is what Derby must be using) doesn't actually do 
 anything to directories, which would explain the discrepancy.

 Karl


If so, luckily Ant hack can solve the problem on Linux.

Koji

-- 
http://www.rondhuit.com/en/



Re: Derby/JUnit bad interaction - any ideas?

2010-06-08 Thread Jack Krupansky
If we need to require Java 1.6, that is probably okay. I am fine with that. 
Does anybody have a serious objection to requiring Java 1.6 for LCF?


-- Jack Krupansky

--
From: karl.wri...@nokia.com
Sent: Tuesday, June 08, 2010 6:35 AM
To: connectors-dev@incubator.apache.org
Subject: Derby/JUnit bad interaction - any ideas?

I've been trying to get some basic tests working under Junit. 
Unfortunately, I've run into a Derby problem which prevents these tests 
from working.


What happens is this.  Derby, when it creates a database, forces a number 
of directories within the database to read-only.  Unfortunately, unless 
we stipulate Java 1.6 or up, there is no native Java way to make these 
directories become non-read-only.  So database cleanup always fails to 
actually remove the old database, and then new database creation 
subsequently fails.


So there are two possibilities.  First, we can change things so we never 
actually try to clean up the Derby DB.  Second, we can mandate the java 
1.6 is used for LCF.  That's all there really is.


The first possibility is tricky but doable - I think.  The second would 
probably be unacceptable in many ways.


Thoughts?

Karl






RE: Derby

2010-06-04 Thread karl.wright
The reason this occurs is because I am using Derby in embedded mode, and the 
restriction appears to be a limitation of that mode of operation.  However, 
this mode is necessary to meet the testing goal, which was the prime motivator 
behind doing a Derby implementation.  I am sure that if we were to use Derby as 
a service, the restriction would no longer apply, but then there would be no 
conceivable benefit either.
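
For reference, the two modes differ only in the JDBC URL (database name
hypothetical here); embedded Derby admits exactly one JVM, while the
network server would lift that restriction at the cost of running Derby
as a separate service:

jdbc:derby:lcfdb                       -- embedded, in-process, one JVM only
jdbc:derby://localhost:1527/lcfdb      -- Derby network server (default port 1527)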

Karl

-Original Message-
From: ext Jack Krupansky [mailto:jack.krupan...@lucidimagination.com] 
Sent: Friday, June 04, 2010 12:41 AM
To: connectors-dev@incubator.apache.org
Subject: Re: Derby

What is the nature of the single LCF process issue? Is it because the 
database is being used in single-user mode, or some other issue? Is it a 
permanent issue, or is there a solution or workaround anticipated at some 
stage?

Thanks.

-- Jack Krupansky

--
From: karl.wri...@nokia.com
Sent: Thursday, June 03, 2010 5:34 PM
To: connectors-dev@incubator.apache.org
Subject: Derby

 For what it's worth, after some 5 days of work, and a couple of schema 
 changes to boot, LCF now runs with Derby.
 Some caveats:

 (1) You can't run more than one LCF process at a time.  That means you 
 need to either run the daemon or the crawler-ui web application, but you 
 can't run both at the same time.
 (2) I haven't tested every query, so I'm sure there are probably some 
 that are still broken.
 (3) It's slow.  Count yourself as fortunate if it runs 1/5 the rate of 
 Postgresql for you.
 (4) Transactional integrity hasn't been evaluated.
 (5) Deadlock detection and unique constraint violation detection is 
 probably not right, because I'd need to cause these errors to occur before 
 being able to key off their exception messages.
 (6) I had to turn off the ability to sort on certain columns in the 
 reports - basically, any column that was represented as a large character 
 field.

 Nevertheless, this represents an important milestone on the path to being 
 able to write some kind of unit tests that have at least some meaning.

 If you have an existing LCF Postgresql database, you will need to force an 
 upgrade after going to the new trunk code.  To do this, repeat the 
 org.apache.lcf.agents.Install command, and the 
 org.apache.lcf.agents.Register 
 org.apache.lcf.crawler.system.CrawlerAgent command after deploying the 
 new code.  And, please, let me know of any kind of errors you notice that 
 could be related to the schema change.

 Thanks,
 Karl


 


RE: Derby

2010-06-04 Thread karl.wright
Yup.

Karl

-Original Message-
From: ext Jack Krupansky [mailto:jack.krupan...@lucidimagination.com] 
Sent: Friday, June 04, 2010 12:27 AM
To: connectors-dev@incubator.apache.org
Subject: Re: Derby

Just to be clear, the full sequence would be:

1) Start UI app. Agent process should not be running.
2) Start LCF job in UI.
3) Shutdown UI app. Not just close the browser window.
4) AgentRun.
5) Wait long enough for crawl to have finished. Maybe watch to see that Solr 
has become idle.
6) Possibly commit to Solr.
7) AgentStop.
8) Back to step 1 for additional jobs.

Correct?

-- Jack Krupansky



 


RE: Derby

2010-06-03 Thread karl.wright
The daemon does not need to interact with the UI directly, only with the 
database.  So, you stop the UI, start the daemon, and after a while, shut down 
the daemon and restart the UI.

Karl

-Original Message-
From: ext Jack Krupansky [mailto:jack.krupan...@lucidimagination.com] 
Sent: Thursday, June 03, 2010 5:51 PM
To: connectors-dev@incubator.apache.org
Subject: Re: Derby

 (1) You can't run more than one LCF process at a time.  That means you 
 need to either run the daemon or the crawler-ui web application, but you 
 can't run both at the same time.

How do you start a crawl, then, if not in the web app, which then starts the 
agent process crawling?

Thanks for all of this effort!

-- Jack Krupansky



 


Re: Derby

2010-06-03 Thread Jack Krupansky

Just to be clear, the full sequence would be:

1) Start UI app. Agent process should not be running.
2) Start LCF job in UI.
3) Shutdown UI app. Not just close the browser window.
4) AgentRun.
5) Wait long enough for the crawl to have finished. Maybe watch to see that Solr 
has become idle.
6) Possibly commit to Solr.
7) AgentStop.
8) Back to step 1 for additional jobs.

Correct?

-- Jack Krupansky
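
A sketch of steps 4 through 7 scripted as a driver program. The
fully-qualified class names are assumed by analogy with the
org.apache.lcf.agents.Install command; AgentRun and AgentStop are taken to
be ordinary main classes launched in separate JVMs with the LCF classpath
supplied externally, and the fixed sleep in step 5 is only a placeholder
for a real idle check:

import java.util.concurrent.TimeUnit;

public class CrawlSequence {
    public static void main(String[] args) throws Exception {
        // Step 4: run the agent in its own JVM; this driver never touches
        // the embedded Derby database itself.
        Process agent = new ProcessBuilder("java", "org.apache.lcf.agents.AgentRun")
                .inheritIO().start();

        // Step 5: placeholder wait; a real script would poll job status or
        // watch Solr for idleness instead of sleeping for a fixed interval.
        TimeUnit.MINUTES.sleep(30);

        // Step 7: ask the agent to shut down, then wait for it to exit.
        new ProcessBuilder("java", "org.apache.lcf.agents.AgentStop")
                .inheritIO().start().waitFor();
        agent.waitFor();
    }
}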

--
From: karl.wri...@nokia.com
Sent: Thursday, June 03, 2010 7:24 PM
To: connectors-dev@incubator.apache.org
Subject: RE: Derby

The daemon does not need to interact with the UI directly, only with the 
database.  So, you stop the UI, start the daemon, and after a while, shut 
down the daemon and restart the UI.


Karl






Re: Derby

2010-06-03 Thread Jack Krupansky
What is the nature of the single LCF process issue? Is it because the 
database is being used in single-user mode, or some other issue? Is it a 
permanent issue, or is there a solution or workaround anticipated at some 
stage?


Thanks.

-- Jack Krupansky
