Re: Derby SQL ideas needed
> SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount AS activitycount,
>   t3.windowstart AS starttime, t3.windowend AS endtime FROM (...) t3

Do you have a primary key in your t3 table?

> In PostgreSQL, what this does is to return the FIRST entire row matching
> each distinct idbucket result.

FIRST based on which sort? Let's say you want to return the FIRST row based on the t3.windowstart column, and you have a primary key in the t3 table. Then I believe your query can be rewritten in the following ways:

1. Using subqueries

SELECT bucket, primary_key, windowstart, etc
FROM table AS t1
WHERE windowstart=(
  SELECT max(windowstart)
  FROM table AS t2
  WHERE bucket = t1.bucket
)

2. Using joins instead of subqueries (in case Derby doesn't support subqueries - not sure about that)

SELECT t1.bucket, t1.primary_key, windowstart, etc
FROM table AS t1
LEFT OUTER JOIN table AS t2
  ON ( t1.bucket = t2.bucket AND t2.windowstart > t1.windowstart )
WHERE t2.primary_key IS NULL

HTH,
Alex

On Sat, Sep 18, 2010 at 2:28 PM, Karl Wright daddy...@gmail.com wrote:

> Hi Folks,
>
> For two of the report queries, ACF uses the following PostgreSQL construct, which sadly seems to have no Derby equivalent:
>
> SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount AS activitycount,
>   t3.windowstart AS starttime, t3.windowend AS endtime FROM (...) t3
>
> In PostgreSQL, what this does is to return the FIRST entire row matching each distinct idbucket result. If Derby had a FIRST() aggregate function, it would be the equivalent of:
>
> SELECT t3.bucket AS idbucket, FIRST(t3.activitycount) AS activitycount,
>   FIRST(t3.windowstart) AS starttime, FIRST(t3.windowend) AS endtime
> FROM (...) t3
> GROUP BY t3.bucket
>
> Unfortunately, Derby has no such aggregate function. Furthermore, it would not be ideal if I were to do the work myself in ACF, because this is a resultset that needs to be paged through with offset and length, for presentation to the user and sorting, so it gets wrapped in another SELECT ... FROM (...) ORDER BY ... OFFSET ... LIMIT ... that does that part.
>
> Does anyone have any ideas and/or Derby contacts? I'd really like the quick-start example to have a functional set of reports.
>
> Karl
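Alexey's first rewrite (the correlated max() subquery) can be sanity-checked on a toy table. A sketch using Python's built-in sqlite3 as a stand-in for Derby (which needs a JVM), with a made-up t3 schema borrowing the thread's column names:

```python
import sqlite3

# In-memory toy table standing in for the expensive (...) t3 subquery.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t3 (pk INTEGER PRIMARY KEY, bucket TEXT, windowstart INTEGER)")
conn.executemany("INSERT INTO t3 (bucket, windowstart) VALUES (?, ?)",
                 [("a", 10), ("a", 30), ("b", 5), ("b", 7)])

# Rewrite 1: pick each bucket's row with the greatest windowstart
# via a correlated MAX() subquery.
rows = conn.execute("""
    SELECT t1.bucket, t1.windowstart
    FROM t3 AS t1
    WHERE t1.windowstart = (SELECT MAX(t2.windowstart) FROM t3 AS t2
                            WHERE t2.bucket = t1.bucket)
    ORDER BY t1.bucket
""").fetchall()
print(rows)  # [('a', 30), ('b', 7)]
```

Each outer row survives only if its windowstart equals the per-bucket maximum; when that maximum is unique within each bucket this returns exactly one row per bucket, as DISTINCT ON does.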
Re: Derby SQL ideas needed
> FIRST based on which sort?

First based on the existing sort, which is crucial, because the sort is by bucket ASC, activitycount DESC. I'm looking for the row with the highest activitycount, per bucket.

The other thing is that we cannot afford to use the same table twice, as it is actually an extremely expensive query in its own right, with multiple joins, select distinct's, etc. under the covers. I'd be happy to post it but it may shock you. ;-)

Karl
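Alexey's join rewrite, adapted to Karl's actual sort (highest activitycount per bucket), would be an anti-join. A sqlite3 sketch on an invented schema; the pk tiebreaker is an assumption added here so that ties within a bucket still yield a single row, which the plain anti-join would not:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t3 (pk INTEGER PRIMARY KEY, bucket TEXT, activitycount INTEGER)")
conn.executemany("INSERT INTO t3 (bucket, activitycount) VALUES (?, ?)",
                 [("a", 3), ("a", 9), ("b", 4), ("b", 4)])  # note the tie in bucket b

# Keep t1 rows for which no "better" t2 row exists in the same bucket:
# better = higher activitycount, or equal activitycount with a smaller pk.
# The pk tiebreaker keeps exactly one row per bucket, like DISTINCT ON.
rows = conn.execute("""
    SELECT t1.bucket, t1.activitycount
    FROM t3 AS t1
    LEFT OUTER JOIN t3 AS t2
      ON t1.bucket = t2.bucket
     AND (t2.activitycount > t1.activitycount
          OR (t2.activitycount = t1.activitycount AND t2.pk < t1.pk))
    WHERE t2.pk IS NULL
    ORDER BY t1.bucket
""").fetchall()
print(rows)  # [('a', 9), ('b', 4)]
```

The WHERE t2.pk IS NULL clause retains only the rows that found no match in the outer join, i.e. the rows nothing in the same bucket beats.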
Re: Derby SQL ideas needed
> The other thing is that we cannot afford to use the same table twice, as
> it is actually an extremely expensive query in its own right, with
> multiple joins, select distinct's, etc. under the covers.

Even if you create indexes on the bucket and activitycount columns? It might be that the query plans for these two queries (with the distinct-on hack and the subquery max / subquery order limit / join) would be the same.

> I'd be happy to post it but it may shock you. ;-)

The way I indent SQL queries should say that I'm not afraid of multipage queries :)
Re: Derby SQL ideas needed
Here you go:

// The query we will generate here looks like this:
//
// SELECT *
// FROM
//   (SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.bytecount AS bytecount,
//      t3.windowstart AS starttime, t3.windowend AS endtime
//    FROM (SELECT * FROM (SELECT t0.bucket AS bucket, t0.starttime AS windowstart, t0.starttime + interval AS windowend,
//        SUM(t1.datasize * ((case when t0.starttime + interval < t1.endtime then t0.starttime + interval else t1.endtime end) -
//          (case when t0.starttime > t1.starttime then t0.starttime else t1.starttime end))
//          / (t1.endtime - t1.starttime)) AS bytecount
//      FROM (SELECT DISTINCT substring(entityid from 'bucketregexp') AS bucket, starttime FROM repohistory WHERE criteria) t0, repohistory t1
//      WHERE t0.bucket=substring(t1.entityid from 'bucket_regexp')
//        AND t1.starttime < t0.starttime + interval AND t1.endtime > t0.starttime
//        AND criteria on t1
//      GROUP BY bucket,windowstart,windowend
//     UNION SELECT t0a.bucket AS bucket, t0a.endtime - interval AS windowstart, t0a.endtime AS windowend,
//        SUM(t1a.datasize * ((case when t0a.endtime < t1a.endtime then t0a.endtime else t1a.endtime end) -
//          (case when t0a.endtime - interval > t1a.starttime then t0a.endtime - interval else t1a.starttime end))
//          / (t1a.endtime - t1a.starttime)) AS bytecount
//      FROM (SELECT DISTINCT substring(entityid from 'bucketregexp') AS bucket, endtime FROM repohistory WHERE criteria) t0a, repohistory t1a
//      WHERE t0a.bucket=substring(t1a.entityid from 'bucket_regexp')
//        AND t1a.starttime < t0a.endtime AND t1a.endtime > t0a.endtime - interval
//        AND criteria on t1a
//      GROUP BY bucket,windowstart,windowend) t2
//     ORDER BY bucket ASC,bytecount DESC) t3) t4 ORDER BY xxx LIMIT yyy OFFSET zzz;

I have low confidence that ANY planner would be able to locate the common part of a 2x larger query and not do it twice.
Karl
Re: Derby SQL ideas needed
Looking at your proposal:

> SELECT bucket, primary_key, windowstart, etc
> FROM table AS t1
> WHERE windowstart=(
>   SELECT max(windowstart)
>   FROM table AS t2
>   WHERE bucket = t1.bucket
> )

... we'd actually be looking for something more like this:

SELECT t1.bucket, t1.bytecount, t1.windowstart, t1.windowend
FROM (xxx) t1
WHERE t1.bytecount=(
  SELECT max(t2.bytecount)
  FROM (xxx) t2
  WHERE t2.bucket = t1.bucket
)

... although I've never seen the =(SELECT...) structure before.

Karl
Re: Derby SQL ideas needed
You can also try ORDER BY bytecount DESC LIMIT 1 instead of the aggregate function max, i.e.

SELECT t1.bucket, t1.bytecount, t1.windowstart, t1.windowend
FROM (xxx) t1
WHERE t1.bytecount=(
  SELECT t2.bytecount
  FROM (xxx) t2
  WHERE t2.bucket = t1.bucket
  ORDER BY t2.bytecount DESC LIMIT 1
)
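Alexey's ORDER BY ... LIMIT 1 variant behaves the same way on a toy sqlite3 table (sqlite accepts LIMIT inside a subquery; whether Derby's equivalent syntax is allowed there would need checking):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t3 (bucket TEXT, bytecount INTEGER, windowstart INTEGER)")
conn.executemany("INSERT INTO t3 VALUES (?, ?, ?)",
                 [("a", 100, 0), ("a", 250, 10), ("b", 80, 0)])

# Correlated subquery using ORDER BY ... DESC LIMIT 1 instead of MAX();
# useful where the "first" row is defined by a sort rather than an aggregate.
rows = conn.execute("""
    SELECT t1.bucket, t1.bytecount, t1.windowstart
    FROM t3 AS t1
    WHERE t1.bytecount = (SELECT t2.bytecount FROM t3 AS t2
                          WHERE t2.bucket = t1.bucket
                          ORDER BY t2.bytecount DESC LIMIT 1)
    ORDER BY t1.bucket
""").fetchall()
print(rows)  # [('a', 250, 10), ('b', 80, 0)]
```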
Re: Derby SQL ideas needed
And all of this is only with the single table repohistory, right? Is this some kind of complex analytics/stats?
Re: Derby SQL ideas needed
Yes. This is for the Max Activity and Max Bandwidth reports.

Karl
Re: Derby SQL ideas needed
For what it's worth, defining a Derby function seems like the only way to do it. These seem to call arbitrary Java that can accept a query as an argument and return a resultset as the result. But in order to write such a thing I will need the ability to call Derby at a Java level, I think, rather than through JDBC.

Still looking for a good example from somebody who has done something similar.

Karl
Re: Derby SQL ideas needed
The Derby table-result function syntax requires all output columns to be declared as part of the function definition, and more importantly it does not seem to allow calls into Derby itself to get results. So this would not seem to be a viable option, for that reason. Back to square 1, I guess.

Derby doesn't seem to allow any way to declare aggregate functions either, so I couldn't declare a FIRST() aggregate method as proposed earlier in the thread. Simple arithmetic functions seem like they would work, but that's not helpful here.

Karl
Furthermore, it would not be ideal if I were to do the work myself in ACF, because this is a resultset that needs to be paged through with offset and length, for presentation to the user and sorting, so it gets wrapped in another SELECT ... FROM (...) ORDER BY ... OFFSET ... LIMIT ... that does that part. Does anyone have any ideas and/or Derby contacts? I'd really like the quick-start example to have a functional set of reports. Karl
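Lacking FIRST(), the portable workaround is essentially the groupwise-min rewrite Alex suggests elsewhere in the thread, and newer Derby releases (10.5 and up, if I recall correctly) support OFFSET/FETCH for the paging wrapper. A sketch, reusing the column names above; note that unlike DISTINCT ON, if two rows tie on the minimum windowstart within a bucket, both rows come back:

```sql
-- Sketch: (...) stands for the same derived table in both places.
SELECT t3.bucket AS idbucket,
       t3.activitycount AS activitycount,
       t3.windowstart AS starttime,
       t3.windowend AS endtime
FROM (...) t3
WHERE t3.windowstart =
  (SELECT MIN(t4.windowstart)
   FROM (...) t4
   WHERE t4.bucket = t3.bucket)
ORDER BY idbucket
OFFSET 0 ROWS FETCH NEXT 20 ROWS ONLY
```

The correlated subquery re-evaluates the derived table, so for a large inner query it may be worth materializing it into a temporary table first.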
RE: Derby/JUnit bad interaction - any ideas?
OpenJDK does not seem to work properly with most Java applications at this time, although it has continued to improve. Its switch incompatibilities stop it from working with ant at this time, so one cannot even build LCF with it.

Karl

From: ext Olivier Bourgeat [olivier.bourg...@polyspot.com]
Sent: Wednesday, June 09, 2010 4:03 AM
To: connectors-dev@incubator.apache.org
Subject: RE: Derby/JUnit bad interaction - any ideas?

Debian Lenny has openjdk-6: http://packages.debian.org/fr/source/lenny/openjdk-6

Olivier

On Tuesday, June 08, 2010 at 22:37 +0200, karl.wri...@nokia.com wrote:

MetaCarta is running Debian Lenny, which does not have a 1.6 version of Java available at this time.

Karl
Re: Derby/JUnit bad interaction - any ideas?
On 6/8/10 6:35 AM, karl.wri...@nokia.com wrote:

I've been trying to get some basic tests working under JUnit. Unfortunately, I've run into a Derby problem which prevents these tests from working. What happens is this. Derby, when it creates a database, forces a number of directories within the database to be read-only. Unfortunately, unless we stipulate Java 1.6 or up, there is no native Java way to make these directories writable again. So database cleanup always fails to actually remove the old database, and then new database creation subsequently fails.

So there are two possibilities. First, we can change things so we never actually try to clean up the Derby DB. Second, we can mandate that Java 1.6 is used for LCF. That's all there really is. The first possibility is tricky but doable - I think. The second would probably be unacceptable in many ways.

Thoughts?
Karl

So I've been thinking about this - I still have trouble believing this is a real problem. I had a large suite of tests that used embedded Derby in a system I worked on a few years back, and I never had any trouble removing the db dir after shutting down Derby.

Looking at the code, have you actually tried shutting down Derby? Currently you have:

// Cause database to shut down
new Database(context,_url+databaseName+";shutdown=true",_driver,databaseName,"","");
// DO NOT delete user or shutdown database, since this is in fact impossible under java 1.5 (since Derby makes its directories read-only, and
// there's no way to undo that...
// rm -rf databasename
//File f = new File(databaseName);
//recursiveDelete(f);

But that is not going to do the shutdown? On a quick look, doing new Database(context, url, ...) does not actually contact the db - so it's not going to cause it to shut down? Is this just cruft code, or have you actually tried shutting down as well? Something makes me think the delete is going to work if you actually attempt to connect with a '...;shutdown=true' JDBC URL.

--
- Mark

http://www.lucidimagination.com
RE: Derby/JUnit bad interaction - any ideas?
This actually did work, oddly enough. I wonder how Derby is undoing the read-only attribute on those directories? But in any case, I'm revamping the core setup/shutdown code again so that there's a decent hook in place to do the Derby shutdown.

Karl

-----Original Message-----
From: ext Mark Miller [mailto:markrmil...@gmail.com]
Sent: Wednesday, June 09, 2010 4:26 PM
To: connectors-dev@incubator.apache.org
Subject: Re: Derby/JUnit bad interaction - any ideas?

Something makes me think the delete is going to work if you actually attempt to connect with a '...;shutdown=true' JDBC URL.

--
- Mark

http://www.lucidimagination.com
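The shutdown idiom Mark is describing is just a connection attempt on a URL ending in ;shutdown=true, which Derby (oddly) signals by throwing: a clean single-database shutdown comes back as an SQLException with SQLState 08006, and a full engine shutdown as XJ015. A minimal sketch, assuming derby.jar is on the classpath; the database name is hypothetical:

```java
import java.sql.DriverManager;
import java.sql.SQLException;

public class DerbyShutdown
{
  /** Attempt to shut down one embedded Derby database.
      Returns true if Derby reported a clean shutdown. */
  public static boolean shutdownDatabase(String databaseName)
  {
    try
    {
      // A shutdown URL never yields a usable connection...
      DriverManager.getConnection("jdbc:derby:" + databaseName + ";shutdown=true");
      return false;
    }
    catch (SQLException e)
    {
      // ...Derby throws instead; SQLState 08006 means the database
      // shut down cleanly (XJ015 is the whole-engine equivalent).
      return "08006".equals(e.getSQLState());
    }
  }

  public static void main(String[] args)
  {
    System.out.println(shutdownDatabase("lcftestdb"));
  }
}
```

Only after a successful shutdown does deleting the database directory become reliable.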
Re: Derby/JUnit bad interaction - any ideas?
On 6/8/10 6:35 AM, karl.wri...@nokia.com wrote:

I've been trying to get some basic tests working under JUnit. Unfortunately, I've run into a Derby problem which prevents these tests from working.

Interesting - when I worked with Derby in the past, I never had any trouble deleting a database after shutting it down on Windows using Java 5. It worked great with my unit tests. You could always run each test in a new system tmp dir every time... I find it hard to believe you cannot delete the database somehow, though - like I said, I never had any problems with it using embedded Derby in the past after shutting down the db.

--
- Mark

http://www.lucidimagination.com
Re: Derby/JUnit bad interaction - any ideas?
(10/06/08 22:35), karl.wri...@nokia.com wrote:

I've been trying to get some basic tests working under JUnit. Unfortunately, I've run into a Derby problem which prevents these tests from working.

Hi Karl,

If it is possible, the Ant chmod task can be used, or you can consult its implementation. But the Ant manual says of the task: "Right now it has effect only under Unix or NonStop Kernel (Tandem)."

http://ant.apache.org/manual/Tasks/chmod.html

Koji

--
http://www.rondhuit.com/en/
RE: Derby/JUnit bad interaction - any ideas?
Huh. I wonder how ant is doing it? Using the ant task directly makes it impossible to do this from within JUnit, of course, but maybe the same hack can be done inside the test stuff.

Karl

-----Original Message-----
From: ext Koji Sekiguchi [mailto:k...@r.email.ne.jp]
Sent: Tuesday, June 08, 2010 10:08 AM
To: connectors-dev@incubator.apache.org
Subject: Re: Derby/JUnit bad interaction - any ideas?

If it is possible, the Ant chmod task can be used, or you can consult its implementation. But the Ant manual says of the task: "Right now it has effect only under Unix or NonStop Kernel (Tandem)."

Koji
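The same hack is easy enough to reproduce directly from Java: fork the chmod utility the way Ant does (Unix-only, as the Ant manual warns). A sketch with a made-up class name:

```java
import java.io.File;
import java.io.IOException;

public class ChmodHack
{
  /** Clear read-only bits the way Ant's chmod task does on Unix:
      by forking the chmod utility itself. */
  public static void makeWritable(File dir)
    throws IOException, InterruptedException
  {
    Process p = new ProcessBuilder("chmod", "-R", "u+w", dir.getAbsolutePath()).start();
    int rc = p.waitFor();
    if (rc != 0)
      throw new IOException("chmod exited with status " + rc);
  }

  public static void main(String[] args) throws Exception
  {
    File dir = new File("chmod-hack-demo");
    dir.mkdir();
    dir.setReadOnly();      // simulate what Derby does to its directories
    makeWritable(dir);
    System.out.println(dir.delete() ? "deleted" : "delete failed");
  }
}
```

ProcessBuilder is available as of Java 5, so this sidesteps the 1.6 requirement, at the price of Unix-only portability.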
RE: Derby/JUnit bad interaction - any ideas?
Yeah, I was pretty surprised too. But on Windows it is likely that File.setReadOnly() (which is what Derby must be using) doesn't actually do anything to directories, which would explain the discrepancy.

Karl

-----Original Message-----
From: ext Mark Miller [mailto:markrmil...@gmail.com]
Sent: Tuesday, June 08, 2010 9:45 AM
To: connectors-dev@incubator.apache.org
Subject: Re: Derby/JUnit bad interaction - any ideas?

Interesting - when I worked with Derby in the past, I never had any trouble deleting a database after shutting it down on Windows using Java 5.

--
- Mark

http://www.lucidimagination.com
Re: Derby/JUnit bad interaction - any ideas?
(10/06/08 23:14), karl.wri...@nokia.com wrote:

Yeah, I was pretty surprised too. But on Windows it is likely that File.setReadOnly() (which is what Derby must be using) doesn't actually do anything to directories, which would explain the discrepancy.

If so, luckily the Ant hack can solve the problem on Linux.

Koji

--
http://www.rondhuit.com/en/
RE: Derby/JUnit bad interaction - any ideas?
I just had a look at the sources. Ant's chmod task queries what kind of OS it is, and if it is the right kind, it actually attempts to fire off the chmod utility. ;-) That's pretty hacky. Nice to avoid that if possible.

Now, I was able to get my current set of brain-dead tests to work OK (and the ant cleanup too!) by making sure that the database was properly cleaned after every use, and leaving it around for later. It turns out that ant can delete the testing directory even though the directory underneath it has read-only stuff in it, even without the chmod. This seems to be because when any deletion fails, it simply calls f.deleteOnExit() and lets the JVM do it later - and apparently the JVM *can* do this, because it's implemented to just do an unlink at that time, which bypasses the need to actually delete any read-only subdirectories.

Oh my. What a strange mess. Still, things are currently working, so I guess I'll leave them as they are, for now.

Karl

-----Original Message-----
From: ext Koji Sekiguchi [mailto:k...@r.email.ne.jp]
Sent: Tuesday, June 08, 2010 10:30 AM
To: connectors-dev@incubator.apache.org
Subject: Re: Derby/JUnit bad interaction - any ideas?

If so, luckily the Ant hack can solve the problem on Linux.

Koji
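For completeness, the Java 6 route the thread keeps circling (the reason mandating 1.6 was on the table at all) is File.setWritable(), which can undo Derby's setReadOnly() before each delete. A minimal sketch, not LCF's actual recursiveDelete:

```java
import java.io.File;

public class CleanupUtil
{
  /** Recursively delete a directory tree, clearing read-only bits first.
      File.setWritable() requires Java 6 - the crux of this thread. */
  public static boolean recursiveDelete(File f)
  {
    f.setWritable(true);                 // undo Derby's setReadOnly()
    if (f.isDirectory())
    {
      File[] children = f.listFiles();
      if (children != null)
        for (File child : children)
          recursiveDelete(child);
    }
    return f.delete();
  }

  public static void main(String[] args) throws Exception
  {
    // Simulate what Derby does: read-only directories with a file inside.
    File root = new File("fake-derby-db");
    File seg = new File(root, "seg0");
    seg.mkdirs();
    new File(seg, "c1.dat").createNewFile();
    seg.setReadOnly();
    root.setReadOnly();
    System.out.println(recursiveDelete(root) && !root.exists());
  }
}
```

On Unix it is the parent directory's write bit that governs deleting its children, which is why each directory has to be made writable on the way down, before its contents are removed.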
Re: Derby/JUnit bad interaction - any ideas?
If we need to require Java 1.6, that is probably okay. I am fine with that. Does anybody have a serious objection to requiring Java 1.6 for LCF?

-- Jack Krupansky

--------------------------------------------------
From: karl.wri...@nokia.com
Sent: Tuesday, June 08, 2010 6:35 AM
To: connectors-dev@incubator.apache.org
Subject: Derby/JUnit bad interaction - any ideas?

I've been trying to get some basic tests working under JUnit. Unfortunately, I've run into a Derby problem which prevents these tests from working.
RE: Derby
The reason this occurs is that I am using Derby in embedded mode, and the restriction appears to be a limitation of that mode of operation. However, this mode is necessary to meet the testing goal, which was the prime motivator behind doing a Derby implementation. I am sure that if we were to use Derby as a service, the restriction would no longer apply, but then there would be no conceivable benefit either.

Karl

-----Original Message-----
From: ext Jack Krupansky [mailto:jack.krupan...@lucidimagination.com]
Sent: Friday, June 04, 2010 12:41 AM
To: connectors-dev@incubator.apache.org
Subject: Re: Derby

What is the nature of the single LCF process issue? Is it because the database is being used in single-user mode, or some other issue? Is it a permanent issue, or is there a solution or workaround anticipated at some stage? Thanks.

-- Jack Krupansky
RE: Derby
Yup.

Karl

-----Original Message-----
From: ext Jack Krupansky [mailto:jack.krupan...@lucidimagination.com]
Sent: Friday, June 04, 2010 12:27 AM
To: connectors-dev@incubator.apache.org
Subject: Re: Derby

Just to be clear, the full sequence would be:

1) Start UI app. Agent process should not be running.
2) Start LCF job in UI.
3) Shutdown UI app. Not just close the browser window.
4) AgentRun.
5) Wait long enough for crawl to have finished. Maybe watch to see that Solr has become idle.
6) Possibly commit to Solr.
7) AgentStop.
8) Back to step 1 for additional jobs.

Correct?

-- Jack Krupansky
RE: Derby
The daemon does not need to interact with the UI directly, only with the database. So, you stop the UI, start the daemon, and after a while, shut down the daemon and restart the UI.

Karl

-----Original Message-----
From: ext Jack Krupansky [mailto:jack.krupan...@lucidimagination.com]
Sent: Thursday, June 03, 2010 5:51 PM
To: connectors-dev@incubator.apache.org
Subject: Re: Derby

"(1) You can't run more than one LCF process at a time. That means you need to either run the daemon or the crawler-ui web application, but you can't run both at the same time."

How do you start a crawl, then, if not in the web app which then starts the agent process crawling? Thanks for all of this effort!

-- Jack Krupansky
Re: Derby
Just to be clear, the full sequence would be:

1) Start UI app. Agent process should not be running.
2) Start LCF job in UI.
3) Shutdown UI app. Not just close the browser window.
4) AgentRun.
5) Wait long enough for crawl to have finished. Maybe watch to see that Solr has become idle.
6) Possibly commit to Solr.
7) AgentStop.
8) Back to step 1 for additional jobs.

Correct?

-- Jack Krupansky

--------------------------------------------------
From: karl.wri...@nokia.com
Sent: Thursday, June 03, 2010 7:24 PM
To: connectors-dev@incubator.apache.org
Subject: RE: Derby

The daemon does not need to interact with the UI directly, only with the database. So, you stop the UI, start the daemon, and after a while, shut down the daemon and restart the UI.

Karl
Re: Derby
What is the nature of the single LCF process issue? Is it because the database is being used in single-user mode, or some other issue? Is it a permanent issue, or is there a solution or workaround anticipated at some stage? Thanks.

-- Jack Krupansky

--------------------------------------------------
From: karl.wri...@nokia.com
Sent: Thursday, June 03, 2010 5:34 PM
To: connectors-dev@incubator.apache.org
Subject: Derby

For what it's worth, after some 5 days of work, and a couple of schema changes to boot, LCF now runs with Derby. Some caveats:

(1) You can't run more than one LCF process at a time. That means you need to either run the daemon or the crawler-ui web application, but you can't run both at the same time.
(2) I haven't tested every query, so I'm sure there are probably some that are still broken.
(3) It's slow. Count yourself as fortunate if it runs at 1/5 the rate of Postgresql for you.
(4) Transactional integrity hasn't been evaluated.
(5) Deadlock detection and unique constraint violation detection are probably not right, because I'd need to cause these errors to occur before being able to key off their exception messages.
(6) I had to turn off the ability to sort on certain columns in the reports - basically, any column that was represented as a large character field.

Nevertheless, this represents an important milestone on the path to being able to write some kind of unit tests that have at least some meaning.

If you have an existing LCF Postgresql database, you will need to force an upgrade after going to the new trunk code. To do this, repeat the org.apache.lcf.agents.Install command, and the org.apache.lcf.agents.Register org.apache.lcf.crawler.system.CrawlerAgent command, after deploying the new code. And, please, let me know of any kind of errors you notice that could be related to the schema change.

Thanks,
Karl