Re: Too many slow queries caused by MCF running MySQL 5.5

2012-12-11 Thread Shigeki Kobayashi
Hi Karl.

I could build the source ok but the following code is missing
from connectors.xml. Does this mean I built it incorrectly or this is on
purpose?
Do I have to just add the code to enable the Windows share connection?


  repositoryconnector name=Windows shares
class=org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector/


Regards,

Shigeki


2012/12/11 Karl Wright daddy...@gmail.com

 Hi Shigeki,

 I'm uploading a new version of ManifoldCF 1.1-dev, which you can pick
 up at http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev .
 This has a good chance of fixing the query performance problem.
 Please try it out, and let me know if you still get slow queries in
 the log.  You should be to use the existing database instance.

 Thanks,
 Karl

 On Mon, Dec 10, 2012 at 5:05 PM, Karl Wright daddy...@gmail.com wrote:
  Experiments here indicate that FORCE INDEX seems to do what we need.
 
  I'm going to think about it a bit and then come up with a fix that
  should use FORCE INDEX in this situation.  Then we can see if it
  actually helps for you.
 
  Karl
 
 
  On Mon, Dec 10, 2012 at 8:01 AM, Karl Wright daddy...@gmail.com wrote:
  Sorry, the FORCE INDEX hint requires the name of the index.  Since
  ManifoldCF does not assign index names to fixed values, you will need
  to find the right one, by using the SHOW INDEX command first to get
  the right index's name.
 
  Apologies,
  Karl
 
 
  On Mon, Dec 10, 2012 at 6:41 AM, Karl Wright daddy...@gmail.com
 wrote:
  Ok, that is unfortunate.  I will do some further MySQL research here.
  There is a FORCE INDEX MySQL construct that may help, e.g.
 
  SELECT ... FROM ... FORCE INDEX (key1_key2_key3) WHERE ...
 
  which we can also try.  In this case that would be: FORCE INDEX
  (docpriority,status,checkaction,checktime) or FORCE INDEX
  (docpriority_status_checkaction_checktime)  - unclear what the right
  syntax actually is.  Maybe you can try an explain with that in the
  query?
 
  FWIW, PostgreSQL should always use the index for this situation.
 
  Karl
 
 
  On Mon, Dec 10, 2012 at 5:27 AM, Shigeki Kobayashi
  shigeki.kobayas...@g.softbank.co.jp wrote:
  Hi Karl,
 
  Thanks for the reply.
 
  I did EXPLAIN as following:
 
  mysql explain SELECT
  -
  t0.id
 ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
  - FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN
 ('P','G')
  - AND t0.checkaction='R' AND
  - t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1
 WHERE
  - t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5)
 AND NOT
  - EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash
 AND
  t2.status
  - IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
  EXISTS(SELECT
  - 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
  - t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
  - ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
 
 +++---++--++-+-++-+
  | id | select_type| table | type   | possible_keys
  | key| key_len | ref | rows   | Extra
  |
 
 +++---++--++-+-++-+
  |  1 | PRIMARY| t0| range  |
  I1354241297073,I1354241297072,I1354241297071 | I1354241297071 | 25
|
  NULL| 151494 | Using where; Using filesort |
  |  4 | DEPENDENT SUBQUERY | t3| ref| I1354241297077
  | I1354241297077 | 8   | manifoldcf.t0.id|  1 |
  |
  |  4 | DEPENDENT SUBQUERY | t4| eq_ref | PRIMARY
  | PRIMARY| 767 | manifoldcf.t3.eventname |  1 | Using
 index
  |
  |  3 | DEPENDENT SUBQUERY | t2| ref|
  I1354241297070,I1354241297073,I1354241297072 | I1354241297070 | 122
   |
  manifoldcf.t0.dochash   |  1 | Using where |
  |  2 | DEPENDENT SUBQUERY | t1| eq_ref | PRIMARY,I1354241297080
  | PRIMARY| 8   | manifoldcf.t0.jobid |  1 | Using
 where
  |
 
 +++---++--++-+-++-+
 
 
  As you see Using filesort, I do not think it uses the index.
 
  By the way, which database do you recommend for the case of crawling
  a
  humongous number of files for now? PostgreSQL?
 
 
  Regards,
 
  Shigeki
 
  2012/12/10 Karl Wright daddy...@gmail.com
 
  Since you have a large table, can you try an EXPLAIN for the
 following
  query, which should match the explanation given here:
  http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html ?
  Does it use the index?
 
  SELECT
 

Re: Too many slow queries caused by MCF running MySQL 5.5

2012-12-11 Thread Shigeki Kobayashi
Sorry, My bad.

jcifs.jar was missing. Probably this is the cause.

Sorry


Shigeki


2012/12/11 Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp

 Hi Karl.

 I could build the source ok but the following code is missing
 from connectors.xml. Does this mean I built it incorrectly or this is on
 purpose?
 Do I have to just add the code to enable the Windows share connection?


   repositoryconnector name=Windows shares
 class=org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector/


 Regards,

 Shigeki


 2012/12/11 Karl Wright daddy...@gmail.com

 Hi Shigeki,

 I'm uploading a new version of ManifoldCF 1.1-dev, which you can pick
 up at http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev .
 This has a good chance of fixing the query performance problem.
 Please try it out, and let me know if you still get slow queries in
 the log.  You should be to use the existing database instance.

 Thanks,
 Karl

 On Mon, Dec 10, 2012 at 5:05 PM, Karl Wright daddy...@gmail.com wrote:
  Experiments here indicate that FORCE INDEX seems to do what we need.
 
  I'm going to think about it a bit and then come up with a fix that
  should use FORCE INDEX in this situation.  Then we can see if it
  actually helps for you.
 
  Karl
 
 
  On Mon, Dec 10, 2012 at 8:01 AM, Karl Wright daddy...@gmail.com
 wrote:
  Sorry, the FORCE INDEX hint requires the name of the index.  Since
  ManifoldCF does not assign index names to fixed values, you will need
  to find the right one, by using the SHOW INDEX command first to get
  the right index's name.
 
  Apologies,
  Karl
 
 
  On Mon, Dec 10, 2012 at 6:41 AM, Karl Wright daddy...@gmail.com
 wrote:
  Ok, that is unfortunate.  I will do some further MySQL research here.
  There is a FORCE INDEX MySQL construct that may help, e.g.
 
  SELECT ... FROM ... FORCE INDEX (key1_key2_key3) WHERE ...
 
  which we can also try.  In this case that would be: FORCE INDEX
  (docpriority,status,checkaction,checktime) or FORCE INDEX
  (docpriority_status_checkaction_checktime)  - unclear what the right
  syntax actually is.  Maybe you can try an explain with that in the
  query?
 
  FWIW, PostgreSQL should always use the index for this situation.
 
  Karl
 
 
  On Mon, Dec 10, 2012 at 5:27 AM, Shigeki Kobayashi
  shigeki.kobayas...@g.softbank.co.jp wrote:
  Hi Karl,
 
  Thanks for the reply.
 
  I did EXPLAIN as following:
 
  mysql explain SELECT
  -
  t0.id
 ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
  - FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN
 ('P','G')
  - AND t0.checkaction='R' AND
  - t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs
 t1 WHERE
  - t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5)
 AND NOT
  - EXISTS(SELECT 'x' FROM jobqueue t2 WHERE
 t2.dochash=t0.dochash AND
  t2.status
  - IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
  EXISTS(SELECT
  - 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
  - t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
  - ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
 
 +++---++--++-+-++-+
  | id | select_type| table | type   | possible_keys
  | key| key_len | ref | rows   | Extra
  |
 
 +++---++--++-+-++-+
  |  1 | PRIMARY| t0| range  |
  I1354241297073,I1354241297072,I1354241297071 | I1354241297071 | 25
|
  NULL| 151494 | Using where; Using filesort |
  |  4 | DEPENDENT SUBQUERY | t3| ref| I1354241297077
  | I1354241297077 | 8   | manifoldcf.t0.id|  1 |
  |
  |  4 | DEPENDENT SUBQUERY | t4| eq_ref | PRIMARY
  | PRIMARY| 767 | manifoldcf.t3.eventname |  1 |
 Using index
  |
  |  3 | DEPENDENT SUBQUERY | t2| ref|
  I1354241297070,I1354241297073,I1354241297072 | I1354241297070 | 122
 |
  manifoldcf.t0.dochash   |  1 | Using where |
  |  2 | DEPENDENT SUBQUERY | t1| eq_ref | PRIMARY,I1354241297080
  | PRIMARY| 8   | manifoldcf.t0.jobid |  1 |
 Using where
  |
 
 +++---++--++-+-++-+
 
 
  As you see Using filesort, I do not think it uses the index.
 
  By the way, which database do you recommend for the case of crawling
  a
  humongous number of files for now? PostgreSQL?
 
 
  Regards,
 
  Shigeki
 
  2012/12/10 Karl Wright daddy...@gmail.com
 
  Since you have a large table, can you try an EXPLAIN for the
 

RE: Too many slow queries caused by MCF running MySQL 5.5

2012-12-11 Thread Karl Wright
You just need to run ant make-deps too before building.

Karl

Sent from my Windows Phone
--
From: Shigeki Kobayashi
Sent: 12/11/2012 3:58 AM
To: user@manifoldcf.apache.org
Subject: Re: Too many slow queries caused by MCF running MySQL 5.5

Hi Karl.

I could build the source ok but the following code is missing
from connectors.xml. Does this mean I built it incorrectly or this is on
purpose?
Do I have to just add the code to enable the Windows share connection?


  repositoryconnector name=Windows shares
class=org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector/


Regards,

Shigeki


2012/12/11 Karl Wright daddy...@gmail.com

 Hi Shigeki,

 I'm uploading a new version of ManifoldCF 1.1-dev, which you can pick
 up at http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev .
 This has a good chance of fixing the query performance problem.
 Please try it out, and let me know if you still get slow queries in
 the log.  You should be to use the existing database instance.

 Thanks,
 Karl

 On Mon, Dec 10, 2012 at 5:05 PM, Karl Wright daddy...@gmail.com wrote:
  Experiments here indicate that FORCE INDEX seems to do what we need.
 
  I'm going to think about it a bit and then come up with a fix that
  should use FORCE INDEX in this situation.  Then we can see if it
  actually helps for you.
 
  Karl
 
 
  On Mon, Dec 10, 2012 at 8:01 AM, Karl Wright daddy...@gmail.com wrote:
  Sorry, the FORCE INDEX hint requires the name of the index.  Since
  ManifoldCF does not assign index names to fixed values, you will need
  to find the right one, by using the SHOW INDEX command first to get
  the right index's name.
 
  Apologies,
  Karl
 
 
  On Mon, Dec 10, 2012 at 6:41 AM, Karl Wright daddy...@gmail.com
 wrote:
  Ok, that is unfortunate.  I will do some further MySQL research here.
  There is a FORCE INDEX MySQL construct that may help, e.g.
 
  SELECT ... FROM ... FORCE INDEX (key1_key2_key3) WHERE ...
 
  which we can also try.  In this case that would be: FORCE INDEX
  (docpriority,status,checkaction,checktime) or FORCE INDEX
  (docpriority_status_checkaction_checktime)  - unclear what the right
  syntax actually is.  Maybe you can try an explain with that in the
  query?
 
  FWIW, PostgreSQL should always use the index for this situation.
 
  Karl
 
 
  On Mon, Dec 10, 2012 at 5:27 AM, Shigeki Kobayashi
  shigeki.kobayas...@g.softbank.co.jp wrote:
  Hi Karl,
 
  Thanks for the reply.
 
  I did EXPLAIN as following:
 
  mysql explain SELECT
  -
  t0.id
 ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
  - FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN
 ('P','G')
  - AND t0.checkaction='R' AND
  - t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1
 WHERE
  - t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5)
 AND NOT
  - EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash
 AND
  t2.status
  - IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
  EXISTS(SELECT
  - 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
  - t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
  - ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
 
 +++---++--++-+-++-+
  | id | select_type| table | type   | possible_keys
  | key| key_len | ref | rows   | Extra
  |
 
 +++---++--++-+-++-+
  |  1 | PRIMARY| t0| range  |
  I1354241297073,I1354241297072,I1354241297071 | I1354241297071 | 25
|
  NULL| 151494 | Using where; Using filesort |
  |  4 | DEPENDENT SUBQUERY | t3| ref| I1354241297077
  | I1354241297077 | 8   | manifoldcf.t0.id|  1 |
  |
  |  4 | DEPENDENT SUBQUERY | t4| eq_ref | PRIMARY
  | PRIMARY| 767 | manifoldcf.t3.eventname |  1 | Using
 index
  |
  |  3 | DEPENDENT SUBQUERY | t2| ref|
  I1354241297070,I1354241297073,I1354241297072 | I1354241297070 | 122
   |
  manifoldcf.t0.dochash   |  1 | Using where |
  |  2 | DEPENDENT SUBQUERY | t1| eq_ref | PRIMARY,I1354241297080
  | PRIMARY| 8   | manifoldcf.t0.jobid |  1 | Using
 where
  |
 
 +++---++--++-+-++-+
 
 
  As you see Using filesort, I do not think it uses the index.
 
  By the way, which database do you recommend for the case of crawling
  a
  humongous number of files for now? PostgreSQL?
 
 
  Regards,
 
  Shigeki
 
  2012/12/10

Re: Too many slow queries caused by MCF running MySQL 5.5

2012-12-11 Thread Shigeki Kobayashi
Hi Karl.

I downloaded the new MCF dev and crawled files again.
There are still the same slow queries. The follwoing is the result of
explain against the slow query:


mysql explain SELECT
t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
FROM jobqueue t0 FORCE INDEX (i1355220194891) WHERE t0.status IN ('P','G')
AND t0.checkaction='R' AND t0.checktime=1355221199007 AND EXISTS(SELECT
'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND
t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE
t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND
t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events
t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority
ASC,t0.status ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
+++---++--++-+-+---+-+
| id | select_type| table | type   | possible_keys
   | key| key_len | ref | rows
 | Extra   |
+++---++--++-+-+---+-+
|  1 | PRIMARY| t0| ALL| NULL
  | NULL   | NULL| NULL| 85293
| Using where; Using filesort |
|  4 | DEPENDENT SUBQUERY | t3| ref| I1355220194897
  | I1355220194897 | 8   | manifoldcf.t0.id| 1
| |
|  4 | DEPENDENT SUBQUERY | t4| eq_ref | PRIMARY
   | PRIMARY| 767 | manifoldcf.t3.eventname | 1
| Using index |
|  3 | DEPENDENT SUBQUERY | t2| ref|
I1355220194890,I1355220194893,I1355220194892 | I1355220194890 | 122 |
manifoldcf.t0.dochash   | 1 | Using where |
|  2 | DEPENDENT SUBQUERY | t1| eq_ref | PRIMARY,I1355220194900
  | PRIMARY| 8   | manifoldcf.t0.jobid | 1
| Using where |
+++---++--++-+-+---+-+
5 rows in set (0.00 sec)


Full table scanning was happening still.


Regards,

Shigeki

2012/12/11 Karl Wright daddy...@gmail.com

 You just need to run ant make-deps too before building.

 Karl

 Sent from my Windows Phone
 --
 From: Shigeki Kobayashi
 Sent: 12/11/2012 3:58 AM
 To: user@manifoldcf.apache.org
 Subject: Re: Too many slow queries caused by MCF running MySQL 5.5

 Hi Karl.

 I could build the source ok but the following code is missing
 from connectors.xml. Does this mean I built it incorrectly or this is on
 purpose?
 Do I have to just add the code to enable the Windows share connection?


   repositoryconnector name=Windows shares
 class=org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector/


 Regards,

 Shigeki


 2012/12/11 Karl Wright daddy...@gmail.com

 Hi Shigeki,

 I'm uploading a new version of ManifoldCF 1.1-dev, which you can pick
 up at http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev .
 This has a good chance of fixing the query performance problem.
 Please try it out, and let me know if you still get slow queries in
 the log.  You should be to use the existing database instance.

 Thanks,
 Karl

 On Mon, Dec 10, 2012 at 5:05 PM, Karl Wright daddy...@gmail.com wrote:
  Experiments here indicate that FORCE INDEX seems to do what we need.
 
  I'm going to think about it a bit and then come up with a fix that
  should use FORCE INDEX in this situation.  Then we can see if it
  actually helps for you.
 
  Karl
 
 
  On Mon, Dec 10, 2012 at 8:01 AM, Karl Wright daddy...@gmail.com
 wrote:
  Sorry, the FORCE INDEX hint requires the name of the index.  Since
  ManifoldCF does not assign index names to fixed values, you will need
  to find the right one, by using the SHOW INDEX command first to get
  the right index's name.
 
  Apologies,
  Karl
 
 
  On Mon, Dec 10, 2012 at 6:41 AM, Karl Wright daddy...@gmail.com
 wrote:
  Ok, that is unfortunate.  I will do some further MySQL research here.
  There is a FORCE INDEX MySQL construct that may help, e.g.
 
  SELECT ... FROM ... FORCE INDEX (key1_key2_key3) WHERE ...
 
  which we can also try.  In this case that would be: FORCE INDEX
  (docpriority,status,checkaction,checktime) or FORCE INDEX
  (docpriority_status_checkaction_checktime)  - unclear what the right
  syntax actually is.  Maybe you can try an explain with that in the
  query?
 
  FWIW, PostgreSQL should always use the index for this situation.
 
  Karl
 
 
  On Mon, Dec 10, 2012 at 5:27 AM, Shigeki Kobayashi
  shigeki.kobayas...@g.softbank.co.jp wrote:
  Hi Karl

Re: Too many slow queries caused by MCF running MySQL 5.5

2012-12-10 Thread Karl Wright
Since you have a large table, can you try an EXPLAIN for the following
query, which should match the explanation given here:
http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html ?
Does it use the index?

SELECT
t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
AND t0.checkaction='R' AND
t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status
IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200

Thanks!
Karl

On Mon, Dec 10, 2012 at 2:49 AM, Karl Wright daddy...@gmail.com wrote:
 Hi Shigeki,

 The rules for when a database will use an index for an ORDER BY clause
 differ significantly from database to database.  The current logic
 seems to satisfy PostgreSQL, HSQLDB, and Derby, but clearly not MySQL.
  I will see if I can find a solution.  The ticket for this
 CONNECTORS-584.

 Karl

 On Mon, Dec 10, 2012 at 2:13 AM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:

 Hi.


 I downloaded MCF1.1dev on Nov, 29th, and ran it using MySQL
 I tried to crawl 10 million files using Windows share connection and index
 them into Solr.

 As MCF reached over 1 million files, the crawling speed started getting
 slower.
 So I checked slow queries and found out that too many slow queries occurred,
 especially the following kinds:

 
 # Time: 121204 16:25:40
 # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
 # Query_time: 7.240532  Lock_time: 0.000204 Rows_sent: 1200  Rows_examined:
 611091
 SET timestamp=1354605940;
 SELECT
 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 FROM jobqueue t0 WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND
 t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status
 IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
 t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
 ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
 # Time: 121204 16:25:44
 # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
 # Query_time: 3.064339  Lock_time: 0.84 Rows_sent: 1  Rows_examined:
 406359
 SET timestamp=1354605944;
 SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 WHERE status IN
 ('P','G') AND checkaction='R' AND checktime=1354605932817 AND EXISTS(SELECT
 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid)  ORDER BY
 docpriority ASC,status ASC,checkaction ASC,checktime ASC LIMIT 1;
 ---

 I wonder if the queries appropriately use index of the table.
 As a result of EXPLAIN against the slow query, there was filesort.
 There seems to be some conditions that MySQL does not use index depending on
 ORDER BY:
  - Executing ORDER BY against multiple keys
  - When keys selected from records are different from keys used by ORDER BY

 Since filesort was happening, fully scanning records should be having MCF
 slower.

 Do you think this could happen even in PostgreSQL or HSQLDB?
 Do you think queries could be modified to use index appropriately?


 Regards,

 Shigeki


Re: Too many slow queries caused by MCF running MySQL 5.5

2012-12-10 Thread Shigeki Kobayashi
Hi Karl,

Thanks for the reply.

I did EXPLAIN as following:

mysql explain SELECT
- t0.id
,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
- FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
- AND t0.checkaction='R' AND
- t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
- t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
- EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
t2.status
- IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
EXISTS(SELECT
- 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
- t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
- ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
+++---++--++-+-++-+
| id | select_type| table | type   | possible_keys
   | key| key_len | ref | rows
  | Extra   |
+++---++--++-+-++-+
|  1 | PRIMARY| t0| range  |
I1354241297073,I1354241297072,I1354241297071 | I1354241297071 | 25  |
NULL| 151494 | Using where; Using filesort |
|  4 | DEPENDENT SUBQUERY | t3| ref| I1354241297077
  | I1354241297077 | 8   | manifoldcf.t0.id|  1
| |
|  4 | DEPENDENT SUBQUERY | t4| eq_ref | PRIMARY
   | PRIMARY| 767 | manifoldcf.t3.eventname |
 1 | Using index |
|  3 | DEPENDENT SUBQUERY | t2| ref|
I1354241297070,I1354241297073,I1354241297072 | I1354241297070 | 122 |
manifoldcf.t0.dochash   |  1 | Using where |
|  2 | DEPENDENT SUBQUERY | t1| eq_ref | PRIMARY,I1354241297080
  | PRIMARY| 8   | manifoldcf.t0.jobid |  1
| Using where |
+++---++--++-+-++-+


As you see Using filesort, I do not think it uses the index.

By the way, which database do you recommend for the case of crawling  a
humongous number of files for now? PostgreSQL?


Regards,

Shigeki

2012/12/10 Karl Wright daddy...@gmail.com

 Since you have a large table, can you try an EXPLAIN for the following
 query, which should match the explanation given here:
 http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html ?
 Does it use the index?

 SELECT
 t0.id
 ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
 AND t0.checkaction='R' AND
 t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
 t2.status
 IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
 t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
 ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200

 Thanks!
 Karl

 On Mon, Dec 10, 2012 at 2:49 AM, Karl Wright daddy...@gmail.com wrote:
  Hi Shigeki,
 
  The rules for when a database will use an index for an ORDER BY clause
  differ significantly from database to database.  The current logic
  seems to satisfy PostgreSQL, HSQLDB, and Derby, but clearly not MySQL.
   I will see if I can find a solution.  The ticket for this
  CONNECTORS-584.
 
  Karl
 
  On Mon, Dec 10, 2012 at 2:13 AM, Shigeki Kobayashi
  shigeki.kobayas...@g.softbank.co.jp wrote:
 
  Hi.
 
 
  I downloaded MCF1.1dev on Nov, 29th, and ran it using MySQL
  I tried to crawl 10 million files using Windows share connection and
 index
  them into Solr.
 
  As MCF reached over 1 million files, the crawling speed started getting
  slower.
  So I checked slow queries and found out that too many slow queries
 occurred,
  especially the following kinds:
 
  
  # Time: 121204 16:25:40
  # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
  # Query_time: 7.240532  Lock_time: 0.000204 Rows_sent: 1200
  Rows_examined:
  611091
  SET timestamp=1354605940;
  SELECT
  t0.id
 ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
  FROM jobqueue t0 WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND
  t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
  t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
  EXISTS(SELECT 'x' FROM jobqueue 

Re: Too many slow queries caused by MCF running MySQL 5.5

2012-12-10 Thread Karl Wright
Ok, that is unfortunate.  I will do some further MySQL research here.
There is a FORCE INDEX MySQL construct that may help, e.g.

SELECT ... FROM ... FORCE INDEX (key1_key2_key3) WHERE ...

which we can also try.  In this case that would be: FORCE INDEX
(docpriority,status,checkaction,checktime) or FORCE INDEX
(docpriority_status_checkaction_checktime)  - unclear what the right
syntax actually is.  Maybe you can try an explain with that in the
query?

FWIW, PostgreSQL should always use the index for this situation.

Karl


On Mon, Dec 10, 2012 at 5:27 AM, Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp wrote:
 Hi Karl,

 Thanks for the reply.

 I did EXPLAIN as following:

 mysql explain SELECT
 -
 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 - FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
 - AND t0.checkaction='R' AND
 - t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 - t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 - EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
 t2.status
 - IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
 EXISTS(SELECT
 - 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
 - t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
 - ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
 +++---++--++-+-++-+
 | id | select_type| table | type   | possible_keys
 | key| key_len | ref | rows   | Extra
 |
 +++---++--++-+-++-+
 |  1 | PRIMARY| t0| range  |
 I1354241297073,I1354241297072,I1354241297071 | I1354241297071 | 25  |
 NULL| 151494 | Using where; Using filesort |
 |  4 | DEPENDENT SUBQUERY | t3| ref| I1354241297077
 | I1354241297077 | 8   | manifoldcf.t0.id|  1 |
 |
 |  4 | DEPENDENT SUBQUERY | t4| eq_ref | PRIMARY
 | PRIMARY| 767 | manifoldcf.t3.eventname |  1 | Using index
 |
 |  3 | DEPENDENT SUBQUERY | t2| ref|
 I1354241297070,I1354241297073,I1354241297072 | I1354241297070 | 122 |
 manifoldcf.t0.dochash   |  1 | Using where |
 |  2 | DEPENDENT SUBQUERY | t1| eq_ref | PRIMARY,I1354241297080
 | PRIMARY| 8   | manifoldcf.t0.jobid |  1 | Using where
 |
 +++---++--++-+-++-+


 As you see Using filesort, I do not think it uses the index.

 By the way, which database do you recommend for the case of crawling  a
 humongous number of files for now? PostgreSQL?


 Regards,

 Shigeki

 2012/12/10 Karl Wright daddy...@gmail.com

 Since you have a large table, can you try an EXPLAIN for the following
 query, which should match the explanation given here:
 http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html ?
 Does it use the index?

 SELECT

 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
 AND t0.checkaction='R' AND
 t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
 t2.status
 IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
 t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
 ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200

 Thanks!
 Karl

 On Mon, Dec 10, 2012 at 2:49 AM, Karl Wright daddy...@gmail.com wrote:
  Hi Shigeki,
 
  The rules for when a database will use an index for an ORDER BY clause
  differ significantly from database to database.  The current logic
  seems to satisfy PostgreSQL, HSQLDB, and Derby, but clearly not MySQL.
   I will see if I can find a solution.  The ticket for this
  CONNECTORS-584.
 
  Karl
 
  On Mon, Dec 10, 2012 at 2:13 AM, Shigeki Kobayashi
  shigeki.kobayas...@g.softbank.co.jp wrote:
 
  Hi.
 
 
  I downloaded MCF1.1dev on Nov, 29th, and ran it using MySQL
  I tried to crawl 10 million files using Windows share connection and
  index
  them into Solr.
 
  As MCF reached over 1 million files, the crawling speed started getting
  slower.
  So I checked slow queries and found out that too many slow queries
  occurred,
  especially the following kinds:
 
  
  # Time: 121204 

Re: Too many slow queries caused by MCF running MySQL 5.5

2012-12-10 Thread Karl Wright
Sorry, the FORCE INDEX hint requires the name of the index.  Since
ManifoldCF does not assign index names to fixed values, you will need
to find the right one, by using the SHOW INDEX command first to get
the right index's name.

Apologies,
Karl


On Mon, Dec 10, 2012 at 6:41 AM, Karl Wright daddy...@gmail.com wrote:
 Ok, that is unfortunate.  I will do some further MySQL research here.
 There is a FORCE INDEX MySQL construct that may help, e.g.

 SELECT ... FROM ... FORCE INDEX (key1_key2_key3) WHERE ...

 which we can also try.  In this case that would be: FORCE INDEX
 (docpriority,status,checkaction,checktime) or FORCE INDEX
 (docpriority_status_checkaction_checktime)  - unclear what the right
 syntax actually is.  Maybe you can try an explain with that in the
 query?

 FWIW, PostgreSQL should always use the index for this situation.

 Karl


 On Mon, Dec 10, 2012 at 5:27 AM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:
 Hi Karl,

 Thanks for the reply.

 I did EXPLAIN as following:

 mysql explain SELECT
 -
 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 - FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
 - AND t0.checkaction='R' AND
 - t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 - t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 - EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
 t2.status
 - IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
 EXISTS(SELECT
 - 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
 - t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
 - ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
 +++---++--++-+-++-+
 | id | select_type| table | type   | possible_keys
 | key| key_len | ref | rows   | Extra
 |
 +++---++--++-+-++-+
 |  1 | PRIMARY| t0| range  |
 I1354241297073,I1354241297072,I1354241297071 | I1354241297071 | 25  |
 NULL| 151494 | Using where; Using filesort |
 |  4 | DEPENDENT SUBQUERY | t3| ref| I1354241297077
 | I1354241297077 | 8   | manifoldcf.t0.id|  1 |
 |
 |  4 | DEPENDENT SUBQUERY | t4| eq_ref | PRIMARY
 | PRIMARY| 767 | manifoldcf.t3.eventname |  1 | Using index
 |
 |  3 | DEPENDENT SUBQUERY | t2| ref|
 I1354241297070,I1354241297073,I1354241297072 | I1354241297070 | 122 |
 manifoldcf.t0.dochash   |  1 | Using where |
 |  2 | DEPENDENT SUBQUERY | t1| eq_ref | PRIMARY,I1354241297080
 | PRIMARY| 8   | manifoldcf.t0.jobid |  1 | Using where
 |
 +++---++--++-+-++-+


 As you see Using filesort, I do not think it uses the index.

 By the way, which database do you recommend for the case of crawling  a
 humongous number of files for now? PostgreSQL?


 Regards,

 Shigeki

 2012/12/10 Karl Wright daddy...@gmail.com

 Since you have a large table, can you try an EXPLAIN for the following
 query, which should match the explanation given here:
 http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html ?
 Does it use the index?

 SELECT

 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
 AND t0.checkaction='R' AND
 t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
 t2.status
 IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
 t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
 ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200

 Thanks!
 Karl

 On Mon, Dec 10, 2012 at 2:49 AM, Karl Wright daddy...@gmail.com wrote:
  Hi Shigeki,
 
  The rules for when a database will use an index for an ORDER BY clause
  differ significantly from database to database.  The current logic
  seems to satisfy PostgreSQL, HSQLDB, and Derby, but clearly not MySQL.
   I will see if I can find a solution.  The ticket for this
  CONNECTORS-584.
 
  Karl
 
  On Mon, Dec 10, 2012 at 2:13 AM, Shigeki Kobayashi
  shigeki.kobayas...@g.softbank.co.jp wrote:
 
  Hi.
 
 
  I downloaded MCF1.1dev on Nov, 29th, and ran it using MySQL
  I tried to crawl 10 million files using Windows 

Re: Too many slow queries caused by MCF running MySQL 5.5

2012-12-10 Thread Karl Wright
Experiments here indicate that FORCE INDEX seems to do what we need.

I'm going to think about it a bit and then come up with a fix that
should use FORCE INDEX in this situation.  Then we can see if it
actually helps for you.

Karl


On Mon, Dec 10, 2012 at 8:01 AM, Karl Wright daddy...@gmail.com wrote:
 Sorry, the FORCE INDEX hint requires the name of the index.  Since
 ManifoldCF does not assign index names to fixed values, you will need
 to find the right one, by using the SHOW INDEX command first to get
 the right index's name.

 Apologies,
 Karl


 On Mon, Dec 10, 2012 at 6:41 AM, Karl Wright daddy...@gmail.com wrote:
 Ok, that is unfortunate.  I will do some further MySQL research here.
 There is a FORCE INDEX MySQL construct that may help, e.g.

 SELECT ... FROM ... FORCE INDEX (key1_key2_key3) WHERE ...

 which we can also try.  In this case that would be: FORCE INDEX
 (docpriority,status,checkaction,checktime) or FORCE INDEX
 (docpriority_status_checkaction_checktime)  - unclear what the right
 syntax actually is.  Maybe you can try an explain with that in the
 query?

 FWIW, PostgreSQL should always use the index for this situation.

 Karl


 On Mon, Dec 10, 2012 at 5:27 AM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:
 Hi Karl,

 Thanks for the reply.

 I did EXPLAIN as following:

 mysql explain SELECT
 -
 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 - FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
 - AND t0.checkaction='R' AND
 - t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 - t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 - EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
 t2.status
 - IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
 EXISTS(SELECT
 - 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
 - t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
 - ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
 +++---++--++-+-++-+
 | id | select_type| table | type   | possible_keys
 | key| key_len | ref | rows   | Extra
 |
 +++---++--++-+-++-+
 |  1 | PRIMARY| t0| range  |
 I1354241297073,I1354241297072,I1354241297071 | I1354241297071 | 25  |
 NULL| 151494 | Using where; Using filesort |
 |  4 | DEPENDENT SUBQUERY | t3| ref| I1354241297077
 | I1354241297077 | 8   | manifoldcf.t0.id|  1 |
 |
 |  4 | DEPENDENT SUBQUERY | t4| eq_ref | PRIMARY
 | PRIMARY| 767 | manifoldcf.t3.eventname |  1 | Using index
 |
 |  3 | DEPENDENT SUBQUERY | t2| ref|
 I1354241297070,I1354241297073,I1354241297072 | I1354241297070 | 122 |
 manifoldcf.t0.dochash   |  1 | Using where |
 |  2 | DEPENDENT SUBQUERY | t1| eq_ref | PRIMARY,I1354241297080
 | PRIMARY| 8   | manifoldcf.t0.jobid |  1 | Using where
 |
 +++---++--++-+-++-+


 As you see Using filesort, I do not think it uses the index.

 By the way, which database do you recommend for the case of crawling  a
 humongous number of files for now? PostgreSQL?


 Regards,

 Shigeki

 2012/12/10 Karl Wright daddy...@gmail.com

 Since you have a large table, can you try an EXPLAIN for the following
 query, which should match the explanation given here:
 http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html ?
 Does it use the index?

 SELECT

 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
 AND t0.checkaction='R' AND
 t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
 t2.status
 IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
 t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
 ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200

 Thanks!
 Karl

 On Mon, Dec 10, 2012 at 2:49 AM, Karl Wright daddy...@gmail.com wrote:
  Hi Shigeki,
 
  The rules for when a database will use an index for an ORDER BY clause
  differ significantly from database to database.  The current logic
  seems to satisfy PostgreSQL, HSQLDB, and Derby, but clearly not MySQL.
 

Re: Too many slow queries caused by MCF running MySQL 5.5

2012-12-10 Thread Karl Wright
Hi Shigeki,

I'm uploading a new version of ManifoldCF 1.1-dev, which you can pick
up at http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev .
This has a good chance of fixing the query performance problem.
Please try it out, and let me know if you still get slow queries in
the log.  You should be to use the existing database instance.

Thanks,
Karl

On Mon, Dec 10, 2012 at 5:05 PM, Karl Wright daddy...@gmail.com wrote:
 Experiments here indicate that FORCE INDEX seems to do what we need.

 I'm going to think about it a bit and then come up with a fix that
 should use FORCE INDEX in this situation.  Then we can see if it
 actually helps for you.

 Karl


 On Mon, Dec 10, 2012 at 8:01 AM, Karl Wright daddy...@gmail.com wrote:
 Sorry, the FORCE INDEX hint requires the name of the index.  Since
 ManifoldCF does not assign index names to fixed values, you will need
 to find the right one, by using the SHOW INDEX command first to get
 the right index's name.

 Apologies,
 Karl


 On Mon, Dec 10, 2012 at 6:41 AM, Karl Wright daddy...@gmail.com wrote:
 Ok, that is unfortunate.  I will do some further MySQL research here.
 There is a FORCE INDEX MySQL construct that may help, e.g.

 SELECT ... FROM ... FORCE INDEX (key1_key2_key3) WHERE ...

 which we can also try.  In this case that would be: FORCE INDEX
 (docpriority,status,checkaction,checktime) or FORCE INDEX
 (docpriority_status_checkaction_checktime)  - unclear what the right
 syntax actually is.  Maybe you can try an explain with that in the
 query?

 FWIW, PostgreSQL should always use the index for this situation.

 Karl


 On Mon, Dec 10, 2012 at 5:27 AM, Shigeki Kobayashi
 shigeki.kobayas...@g.softbank.co.jp wrote:
 Hi Karl,

 Thanks for the reply.

 I did EXPLAIN as following:

 mysql explain SELECT
 -
 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 - FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN 
 ('P','G')
 - AND t0.checkaction='R' AND
 - t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 - t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 - EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
 t2.status
 - IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
 EXISTS(SELECT
 - 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
 - t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
 - ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
 +++---++--++-+-++-+
 | id | select_type| table | type   | possible_keys
 | key| key_len | ref | rows   | Extra
 |
 +++---++--++-+-++-+
 |  1 | PRIMARY| t0| range  |
 I1354241297073,I1354241297072,I1354241297071 | I1354241297071 | 25  |
 NULL| 151494 | Using where; Using filesort |
 |  4 | DEPENDENT SUBQUERY | t3| ref| I1354241297077
 | I1354241297077 | 8   | manifoldcf.t0.id|  1 |
 |
 |  4 | DEPENDENT SUBQUERY | t4| eq_ref | PRIMARY
 | PRIMARY| 767 | manifoldcf.t3.eventname |  1 | Using index
 |
 |  3 | DEPENDENT SUBQUERY | t2| ref|
 I1354241297070,I1354241297073,I1354241297072 | I1354241297070 | 122 |
 manifoldcf.t0.dochash   |  1 | Using where |
 |  2 | DEPENDENT SUBQUERY | t1| eq_ref | PRIMARY,I1354241297080
 | PRIMARY| 8   | manifoldcf.t0.jobid |  1 | Using where
 |
 +++---++--++-+-++-+


 As you see Using filesort, I do not think it uses the index.

 By the way, which database do you recommend for the case of crawling  a
 humongous number of files for now? PostgreSQL?


 Regards,

 Shigeki

 2012/12/10 Karl Wright daddy...@gmail.com

 Since you have a large table, can you try an EXPLAIN for the following
 query, which should match the explanation given here:
 http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html ?
 Does it use the index?

 SELECT

 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 FROM jobqueue t0 WHERE t0.docpriority = 0 AND t0.status IN ('P','G')
 AND t0.checkaction='R' AND
 t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
 t2.status
 IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
 'x' FROM prereqevents t3,events t4 WHERE 

Too many slow queries caused by MCF running MySQL 5.5

2012-12-09 Thread Shigeki Kobayashi
Hi.


I downloaded MCF1.1dev on Nov, 29th, and ran it using MySQL
I tried to crawl 10 million files using Windows share connection and index
them into Solr.

As MCF reached over 1 million files, the crawling speed started getting
slower.
So I checked slow queries and found out that too many slow queries
occurred, especially the following kinds:


# Time: 121204 16:25:40
# User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
# Query_time: 7.240532  Lock_time: 0.000204 Rows_sent: 1200  Rows_examined:
611091
SET timestamp=1354605940;
SELECT 
t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
FROM jobqueue t0 WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND
t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
t2.status IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
EXISTS(SELECT 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
# Time: 121204 16:25:44
# User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
# Query_time: 3.064339  Lock_time: 0.84 Rows_sent: 1  Rows_examined:
406359
SET timestamp=1354605944;
SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 WHERE status IN
('P','G') AND checkaction='R' AND checktime=1354605932817 AND
EXISTS(SELECT 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid)
 ORDER BY docpriority ASC,status ASC,checkaction ASC,checktime ASC LIMIT 1;
---

I wonder if the queries appropriately use index of the table.
As a result of EXPLAIN against the slow query, there was filesort.
There seems to be some conditions that MySQL does not use index depending
on ORDER BY:
 - Executing ORDER BY against multiple keys
 - When keys selected from records are different from keys used by ORDER BY

Since filesort was happening, fully scanning records should be having MCF
slower.

Do you think this could happen even in PostgreSQL or HSQLDB?
Do you think queries could be modified to use index appropriately?


Regards,

Shigeki


Re: Too many slow queries caused by MCF running MySQL 5.5

2012-12-09 Thread Karl Wright
Hi Shigeki,

The rules for when a database will use an index for an ORDER BY clause
differ significantly from database to database.  The current logic
seems to satisfy PostgreSQL, HSQLDB, and Derby, but clearly not MySQL.
 I will see if I can find a solution.  The ticket for this
CONNECTORS-584.

Karl

On Mon, Dec 10, 2012 at 2:13 AM, Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp wrote:

 Hi.


 I downloaded MCF1.1dev on Nov, 29th, and ran it using MySQL
 I tried to crawl 10 million files using Windows share connection and index
 them into Solr.

 As MCF reached over 1 million files, the crawling speed started getting
 slower.
 So I checked slow queries and found out that too many slow queries occurred,
 especially the following kinds:

 
 # Time: 121204 16:25:40
 # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
 # Query_time: 7.240532  Lock_time: 0.000204 Rows_sent: 1200  Rows_examined:
 611091
 SET timestamp=1354605940;
 SELECT
 t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
 FROM jobqueue t0 WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND
 t0.checktime=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
 t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
 EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status
 IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
 t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
 ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
 # Time: 121204 16:25:44
 # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
 # Query_time: 3.064339  Lock_time: 0.84 Rows_sent: 1  Rows_examined:
 406359
 SET timestamp=1354605944;
 SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 WHERE status IN
 ('P','G') AND checkaction='R' AND checktime=1354605932817 AND EXISTS(SELECT
 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid)  ORDER BY
 docpriority ASC,status ASC,checkaction ASC,checktime ASC LIMIT 1;
 ---

 I wonder if the queries appropriately use index of the table.
 As a result of EXPLAIN against the slow query, there was filesort.
 There seems to be some conditions that MySQL does not use index depending on
 ORDER BY:
  - Executing ORDER BY against multiple keys
  - When keys selected from records are different from keys used by ORDER BY

 Since filesort was happening, fully scanning records should be having MCF
 slower.

 Do you think this could happen even in PostgreSQL or HSQLDB?
 Do you think queries could be modified to use index appropriately?


 Regards,

 Shigeki