Re: [mart-dev] query performance on sequences

Rochak Neupane Wed, 08 Jul 2009 16:15:18 -0700

>From looking at the logs, it looks like when grabbing sequences, BioMart
makes 3 different types of queries.


1) query main table. transcript_id_key from transcript__main table
(exportable) - this is indexed.
Explain on query generated by biomart.

explain SELECT main.transcript_id_key, main.organism_name, main.gene_name
FROM phytozome_mart.phytozome__transcript__main main WHERE
(main.transcript_id_key = '16429347') AND (main.organism_id = '113') LIMIT
200;
+----+-------------+-------+-------+---------------------------------------------+---------+---------+-------+------+-------+
| id | select_type | table | type  |
possible_keys                               | key     | key_len | ref   |
rows | Extra |
+----+-------------+-------+-------+---------------------------------------------+---------+---------+-------+------+-------+
|  1 | SIMPLE      | main  | const |
PRIMARY,transcript_id_key,org_id,org_pac_id | PRIMARY | 4       | const |
1 |       |
+----+-------------+-------+-------+---------------------------------------------+---------+---------+-------+------+-------+

2) query structure table for coordinates, using transcript_id_key from step
1. transcript_id_key is indexed in this table as well.
mysql> explain SELECT main.transcript_id_key, main.transcript_id_key,
main.chr_name, main.exon_cds_start, main.exon_cds_end,
main.exon_chrom_strand, main.exon_phase, main.exon_rank, main.organism_id
FROM phytozome_mart.phytozome_structure__structure__main main WHERE
(main.transcript_id_key IN('16429347')) LIMIT 200;
+----+-------------+-------+------+-------------------+-------------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys     | key               |
key_len | ref   | rows | Extra       |
+----+-------------+-------+------+-------------------+-------------------+---------+-------+------+-------------+
|  1 | SIMPLE      | main  | ref  | transcript_id_key | transcript_id_key |
5       | const |    6 | Using where |
+----+-------------+-------+------+-------------------+-------------------+---------+-------+------+-------------+

3) finally, using the locations from step 2, query sequence mart using
chr_name, chr_start, and org_id. all three of these are indexed (index
below)
BioMart.Dataset.GenomicSequence.DNAAdaptor:176:INFO> QUERY FULL SQL:  select
sequence from sequence_phytozome__dna_chunks__main where chr_start = ? and
chr_name = ? and org_id = ?
BioMart.Dataset.GenomicSequence.DNAAdaptor:177:INFO> QUERY SUB SQL:  select
substring(sequence, ?, ?) from sequence_phytozome__dna_chunks__main where
chr_start = ? and chr_name = ? and org_id = ?

Trying an explain with values filled in:
explain select substring(sequence, 29665, 47) from
sequence_mart.sequence_phytozome__dna_chunks__main where chr_start = 200001
and chr_name = 'scaffold_406' and org_id = 77;
+----+-------------+--------------------------------------+------+-------------------------------+------+---------+-------------------+------+-------------+
| id | select_type | table                                | type |
possible_keys                 | key  | key_len | ref               | rows |
Extra       |
+----+-------------+--------------------------------------+------+-------------------------------+------+---------+-------------------+------+-------------+
|  1 | SIMPLE      | sequence_phytozome__dna_chunks__main | ref  |
chr_name,chr_start,org_id,nsi | nsi  | 52      | const,const,const |    1 |
Using where |
+----+-------------+--------------------------------------+------+-------------------------------+------+---------+-------------------+------+-------------+

mysql> show index from sequence_mart.sequence_phytozome__dna_chunks__main
where key_name = 'nsi';
+--------------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table                                | Non_unique | Key_name |
Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed |
Null | Index_type | Comment |
+--------------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| sequence_phytozome__dna_chunks__main |          1 | nsi      |
1 | chr_name    | A         |       41014 |     NULL | NULL   |      |
BTREE      |         |
| sequence_phytozome__dna_chunks__main |          1 | nsi      |
2 | chr_start   | A         |      123043 |     NULL | NULL   | YES  |
BTREE      |         |
| sequence_phytozome__dna_chunks__main |          1 | nsi      |
3 | org_id      | A         |      123043 |     NULL | NULL   | YES  |
BTREE      |         |
+--------------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+

seems like proper indexes are used when running these queries. Any way to
improve these indexes? Or any other ways to improve performance?

Thanks,
rochak

On Tue, Jul 7, 2009 at 9:47 PM, Syed Haider <[email protected]> wrote:

> Hi David,
>
> it isnt the processing of coordinates which takes major proportion of the
> time, usually its the time taken by database to return the results. Could
> you please check if indices are in place for the table that serves the
> sequence coordinates ?
>
> thanks
> Syed
>
>
> David M. Goodstein wrote:
>
>>
>>
>>
>>
>> On 7 Jul 2009, at 12:23, Syed Haider wrote:
>>
>>  Hi Rochak,
>>>
>>> Rochak Neupane wrote:
>>>
>>>> When querying for sequences on a dataset set, without any filters (so as
>>>> to grab complete set of sequences), marts seem to be quite slow. Pulling
>>>> peptides for human, for example, took an excess of 7 minutes from ensembl
>>>> mart. Grabbing peptide sequences from Caenorhabditis elegans (wormbase db,
>>>> gene dataset from biomart.org <http://biomart.org>) also took about 7
>>>> or so minutes for a file that turns out to be 9MB.
>>>> Our own mart is quite slow when querying a complete set of sequences
>>>> from an organism. Is it typical for biomart to take 7-8minutes + when
>>>> querying for whole genome sequences?
>>>>
>>>
>>> a- are you using GenomicSequence to retrieve sequences ?
>>>
>>> b- do you have ORDER BY property set on sequence exportables (if 'a' is
>>> true). Setting ORDER BY slows down the response considerably and its only
>>> required to cope with inconsistencies in the row order that should really be
>>> fixed on the mart (database) construction level.
>>>
>>> Best,
>>> Syed
>>>
>>>
>> Removing the ORDER BY does improve performance (from 15 minutes down to 5
>> minutes for an unfiltered FASTA grab of approx 30k peptides), but still not
>> really something that's user-tolerable.  Is that really the expected
>> behavior?
>>
>> --David
>>
>> David M. Goodstein
>> Joint Genome Institute / Lawrence Berkeley National Lab
>> Center for Integrative Genomics / UCBerkeley
>> http://www.phytozome.net
>>
>>
>>>
>>>  Thanks,
>>>> rochak
>>>>
>>>
>>

Re: [mart-dev] query performance on sequences

Reply via email to