Re: [sqlite] SQLite Use of Indexes

2013-07-14 Thread peter korinis
From: "James K. Lowden" <jklow...@schemamania.org>

To: sqlite-users@sqlite.org

Subject: Re: [sqlite] SQLite Use of Indexes

Message-ID: <20130712161038.b8b4df84.jklow...@schemamania.org>

Content-Type: text/plain; charset=US-ASCII

 

On Mon, 8 Jul 2013 15:32:21 -0400

"peter korinis" <kori...@earthlink.net> wrote:

 

/ thanks James, peter 

 

> . a CLAIMS table = 43M rows with indices on claim_no and

> stateCounty code; and 

> . a LINE table = 85M rows with indices on claim_no and HCPCS

> (a 5 char text code)

> . Have run ANALYZE

> . Using Win7

> Want to count occurrences of given HCPCS codes for given states.

 

SELECT COUNT(HCPSCD) CNT, HCPSCD, STATE_CD FROM claims ,
LINE

WHERE CLAIMS.CLAIM_NO == LINE.CLAIM_NO 

--   ^^ == is not SQL
 right, a little python slipped in /

AND HCPSCD IN ( '78451'

  , '78452'

  , '78453'

  , '78454'

  , '78469'

  , '78472'

  , '78473'

  , '78481'

  , '78483'

  , '78494'

  , '78496'

  , '78499' )

AND state_cd IN ('21', '45')

GROUP BY STATE_CD

 

>From a design perpective, you might want to think about what the IN

clause is all about.  Whatever they have in common probably belongs in

a table, so you'd be able to say 

 

SELECT COUNT(HCPSCD) CNT, HCPSCD, STATE_CD 

FROM claims, LINE, HPs

WHERE CLAIMS.CLAIM_NO = LINE.CLAIM_NO 

AND HCPSCD = HPs.code

AND HPs.category = 'foo'

AND state_cd IN ('21', '45')

GROUP BY STATE_CD

 

That suggests HCPSCD as the column to index; I suppose you have a lot

more different HCPSCDs  there are >7000 HCPSCDs ; I'm interested
in ~100 in 12 groups /

than STATE_CDs.  If the HPs table is indexed by

(category, code), then the join is on two ordered sets instead of an

extended OR.  

/// would joining 3 tables give better response ?? //

 

If HCPSCD is actually an integer, you may save space and time by

declaring it as such.  /// HCPSCD is NOT an integer //

 

HTH.

 

--jkl

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] SQLite Use of Indexes

2013-07-12 Thread James K. Lowden
On Mon, 8 Jul 2013 15:32:21 -0400
"peter korinis"  wrote:

> . a CLAIMS table = 43M rows with indices on claim_no and
> stateCounty code; and 
> 
> . a LINE table = 85M rows with indices on claim_no and HCPCS
> (a 5 char text code)
> 
> . Have run ANALYZE
> 
> . Using Win7
> 
>  
> 
> Want to count occurrences of given HCPCS codes for given states.

SELECT COUNT(HCPSCD) CNT, HCPSCD, STATE_CD FROM claims , LINE
WHERE CLAIMS.CLAIM_NO == LINE.CLAIM_NO 
--^^ == is not SQL
AND HCPSCD IN ( '78451'
  , '78452'
  , '78453'
  , '78454'
  , '78469'
  , '78472'
  , '78473'
  , '78481'
  , '78483'
  , '78494'
  , '78496'
  , '78499' )
AND state_cd IN ('21', '45')
GROUP BY STATE_CD

>From a design perpective, you might want to think about what the IN
clause is all about.  Whatever they have in common probably belongs in
a table, so you'd be able to say 

SELECT COUNT(HCPSCD) CNT, HCPSCD, STATE_CD 
FROM claims, LINE, HPs
WHERE CLAIMS.CLAIM_NO = LINE.CLAIM_NO 
AND HCPSCD = HPs.code
AND HPs.category = 'foo'
AND state_cd IN ('21', '45')
GROUP BY STATE_CD

That suggests HCPSCD as the column to index; I suppose you have a lot
more different HCPSCDs than STATE_CDs.  If the HPs table is indexed by
(category, code), then the join is on two ordered sets instead of an
extended OR.  

If HCPSCD is actually an integer, you may save space and time by
declaring it as such.  

HTH.

--jkl
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] SQLite Use of Indexes

2013-07-09 Thread Simon Slavin
On 9 Jul 2013, at 6:06pm, peter korinis <kori...@earthlink.net> wrote:

> So, to implement your suggestion of crafting better indices, here’s my 
> approach:
> 1.   First action is joining the 2 tables on claim_no. {Therefore 
> claim_no should be first row in index for both tables}
> 2.   Find state and county in claims table
> 3.   Find HCPSCD in line table
> So my 2 combined indices would be:
>CREATE INDEX idx_Claim_State_Cnty ON CLAIMS (CLAIM_NO, 
> STATE_CD, CNTY_CD)
>CREATE INDEX idx_Line_hcpscd ON LINE (CLAIM_NO, HCPSCD)

Those would be good indexes.  You can see how good by using EXPLAIN QUERY PLAN 
for your SELECT.  Or better still, actually try them out and time the results.

Might be interesting to then reverse the order of the columns in each index and 
try those.  See if they’re better or worse.  Or create lots of indexes, then 
use EXPLAIN QUERY PLAN and find out which indexes SQLite decided to use, then 
delete the others.

I am deliberately not giving you an absolute answer because you have a good 
large set of data for testing and you obviously understand the idea now.  
You’ll learn more by trying out several alternatives yourself.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] SQLite Use of Indexes

2013-07-09 Thread peter korinis
Thank you Simon for responding to my questions. Your phonebook (FName/LName)
analogy clearly explained why 2 indices per table per select won't work.

Let me provide a bit more info and a possible attempt to implement your
suggestions for better indices.

 

My 'bread and butter' query counts the frequency of HCPSCD occurrences for
each county of interest in each state of interest. [Occasionally I want
counts for entire state.] 

I generally have 6 groups of  1~15 related HCPSCD codes and want counts from
1-12 counties of a state. 

 

CLAIMS table columns: claim_no, state_cd, cnty_cd are columns of interest;
40 other columns rarely used.

LINE table columns: claim_no, hcpscd, plus 25 more columns (these are
repeating fields per claim)

 

[FYI: DB contains 44M claim numbers with 1~12 HCPSCD codes per claim,
several thousand HCPSCD codes of which I am interested in 44 in 6 groups, 50
states with 9~125 counties per state of which I am usually interested in
~10.]

 

I currently do a query like this .

SELECT COUNT(HCPSCD) CNT, HCPSCD, STATE_CD, CNTY_CD FROM CLAIMS , LINE

WHERE CLAIMS.CLAIM_NO == LINE.CLAIM_NO AND

HCPSCD IN
('78451','78452','78453','78454','78469','78472','78473','78481','78483','78
494','78496','78499')   -- this is first group of hcpscd codes always
queried

AND

STATE_CD = '21'

AND

CNTY_CD IN ('220', '345', '570')

GROUP BY CNTY_CD

UNION {repeat above SELECT with second group of hcpscd codes
for same state and county . and so on 4 more times}

 

So, to implement your suggestion of crafting better indices, here's my
approach:

1.   First action is joining the 2 tables on claim_no. {Therefore
claim_no should be first row in index for both tables}

2.   Find state and county in claims table

3.   Find HCPSCD in line table

So my 2 combined indices would be:

CREATE INDEX idx_Claim_State_Cnty ON CLAIMS (CLAIM_NO,
STATE_CD, CNTY_CD)

CREATE INDEX idx_Line_hcpscd ON LINE (CLAIM_NO, HCPSCD)

 

Is this what you were suggesting? Will these indices produce improved
performance from the single column indices I was using?

Thanks so much.

Peter

===

> 1.   Can SQLite use > 1 index per table per SELECT? Seems like using

> both indices for each table would be better than the single index per
table

> as chosen by the plan.

 

No, you?re right: one index per table, per SELECT.  Pretend you have a phone
book with two indexes: surname and first name.  You have to look up someone
called Miri Kallas.  You can use either of the indexes to look up one name
but once you?ve done that the other index is useless to you.

 

You have made up some indexes which are useful but not the most useful.
Drop those indexes and try to figure out one compound index on each table
which would be best for the SELECT you asked about.  Remember that this

 

CREATE INDEX it1 ON t1 (c1)

CREATE INDEX it2 ON t1 (c2)

 

does not do the same thing as

 

CREATE INDEX it12 ON t1 (c1, c2)

 

Can?t do it for you because I can?t tell which of your columns are from
which table.

 

> 2.   Is using something like "SELECT . FROM LINE INDEXED BY claim_no

> AND INDEXED BY hcpscd" possible? What is the correct syntax to include 2

> INDEXED BY this way?

 

If you have to tell SQLite which index to use you?re doing it wrong.  Make
up a good index and SQLite will decide to use it.  Especially if you?ve done
ANALYZE.

 

> 3.   Is there a better way to write this query, for example, would

> rearranging the order of AND criteria in WHERE clause improve performance?

 

The query optimizer is meant to do all that for you.  However, I suspect
that you may understand your query better if you get rid of some of your
'IN' clauses.  If you imagine doing 24 (= 2 * 12) different SELECTs, one for
each State and HCPSCD, what would your SELECT look like then ?  With good
indexes it should be possible to make each of those SELECTs execute
ridiculously fast.

 

Once you?ve figured out how to do that and make it run fast, /then/ you
might want to recombine the query for each State, though perhaps not put
both States in the same query.

 

> 4.   How do I interpret the 'order' and 'from' in the query plan

> results?

 

It?s showing you what each of your indexes is being used for.  And what is
shows is the neither index is being used for both selecting records and
arranging the order of results.

 

Simon.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] SQLite Use of Indexes

2013-07-08 Thread Simon Slavin

On 8 Jul 2013, at 8:32pm, peter korinis  wrote:

> 1.   Can SQLite use > 1 index per table per SELECT? Seems like using
> both indices for each table would be better than the single index per table
> as chosen by the plan.

No, you’re right: one index per table, per SELECT.  Pretend you have a phone 
book with two indexes: surname and first name.  You have to look up someone 
called Miri Kallas.  You can use either of the indexes to look up one name but 
once you’ve done that the other index is useless to you.

You have made up some indexes which are useful but not the most useful.  Drop 
those indexes and try to figure out one compound index on each table which 
would be best for the SELECT you asked about.  Remember that this

CREATE INDEX it1 ON t1 (c1)
CREATE INDEX it2 ON t1 (c2)

does not do the same thing as

CREATE INDEX it12 ON t1 (c1, c2)

Can’t do it for you because I can’t tell which of your columns are from which 
table.

> 2.   Is using something like "SELECT . FROM LINE INDEXED BY claim_no
> AND INDEXED BY hcpscd" possible? What is the correct syntax to include 2
> INDEXED BY this way?

If you have to tell SQLite which index to use you’re doing it wrong.  Make up a 
good index and SQLite will decide to use it.  Especially if you’ve done ANALYZE.

> 3.   Is there a better way to write this query, for example, would
> rearranging the order of AND criteria in WHERE clause improve performance?

The query optimizer is meant to do all that for you.  However, I suspect that 
you may understand your query better if you get rid of some of your 'IN' 
clauses.  If you imagine doing 24 (= 2 * 12) different SELECTs, one for each 
State and HCPSCD, what would your SELECT look like then ?  With good indexes it 
should be possible to make each of those SELECTs execute ridiculously fast.

Once you’ve figured out how to do that and make it run fast, /then/ you might 
want to recombine the query for each State, though perhaps not put both States 
in the same query.

> 4.   How do I interpret the 'order' and 'from' in the query plan
> results?

It’s showing you what each of your indexes is being used for.  And what is 
shows is the neither index is being used for both selecting records and 
arranging the order of results.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] SQLite Use of Indexes

2013-07-08 Thread peter korinis
I have a large DB and want to improve query performance.

 

Context: the DB contains 2 tables representing nationwide claim data: 

. a CLAIMS table = 43M rows with indices on claim_no and stateCounty
code; and 

. a LINE table = 85M rows with indices on claim_no and HCPCS (a 5
char text code)

. Have run ANALYZE

. Using Win7

 

Want to count occurrences of given HCPCS codes for given states.

Here is the SQL query I use:

SELECT COUNT(HCPSCD) CNT, HCPSCD, STATE_CD FROM claims , LINE

WHERE CLAIMS.CLAIM_NO == LINE.CLAIM_NO AND

HCPSCD IN
('78451','78452','78453','78454','78469','78472','78473','78481','78483','78
494','78496','78499')

AND

state_cd IN ('21', '45')

GROUP BY STATE_CD

;

Query Plan results

   Order  From

0  1  TABLE LINE WITH INDEX
idx_line_hcpscd

1  0  TABLE claims WITH INDEX
idx_claims_claim_no

 

QUESTIONS:

1.   Can SQLite use > 1 index per table per SELECT? Seems like using
both indices for each table would be better than the single index per table
as chosen by the plan.

2.   Is using something like "SELECT . FROM LINE INDEXED BY claim_no
AND INDEXED BY hcpscd" possible? What is the correct syntax to include 2
INDEXED BY this way?

3.   Is there a better way to write this query, for example, would
rearranging the order of AND criteria in WHERE clause improve performance?

4.   How do I interpret the 'order' and 'from' in the query plan
results?

 

Thanks much for your help,

peter 

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users