Re: [sqlite] [SOLVED] Re: [sqlite] LIKE and GLOB bug with numbers?

2006-03-29 Thread Dennis Cote

Tito Ciuro wrote:



Due to application requirements, I must rely on LIKE and GLOB in
order to match data, i.e. to find data that contains some value
(case-sensitive or case-insensitive match).


Now, it seems to me that using either LIKE or GLOB will force a row  
scan anyhow, since it can't use the index, correct? So your solution  
would force me to perform the row scan for 250K rows when matching  
values. That is 10 times more data to scan through.




Tito,

I'm glad to see you are looking at this.

First, the attribute value table is storing the same string data in my 
proposal and your value table. In my case there is some overhead for the 
additional rowids (3 bytes for 250K rows), and the FileId column (2 
bytes for 25K files), but in your case you have the overhead of the 
field separator commas (13 bytes in your example with 14 attributes). My 
table has 250K short rows, and yours has 25K long rows. There are 10 
times as many rows in my table, but your rows are about 10 times as
large. There may be a small difference in the total amount of data, but
it is not going to be significant one way or the other.


The real heart of the problem is your requirement to match data using 
LIKE due to "application requirements". Can you expand on that statement 
so that I can understand the details of your requirements?


From your examples, what you are doing is using LIKE to do equality
tests for a subfield in your large value column. With the values
separated into separate rows, there is no need to use LIKE to do that.


  WHERE CMValues LIKE '%2004-12-16 10:11:35 -0800%'

would become

   WHERE Attribute.Value = '2004-12-16 10:11:35 -0800'

If you need case insensitive searches you simply convert the value 
string to the same case as your test string in the equality test. Or if 
you are unsure of the case of the test string, you can convert them both.


  WHERE upper(Attribute.Value) = 'ABIWORD.PROFILE'

or

  WHERE upper(Attribute.Value) = upper(:match_value)

Now, these comparisons will scan all the values, since they don't 
specify which attribute you want to check. This may be what you want if 
you are searching for a date and you don't care if it matches the create 
date or the modification date, but generally that is not what you want 
to do (Why look at file size or album name data when looking for a 
date?). Usually you will want to check the value of a particular 
attribute as shown below.


   WHERE AttributeDefinition.Name = 'Modification Date'
     AND Attribute.Value = :mod_date

This query will use the AttributeValue index to reduce the rows scanned 
to only those that hold values of this attribute. In this case each file 
has a mod date so it will search 25K rows. For a query that is looking
for an attribute like 'Album Name', which only exists for your 1K music
files, it will only scan those 1K rows checking for album name matches.
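Spelled out with the joins included, that lookup might look like the sketch below (Python's sqlite3 module here just so it is runnable; the schema is the three-table layout from my earlier message, and the file name and date values are placeholders for illustration):

```python
import sqlite3

# Sketch: the named-attribute lookup with the joins written out.
# Schema follows the three-table layout from earlier in the thread;
# the inserted file and date values are made up.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE File (
        Id   INTEGER PRIMARY KEY,
        Name TEXT
    );
    CREATE TABLE AttributeDefinition (
        Id   INTEGER PRIMARY KEY,
        Name TEXT UNIQUE,
        Type TEXT
    );
    CREATE TABLE Attribute (
        FileId INTEGER REFERENCES File(Id),
        AttId  INTEGER REFERENCES AttributeDefinition(Id),
        Value  TEXT,
        PRIMARY KEY (FileId, AttId)
    );
    CREATE INDEX AttributeValue ON Attribute(AttId, Value);
""")
con.execute("INSERT INTO File VALUES (1, 'AbiWord.Profile')")
con.execute("INSERT INTO AttributeDefinition VALUES (1, 'Modification Date', 'date')")
con.execute("INSERT INTO Attribute VALUES (1, 1, '2006-03-26 08:35:55 -0800')")

# Equality on a named attribute: the AttributeValue index narrows the
# scan to rows of that one attribute instead of all value rows.
rows = con.execute("""
    SELECT File.Name
    FROM Attribute
    JOIN File ON File.Id = Attribute.FileId
    JOIN AttributeDefinition ON AttributeDefinition.Id = Attribute.AttId
    WHERE AttributeDefinition.Name = 'Modification Date'
      AND Attribute.Value = '2006-03-26 08:35:55 -0800'
""").fetchall()
print(rows)  # [('AbiWord.Profile',)]
```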
 
In some cases you may want to look in a set of attributes, which you can 
do as shown below.


   WHERE AttributeDefinition.Name IN ('Modification Date',
                                      'Create Date', 'Backup Date')
     AND Attribute.Value = :date

The possibilities are, quite literally, endless. If you have some sample
queries that you need to make, please tell me what they are so we can
talk about more concrete examples.


HTH
Dennis Cote




Re: [sqlite] [SOLVED] Re: [sqlite] LIKE and GLOB bug with numbers?

2006-03-28 Thread Tito Ciuro

Hi Dennis,

I've been reading your email carefully and I'd like to comment on it.

On 28/03/2006, at 14:24, Dennis Cote wrote:

With these tables you will have 25K rows in the File table, one per  
file, and 250K rows in the Attribute table assuming an average of  
10 attributes per file (your example had 14).


An index on the attribute name and value pairs will greatly speed up  
lookups of a particular attribute.


There is still some redundancy in this set of tables since the  
names of the attributes are being repeated over and over in the  
Attribute table. This may not be a big problem, but you really only  
have 100 attributes, so there is no sense storing 25K copies of the  
name of a common attribute that applies to all files. This  
information can be factored out into a third table as shown below.


[...] This will give you three tables (which will not cause a  
noticeable difference in the database open time). One with 25K rows  
of file names. One with 100 rows of attribute definition data,  
which includes the attribute name.  And one with  250K rows of  
attribute value data.


Due to application requirements, I must rely on LIKE and GLOB in
order to match data, i.e. to find data that contains some value
(case-sensitive or case-insensitive match).


Now, it seems to me that using either LIKE or GLOB will force a row  
scan anyhow, since it can't use the index, correct? So your solution  
would force me to perform the row scan for 250K rows when matching  
values. That is 10 times more data to scan through.


I'm not sure if there is a way to speed up LIKE or GLOB queries such as:

SELECT myvalue FROM files WHERE myvalue LIKE '%finit%';

Thanks a lot for the advice.

Regards,

-- Tito


Re: [sqlite] [SOLVED] Re: [sqlite] LIKE and GLOB bug with numbers?

2006-03-28 Thread Doug Currie
Tito,

I am replying off list because I don't want to contribute to turning
this thread into a flame war...

You have shown wonderful patience and politeness on this list. You
have my gratitude and respect for this.

Regards,

e

Tuesday, March 28, 2006, 5:35:37 PM, Tito Ciuro wrote:

> Hi Dennis,

> On 28/03/2006, at 14:24, Dennis Cote wrote:

>> If you give this a try, you might be surprised at how flexible it is.

> That was a very clear explanation. I will sure give it a try when I
> have a chance.

>> Just because something works doesn't mean that it can't be improved.

> I couldn't agree more with you.

> Thanks for the help!

> Regards,

> -- Tito



-- 
Doug Currie
Londonderry, NH



RE: [sqlite] [SOLVED] Re: [sqlite] LIKE and GLOB bug with numbers?

2006-03-28 Thread Fred Williams
Very well thought out response.

Now perhaps we will see if various people here attempting to help have
been, as we say in east Texas, "Talking to a fence post." :-)

Fred

> -----Original Message-----
> From: Dennis Cote [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 28, 2006 4:24 PM
> To: sqlite-users@sqlite.org
> Subject: Re: [sqlite] [SOLVED] Re: [sqlite] LIKE and GLOB bug with
> numbers?
>
>
> Tito Ciuro wrote:
>
> >
> > I have no idea why you're so angry. Anyway, there are so
...
> While the tone of MGC's post may have been a little over the
> top, many
> of his points were valid.
>
> All he was really suggesting is that a set of normalized tables
> would do the same thing, and probably perform better, especially
> when scaled to larger tables.
...



Re: [sqlite] [SOLVED] Re: [sqlite] LIKE and GLOB bug with numbers?

2006-03-28 Thread Dennis Cote

Tito Ciuro wrote:



I have no idea why you're so angry. Anyway, there are so many things  
I can think of saying, I'll just make it brief and to the point.


This thing won't scale. I'd like to see it when you have the 4.5  
million records my database contains,

and that is still tiny for all intents and purposes.



For the type of application I'm building, it doesn't need to scale.  
At most I'll have 25K records. Even with the 100K database I've used  
for testing, it simply flies.


I'm not building anything for the outside world. Just something that  
serves me well.


I'm storing variable-length data, with a very different set of  
attributes. Some may have 1 attribute, others tens of them. Perhaps  
even a hundred. Using a column per attribute is not a good idea. A  
few days ago I asked this question and Dr. Hipp mentioned:



The more tables you have, the slower the first query will run
and the more memory SQLite will use.  For long-running applications
where the startup time is not a significant factor, 100s or
1000s of tables is fine.  For a CGI script that starts itself
up anew several times per second, then you should try to keep
the number of tables below a 100, I think.  Less than that if
you can. You should also try and keep down the number of tables
in low-memory embedded applications, in order to save on memory
usages.  Each table takes a few hundred bytes of memory - depending
on the number of columns and features.



Having two columns (one for the key and the other one for the data  
itself) seems like a good balance between speed and ease of use. I  
don't care if it doesn't scale because the intended deployment is 25K  
at most, as I said earlier. Even with data sets 4x that size, it works fine.


There is absolutely no reason this could not be properly designed  
and STILL fit any possible need for that ugly packed record.




Tito,

While the tone of MGC's post may have been a little over the top, many 
of his points were valid.


All he was really suggesting is that a set of normalized tables would do
the same thing, and probably perform better, especially when scaled to
larger tables.


In your application you have 25K files each with 1 to 100 attributes. 
You can still store your data in two normalized tables.


The main idea is to use more than one row in the value table for each 
file. These values are all tied to the file name by a common file id 
value. This table schema is shown below.


   -- basic two table attribute setup

   create table File (
       Id      integer primary key,
       Name    text        -- name of file including path
   );

   create table Attribute (
       FileId  integer references File(Id),  -- foreign key
       Name    text,       -- name of attribute
       Value   text,       -- value of attribute
       primary key (FileId, Name)
   );

   create index AttributeValue on Attribute(Name, Value);

With these tables you will have 25K rows in the File table, one per 
file, and 250K rows in the Attribute table assuming an average of 10 
attributes per file (your example had 14).


An index on the attribute name and value pairs will greatly speed up
lookups of a particular attribute.
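As a runnable sketch of that two-table layout (Python's sqlite3 module here; the sample attribute names and values are invented for illustration):

```python
import sqlite3

# Sketch of the two-table layout above; the attribute rows are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE File (
        Id   INTEGER PRIMARY KEY,
        Name TEXT
    );
    CREATE TABLE Attribute (
        FileId INTEGER REFERENCES File(Id),
        Name   TEXT,
        Value  TEXT,
        PRIMARY KEY (FileId, Name)
    );
    CREATE INDEX AttributeValue ON Attribute(Name, Value);
""")
con.execute("INSERT INTO File VALUES (1, 'Application Support/AbiSuite/AbiWord.Profile')")
con.executemany("INSERT INTO Attribute VALUES (?, ?, ?)", [
    (1, 'Owner', 'tciuro'),
    (1, 'Modification Date', '2006-03-26 08:35:55 -0800'),
])

# Equality on (Name, Value) is satisfied from the index rather than a
# scan of every attribute row.
rows = con.execute(
    "SELECT FileId FROM Attribute WHERE Name = ? AND Value = ?",
    ('Owner', 'tciuro')).fetchall()
print(rows)  # [(1,)]
```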


There is still some redundancy in this set of tables since the names of 
the attributes are being repeated over and over in the Attribute table. 
This may not be a big problem, but you really only have 100 attributes, 
so there is no sense storing 25K copies of the name of a common 
attribute that applies to all files. This information can be factored 
out into a third table as shown below.



   -- Factor out the attribute definitions into a separate table

   create table File (
       Id      integer primary key,
       Name    text            -- name of file including path
   );

   create table AttributeDefinition (
       Id      integer primary key,
       Name    text unique,    -- name of the attribute
       Type    text            -- type of the attribute
   );

   create table Attribute (
       FileId  integer references File(Id),
       AttId   integer references AttributeDefinition(Id),
       Value   text,           -- value of attribute
       primary key (FileId, AttId)
   );

   create index AttributeValue on Attribute(AttId, Value);

This will give you three tables (which will not cause a noticeable 
difference in the database open time). One with 25K rows of file names. 
One with 100 rows of attribute definition data, which includes the 
attribute name.  And one with  250K rows of attribute value data.
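You can verify that SQLite picks the AttributeValue index for such lookups by asking for the query plan (a sketch using Python's sqlite3; same schema as above, no data needed):

```python
import sqlite3

# Ask SQLite how it would execute an equality lookup on one attribute
# under the three-table schema above.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE File (
        Id   INTEGER PRIMARY KEY,
        Name TEXT
    );
    CREATE TABLE AttributeDefinition (
        Id   INTEGER PRIMARY KEY,
        Name TEXT UNIQUE,
        Type TEXT
    );
    CREATE TABLE Attribute (
        FileId INTEGER REFERENCES File(Id),
        AttId  INTEGER REFERENCES AttributeDefinition(Id),
        Value  TEXT,
        PRIMARY KEY (FileId, AttId)
    );
    CREATE INDEX AttributeValue ON Attribute(AttId, Value);
""")

plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT FileId FROM Attribute WHERE AttId = 1 AND Value = 'x'
""").fetchall()
# The last column of each plan row is the human-readable detail string;
# it names the AttributeValue index rather than a full table scan.
detail = " ".join(str(row[-1]) for row in plan)
print(detail)
```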


The third table assigns each attribute an id number which will fit in a 
single byte, and that id is used to replace the 25K copies of the 
attribute name. This will be a sizable reduction in stored data for all 
but the shortest names. It also allows other definition data to be 

Re: [sqlite] [SOLVED] Re: [sqlite] LIKE and GLOB bug with numbers?

2006-03-28 Thread Tito Ciuro

John,

Did you read my replies at all? If not, please take 15 seconds to do  
so. I thanked *all of you* in all three emails.


Just in case, if that wasn't enough, allow me to do that for the
fourth time: I most sincerely appreciate the time and effort that you
guys have taken to answer my emails. I really do appreciate your help.


Best regards,

-- Tito

On 28/03/2006, at 13:45, John Stanton wrote:


Tito,

Several people have bothered to spend the time and effort to give  
you some well considered help.  That is what forums like this are  
for, professional guidance and development.  Thank them.


Re: [sqlite] [SOLVED] Re: [sqlite] LIKE and GLOB bug with numbers?

2006-03-28 Thread John Stanton

Tito,

Several people have bothered to spend the time and effort to give you 
some well considered help.  That is what forums like this are for, 
professional guidance and development.  Thank them.


Tito Ciuro wrote:

MGC,

I have no idea why you're so angry. Anyway, there are so many things  I 
can think of saying, I'll just make it brief and to the point.


1) Regarding your statement:

This thing won't scale. I'd like to see it when you have the 4.5  
million records my database contains,

and that is still tiny for all intents and purposes.



For the type of application I'm building, it doesn't need to scale.  At 
most I'll have 25K records. Even with the 100K database I've used  for 
testing, it simply flies.


I'm not building anything for the outside world. Just something that  
serves me well.


2) Regarding:


Absolutely NO ONE  suggested moving anything out of "SQLite-land".



What!?!? You mentioned it two days ago!:


Stuff it into a sorted flat file.
that would be faster and simpler.



3) Regarding your statement:

As to your 'real good reason' for doing it this way,  I'd bet cash  
money it's crap and based on nothing more than 'Because that's the  
way I decided to do it, and I'm smart'.



Talk about making things up... you're a funny guy :-)

I'm storing variable-length data, with a very different set of  
attributes. Some may have 1 attribute, others tens of them. Perhaps  
even a hundred. Using a column per attribute is not a good idea. A  few 
days ago I asked this question and Dr. Hipp mentioned:



The more tables you have, the slower the first query will run
and the more memory SQLite will use.  For long-running applications
where the startup time is not a significant factor, 100s or
1000s of tables is fine.  For a CGI script that starts itself
up anew several times per second, then you should try to keep
the number of tables below a 100, I think.  Less than that if
you can. You should also try and keep down the number of tables
in low-memory embedded applications, in order to save on memory
usages.  Each table takes a few hundred bytes of memory - depending
on the number of columns and features.



Having two columns (one for the key and the other one for the data  
itself) seems like a good balance between speed and ease of use. I  
don't care if it doesn't scale because the intended deployment is 25K  
at most, as I said earlier. Even with data sets 4x that size, it works fine.


There is absolutely no reason this could not be properly designed  and 
STILL fit any possible need for that ugly packed record.



So you know it all, eh? And you call me arrogant? :-)

I'm very happy with the solution: it's speedy and simple. As for the
original question I posted, I'm also glad to report that LIKE and GLOB
work fine.


Cheers,

-- Tito




Re: [sqlite] [SOLVED] Re: [sqlite] LIKE and GLOB bug with numbers?

2006-03-28 Thread Tito Ciuro

MGC,

I have no idea why you're so angry. Anyway, there are so many things  
I can think of saying, I'll just make it brief and to the point.


1) Regarding your statement:

This thing won't scale. I'd like to see it when you have the 4.5  
million records my database contains,

and that is still tiny for all intents and purposes.


For the type of application I'm building, it doesn't need to scale.  
At most I'll have 25K records. Even with the 100K database I've used  
for testing, it simply flies.


I'm not building anything for the outside world. Just something that  
serves me well.


2) Regarding:


Absolutely NO ONE  suggested moving anything out of "SQLite-land".


What!?!? You mentioned it two days ago!:


Stuff it into a sorted flat file.
that would be faster and simpler.


3) Regarding your statement:

As to your 'real good reason' for doing it this way,  I'd bet cash  
money it's crap and based on nothing more than 'Because that's the  
way I decided to do it, and I'm smart'.


Talk about making things up... you're a funny guy :-)

I'm storing variable-length data, with a very different set of  
attributes. Some may have 1 attribute, others tens of them. Perhaps  
even a hundred. Using a column per attribute is not a good idea. A  
few days ago I asked this question and Dr. Hipp mentioned:



The more tables you have, the slower the first query will run
and the more memory SQLite will use.  For long-running applications
where the startup time is not a significant factor, 100s or
1000s of tables is fine.  For a CGI script that starts itself
up anew several times per second, then you should try to keep
the number of tables below a 100, I think.  Less than that if
you can. You should also try and keep down the number of tables
in low-memory embedded applications, in order to save on memory
usages.  Each table takes a few hundred bytes of memory - depending
on the number of columns and features.


Having two columns (one for the key and the other one for the data  
itself) seems like a good balance between speed and ease of use. I  
don't care if it doesn't scale because the intended deployment is 25K  
at most, as I said earlier. Even with data sets 4x that size, it works fine.


There is absolutely no reason this could not be properly designed  
and STILL fit any possible need for that ugly packed record.


So you know it all, eh? And you call me arrogant? :-)

I'm very happy with the solution: it's speedy and simple. As for
the original question I posted, I'm also glad to report that LIKE and
GLOB work fine.


Cheers,

-- Tito


Re: [sqlite] [SOLVED] Re: [sqlite] LIKE and GLOB bug with numbers?

2006-03-28 Thread m christensen



Tito Ciuro wrote:



Oh!... just a quick message for the row-scan non-believers out there:  
SQLite flies, even under this scenario. I'm getting wonderful  
performance numbers by keeping everything within SQLite-land. The  
code is simple and I let SQLite do all the magic. What else can I ask  
for? :-)



If only you would listen enough to understand how clueless this 
statement is


You are clearly too arrogant to even listen to the reasons you have 
built and will continue to build

poor relational database designs.

This thing won't scale. I'd like to see it when you have the 4.5 million 
records my database contains,

and that is still tiny for all intents and purposes.

Absolutely NO ONE  suggested moving anything out of "SQLite-land".

It was pointed out that:

You have defeated the purpose of even using a relational database; this
thing is just a data dump.

You have neglected the most basic rules of design.
You will never have the advantage of using an index.
You are forcing the database to read and parse every single record from 
that table every single time you

select a set of records.

If you refuse to understand why it is better to design a properly 
normalized database that will find and return a single record
from a table with 4.5 million rows in 2 dozen disk reads, or fewer if 
properly cached, rather than to force it to
read and parse every single one of those 4.5 million rows, you deserve 
what you get.


As to your 'real good reason' for doing it this way,  I'd bet cash money 
it's crap and based on nothing more than

'Because that's the way I decided to do it, and I'm smart'.

There is absolutely no reason this could not be properly designed and 
STILL fit any possible need for that ugly

packed record.

MGC





[sqlite] [SOLVED] Re: [sqlite] LIKE and GLOB bug with numbers?

2006-03-28 Thread Tito Ciuro

Hi Martin,

I haven't been able to catch up on my email until this morning.
Sorry about that.


As it turns out, the error was on my side (surprise). The code seemed
to be working fine: early tests showed data being churned as expected,
so I started cleaning it up and refactoring the code. To make a long
story short, I accidentally deleted the line that was incrementing the
index that kept in sync with the array enumerator. Result: I was always
storing element zero in the database. No wonder I had 4000+ entries
looking exactly the same! The first version of the code was working
fine, so I assumed that LIKE and GLOB were the culprits after I started
noticing weird results.


Oh!... just a quick message for the row-scan non-believers out there:  
SQLite flies, even under this scenario. I'm getting wonderful  
performance numbers by keeping everything within SQLite-land. The  
code is simple and I let SQLite do all the magic. What else can I ask  
for? :-)


Thanks everybody for your help and comments.

Regards,

-- Tito

On 27/03/2006, at 3:40, Martin Jenkins wrote:


Tito,

I knocked up a quick test with python and apsw and it worked as  
intended. My data isn't exactly the same as yours in that I don't  
have the variety in the keys, but you're not having problems with  
those. My test database contains your data with/without embedded  
carriage returns - as expected, it makes no difference.


In the following, zip(..) is a quick hack to get all the results  
from the query. The spurious '[', ']' and other brackets  
surrounding the results are a result of the way that apsw returns  
data (as lists of python tuples).


Apologies for the extreme width of the following lines. :(

zip(csr.execute("select * from t"))

[
(('file5809', '(0,NSFileTypeRegular,0,22537,0,staff, 
234881026,294022,2004-12-16 10:11:00 -0800,tciuro,384,2006-03-26  
08:01:55 -0800,502,20)'),),
(('file0101581a', '(1,NSFileTypeRegular,1,22554,0,staff, 
234881026,294022,2004-12-16 10:11:03 -0800,tciuro,384,2006-03-26  
08:04:55 -0800,502,20)'),),
(('file0202582b', '(2,NSFileTypeRegular,2,22571,0,staff, 
234881026,294022,2004-12-16 10:11:06 -0800,tciuro,384,2006-03-26  
08:07:55 -0800,502,20)'),),

...
(('file595d', '(\n   0,\nNSFileTypeRegular,\n0,\n  
22877,\n0,\nstaff,\n234881026,\n294022,\n 
2004-12-16 10:11:00 -0800,\ntciuro,\n384,\n2006-03-26  
08:01:55 -0800,\n 502,\n20\n)'),),
(('file0101596e', '(\n   1,\nNSFileTypeRegular,\n1,\n  
22894,\n0,\nstaff,\n234881026,\n294022,\n 
2004-12-16 10:11:03 -0800,\ntciuro,\n384,\n2006-03-26  
08:04:55 -0800,\n 502,\n20\n)'),), (('file0202597f',
'(\n   2,\nNSFileTypeRegular,\n2,\n22911,\n0,\n 
staff,\n 234881026,\n294022,\n2004-12-16 10:11:06 -0800, 
\ntciuro,\n 384,\n2006-03-26 08:07:55 -0800,\n502,\n 
20\n)'),),

...
]

zip(csr.execute("SELECT * FROM t WHERE CMValues GLOB '*2004-12-16  
10:11:45 -0800*'"))


[
(('file15155908', '(15,NSFileTypeRegular,15,22792,0,staff, 
234881026,294022,2004-12-16 10:11:45 -0800,tciuro,384,2006-03-26  
08:46:55 -0800,502,20)'),),
(('file15155a5c', '(\n   15,\nNSFileTypeRegular,\n15,\n  
23132,\n0,\nstaff,\n234881026,\n294022,\n 
2004-12-16 10:11:45 -0800,\ntciuro,\n384,\n2006-03-26  
08:46:55 -0800,\n 502,\n20\n)'),)

]

zip(csr.execute("SELECT * FROM t WHERE CMValues LIKE '%2004-12-16  
10:11:45 -0800%'"))


[
(('file15155908', '(15,NSFileTypeRegular,15,22792,0,staff, 
234881026,294022,2004-12-16 10:11:45 -0800,tciuro,384,2006-03-26  
08:46:55 -0800,502,20)'),),
(('file15155a5c', '(\n   15,\nNSFileTypeRegular,\n15,\n  
23132,\n0,\nstaff,\n234881026,\n294022,\n 
2004-12-16 10:11:45 -0800,\ntciuro,\n384,\n2006-03-26  
08:46:55 -0800,\n 502,\n20\n)'),)

]

Could you try reducing your search strings and see if there's a  
point at which they start working?
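If you don't have apsw handy, the same check is easy to reproduce with the standard library's sqlite3 module (a condensed sketch keeping only two representative rows from the data above):

```python
import sqlite3

# Condensed version of the apsw test using the stdlib sqlite3 module.
# Two representative rows: only the second contains the target date.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (CMKey TEXT, CMValues TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [
    ('file5809',
     '(0,NSFileTypeRegular,0,22537,0,staff,234881026,294022,'
     '2004-12-16 10:11:00 -0800,tciuro,384,'
     '2006-03-26 08:01:55 -0800,502,20)'),
    ('file15155908',
     '(15,NSFileTypeRegular,15,22792,0,staff,234881026,294022,'
     '2004-12-16 10:11:45 -0800,tciuro,384,'
     '2006-03-26 08:46:55 -0800,502,20)'),
])

glob_hits = con.execute(
    "SELECT CMKey FROM t WHERE CMValues GLOB '*2004-12-16 10:11:45 -0800*'"
).fetchall()
like_hits = con.execute(
    "SELECT CMKey FROM t WHERE CMValues LIKE '%2004-12-16 10:11:45 -0800%'"
).fetchall()
print(glob_hits, like_hits)  # both [('file15155908',)]
```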


HTH,

Martin Jenkins
XQP Ltd
Ascot, UK

- Original Message - From: "Tito Ciuro" <[EMAIL PROTECTED]>
To: "Forum SQLite" 
Sent: Sunday, March 26, 2006 6:50 PM
Subject: [sqlite] LIKE and GLOB bug with numbers?



Hello,

I've populated a datafile with 40.176 records which contain file  
attributes and file paths. I have two columns, CMKey and  
CMValues.  The column CMKey contains the path to the file and the  
column  CMValues contains the attribute values. For example:


CMKey: Application Support/AbiSuite/AbiWord.Profile

CMValues:
(
0,
NSFileTypeRegular,
1,
21508,
0,
staff,
234881026,
294022,
2004-12-16 10:11:35 -0800,
tciuro,
384,
2006-03-26 08:35:55 -0800,
502,
20
)

Both columns are of type TEXT.

This is what I've found:

1) SELECT * FROM FinderFiles WHERE CMKey GLOB '*AbiWord.Profile*'   
returns 1 match. This is correct.


2) SELECT *