Re: [sqlite] Is it possible to optimize this query on a very large database table
Igor Tandetnik, >> How come you only show one function? A user-defined aggregate function is >> actually represented by two C[++] functions - one that is called for every >> row >> and performs actual aggregation, and another that's called at the end of >> each >> group, reports the result and resets the state machine to prepare for the >> next >> group. You can use sqlite3_context to store state between invocations - see >> sqlite3_aggregate_context. We have defined 2 C++ function XStep and XFinalize(shown below). The group by last name BLOB results look accurate. Thank you for your help. void cIntersectingGroupCache::XFinalize(sqlite3_context *context){ listCtx *p; char *buf=NULL; buf = (char *) malloc ((sizeof(int) * ((cIntersectingGroupCache*)(p->TheThis))->Column2.size())+ 4); if (buf == NULL) printf("malloc error in XFinalize, buf\n"); sqlite3_result_blob(context,buf, (((cIntersectingGroupCache*)(p->TheThis))->Column2.size()*sizeof(int)) + 4, free); ((cIntersectingGroupCache*)(p->TheThis))->Column2.clear(); } ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Is it possible to optimize this query on a very large database table
Frank Chang wrote: >The explain query plan for select FieldName, min(rowid) from > BlobLastNameTest group by FieldName shows a full index scan. Of course. How else do you expect to be able to look at every FieldName? You seem to be expecting some kind of a "compressed" index - an index with as many entried as there are distinct values of FieldName, each entry pointing to... I'm not sure what, exactly. There ain't no such thing. An index on FieldName still has as many entries as there are rows in the underlying table - it's just sorted by FieldName. To get a list of all distinct values of FieldName, SQLite has to scan this index, and simply discard any value that is equal to that from previous row. If you think you need such a "compressed index", you would have to maintain it yourself, as a separate table. Personally, based on the description of your problem, I don't think you need any such thing. > Here is how I might do the update: > > 1. CREATE TABLE FOO(FIELDNAME CHAR(25), IDROW INT); > 2 INSERT INTO FOO select FieldName, min(rowid) from BlobLastNameTest > group by FieldName > 3. INSERT OR REPLACE INTO BLOBLASTNAMETEST SELECT t1.FIELDNAME, > UDF(t1.ROWID,t1.FIELDNAME,this,'BLOBLASTNAMETEST') FROM > FOO WHERE BLOBLASTNAMETEST.FIELDNAME = FOO.FIELDNAME AND > BLOBLASTNAMETEST.ROWID = FOO.IDROW. This last query makes no sense to me. It refers to BLOBLASTNAMETEST in WHERE clause though it was never mentioned in FROM clause. It refers to identifiers "t1" and "this" that were never declared. How about this? Drop steps 1 and 2, and run this statement instead: insert or replace into BlobLastNameTest(rowid, FieldName, Vertices) select min(rowid), FieldName, MyAccumulation(Vertices) from BlobLastNameTest group by FieldName; where MyAccumulation is a custom aggregate function that does whatever you mean by "accumulate". This way, you should be able to do everything in a single pass. See http://sqlite.org/c3ref/create_function.html , the description of xStep and xFinal parameters, for an explanation of how to set up a custom aggregate function. -- Igor Tandetnik ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Is it possible to optimize this query on a very large database table
Igor Tandetnik, The explain query plan for select FieldName, min(rowid) from BlobLastNameTest group by FieldName shows a full index scan, even after I run ANALYZE TABLE AND INDEX ON FIELDNAME, which could be problematic when the number of rows in BlobLastNameTest increases by a factor 10 to 30 million rows. But since Sqlite does not have a DISTINCT ON clause as Florian Weimer as pointed out two days ago,we will have to investigate this issue in another possible thread. Here is how I might do the update: 1. CREATE TABLE FOO(FIELDNAME CHAR(25), IDROW INT); 2 INSERT INTO FOO select FieldName, min(rowid) from BlobLastNameTest group by FieldName 3. INSERT OR REPLACE INTO BLOBLASTNAMETEST SELECT t1.FIELDNAME, UDF(t1.ROWID,t1.FIELDNAME,this,'BLOBLASTNAMETEST') FROM FOO WHERE BLOBLASTNAMETEST.FIELDNAME = FOO.FIELDNAME AND BLOBLASTNAMETEST.ROWID = FOO.IDROW. WHERE THE UDF looks like this: void cIntersectingGroupCache::UDFFunc(sqlite3_context *context, int argc, sqlite3_value **argv){ char* TableName; int size; int* ip2; long long int iVal; ((cIntersectingGroupCache*)(sqlite3_value_int(argv[2])))->Column.clear(); switch( sqlite3_value_type(argv[0]) ){ case SQLITE_INTEGER: { iVal = sqlite3_value_int64(argv[0]); iVal = ( iVal > 0) ? 1: ( iVal < 0 ) ? -1: 0; sqlite3_result_int64(context, iVal); break; } case SQLITE_NULL: { size = sqlite3_value_bytes(argv[3]); TableName = new char[size + 1]; memcpy(TableName, sqlite3_value_text(argv[3]),size); TableName[size] = '\x0'; ((cIntersectingGroupCache*)(sqlite3_value_int(argv[2])))->RowCountMap[TableName] += 1; delete [] TableName; break; } case SQLITE_BLOB: { size = sqlite3_value_bytes(argv[0]); ip2 = (int *)sqlite3_value_blob(argv[0]); for (int i = 0; i < size/sizeof(int); i++){ ((cIntersectingGroupCache*)(sqlite3_value_int(argv[2])))->Column.push_back(ip2[i]); } break; } default: { break; } } switch( sqlite3_value_type(argv[1]) ){ case SQLITE_INTEGER: { int iVal = sqlite3_value_int(argv[1]); ((cIntersectingGroupCache*)(sqlite3_value_int(argv[2])))->Column.push_back(iVal); char* blob = reinterpret_cast(&(((cIntersectingGroupCache*)(sqlite3_value_int(argv[2])))->Column[0])); sqlite3_result_blob(context,blob,((cIntersectingGroupCache*)(sqlite3_value_int(argv[2])))->Column.size()*sizeof(int),NULL); break; } default: { break; } } } Thank you for your help. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Is it possible to optimize this query on a very large database table Oct. 10, 2011 13:53:01 GMT
Igor Tandetnik, >>> Explain the problem you are trying to solve, *not* your proposed solution. >>> <<< What we are trying to achieve is to to find the minimum row id for each unique Field Name in BLobLastNameTest where many rows can have the same FIELDNAME but distinct BLOBS(Vertices Column). Once we know the first/minimum row id of each unique fieldname group, we would like to write a sqlite UPDATE Statement to accumulate all the BLOB vertices of all the rows with that unique fieldname into the first(i.e MIN(ROWID)) row's BLOB(Vertices)column for each unique fieldname group. The reason we can't use select FieldName, rowid from BlobLastNameTest is that it would slow our C++ Windows/LINUx/Solaris UNIX worker threads so much that it wouldn;t be worth multithreading this whole process. We were hoping that the SQLITE query processor and the appropriate indexes could accomplish these previous 2 steps in a few minutes for a reasonable number of BLOBLASTNAMETEST rows. Thank you for all of your help. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Is it possible to optimize this query on a very large database table.
Frank Chang wrote: > Igor Tandetnik, > The fieldname groups in our BlobLastNameTable consist of > multiple rows where each pair of columns [FieldName, > BLOB[Vertices]] is unique. How so? You have FieldName declared as PRIMARY KEY. From your original post: CREATE TABLE [BlobLastNameTest] ([FieldName] CHAR (25) PRIMARY KEY, [Vertices] BLOB ) PRIMARY KEY means there can only be one row with any given value of FieldName. If you have since changed your schema, then post the new CREATE TABLE statement for BlobLastNameTest table, and any CREATE INDEX statements related to it. > Therefore, every fieldname group does not just have a single row but instead > 1000's or > 1's rows. So that is why we use a group by/order by and > subselect clause to locate the first/minimum row id row > in each fieldname group. Assuming this is true, what's wrong with select FieldName, min(rowid) from BlobLastNameTest group by FieldName; > Once we know the first/minimum row id of each unique fieldname group, we > would lke to write a > sqlite UPDATE Statement to accumulate all the BLOB vertices of all the rows > with that unique fieldname into the first(i.e > MIN(ROWID)) row's BLOB(Vertices)column for each unique fieldname group. What do you mean by "accumulate"? Concatenate? I don't think you can do that with SQL alone - you'll have to write some code. Personally, I'd do something like this (in pseudocode): stmt = prepare("select FieldName, rowid, Vertices from BlobLastNameTest order by FieldName, rowid") currentFieldName = ""; firstRowId = -1; blob = "" while (stmt.Step) { if (currentFieldName != stmt.FieldName) { commitBlob(firstRowId, blob) currentFieldName = stmt.FieldName firstRowId = stmt.rowid blob = "" } blob += stmt.Vertices // whatever you mean by "accumulate", do it here } commitBlob(firstRowId, blob) function commitBlob(rowid, blob) { if (rowid > 0) { execute "update BlobLastNameTest set Vertices = ? where rowid = ?;" with parameters (blob, rowid) } } > Then we would like to discard all the rows in each > fieldname group of rows that have an rowid different from the first row That one's easy: delete from BlobLastNameTest where rowid != (select min(rowid) from BlobLastNameTest t2 where t2.FieldName = BlobLastNameTest.FieldName); -- Igor Tandetnik ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Is it possible to optimize this query on a very large database table.
Igor Tandetnik, The fieldname groups in our BlobLastNameTable consist of multiple rows where each pair of columns [FieldName, BLOB[Vertices]] is unique. Therefore, every fieldname group does not just have a single row but instead 1000's or 1's rows. So that is why we use a group by/order by and subselect clause to locate the first/minimum row id row in each fieldname group. Once we know the first/minimum row id of each unique fieldname group, we would lke to write a sqlite UPDATE Statement to accumulate all the BLOB vertices of all the rows with that unique fieldname into the first(i.e MIN(ROWID)) row's BLOB(Vertices)column for each unique fieldname group. Then we would like to discard all the rows in each fieldname group of rows that have an rowid different from the first row(i.e MIN(rowid)). Because we using a C++ WINDOWS/LINUX/Solaris UNIX multithreaded program where each concurrent worker thread has its own sqlite database and sqlite table and sqlite index , we would like each concurrence worker thread to run as fast as possible on a multicore CPU. We have profiled the worker threads and we have found that the sqlite statements are the bottleneck. So, that is why we would like our queries to run as fast as possible by avoiding full index scans as Florian Weimer pointed out in the post about Postgres' DISTINCT ON feature which we are trying to simulate on Sqlite. I hope I have provided you more information. I did try your suggestion: select FieldName, rowid from BlobLastNameTest. However, it generates 5.5 million rows of output which would force our C++ Worker threads to do the BLOB(Vertices) aggregation. Unfortunately , that would would take too long so we were hoping that we could devise a nice query to let the SQLite query processor do all the Blob(Vertices) aggregation in a few minutes or less . Please let me know if you have any questions or suggestions. Thank you for your help. -- -- I take it back - the behavior of your query is well-defined, for the simple reason that FieldName is unique, so every group only has a single row in it (as someone else has kindly pointed out - I missed this detail on the first reading). For that same reason, the whole GROUP BY and sub-select dance is completely pointless. Your query is just a very elaborate and wasteful way to write select FieldName, rowid from BlobLastNameTest; ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users